The first column of the data is an ID, the second column is the predicted score, and the third column is the actual score.
I want to build a confusion matrix from this data: rows where columns 2 and 3 are both 1 count as "TP" (true positive), rows where both are 0 count as "TN" (true negative), rows with a 1 in column 2 and a 0 in column 3 count as "FP" (false positive), and rows with a 0 in column 2 and a 1 in column 3 count as "FN" (false negative).
Considering the above data, the result would be as follows:
TP 2
TN 3
FN 1
FP 2
Is there a grep command that can help me achieve this result?
Thank you very much!
No, there isn't: grep is for filtering lines according to some rules, usually a regexp. What grep can do is: return all lines which exhibit a certain pattern. What grep cannot do: summarize content.
Fortunately, there are other means of text processing that can deliver what you want. In the following script, replace <t> with a literal tab. Note that the script is bare-bones: no effort is spent on runtime security, error detection, etc.
#! /bin/ksh

# counters for the four cells of the confusion matrix
typeset -i iTP=0
typeset -i iTN=0
typeset -i iFP=0
typeset -i iFN=0

# fields of the current input line: ID, prediction, actual value
typeset chTitle=""
typeset -i iPred=0
typeset -i iReal=0

typeset fIn="/path/to/your/input.file"

# read the file line by line, splitting each line at tabs
while IFS='<t>' read chTitle iPred iReal ; do
     if [ $iPred -eq 0 ] ; then
          # predicted negative
          if [ $iReal -eq 0 ] ; then
               (( iTN += 1 ))
          else
               (( iFN += 1 ))
          fi
     else
          # predicted positive (any non-zero value)
          if [ $iReal -eq 0 ] ; then
               (( iFP += 1 ))
          else
               (( iTP += 1 ))
          fi
     fi
done < "$fIn"

print - "True Positives : $iTP"
print - "True Negatives : $iTN"
print - "False Positives: $iFP"
print - "False Negatives: $iFN"

exit 0
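
The same counting can also be sketched with awk, which splits each line into fields on its own. This is only a minimal sketch under a few assumptions: awk is available, the columns are separated by tabs or spaces, and the function name confusion_counts as well as the variable names are made up for illustration. Unlike the ksh script above, it counts only exact 0 and 1 values, so a header line or any other value is silently skipped.

```shell
#!/bin/sh
# confusion_counts is a hypothetical helper name, not part of the script above.
# Input: lines of "ID prediction actual", separated by tabs or spaces.
# Reads the files given as arguments, or standard input if there are none.
confusion_counts() {
    awk '
        BEGIN { ntp = 0; ntn = 0; nfp = 0; nfn = 0 }
        $2 == 1 && $3 == 1 { ntp++ }   # predicted 1, actual 1
        $2 == 0 && $3 == 0 { ntn++ }   # predicted 0, actual 0
        $2 == 1 && $3 == 0 { nfp++ }   # predicted 1, actual 0
        $2 == 0 && $3 == 1 { nfn++ }   # predicted 0, actual 1
        END {
            print "TP", ntp
            print "TN", ntn
            print "FN", nfn
            print "FP", nfp
        }
    ' "$@"
}
```

Calling confusion_counts /path/to/your/input.file should then print the four counts in the order used in the question (TP, TN, FN, FP).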