Grep document according to values

owwow14 · February 19, 2014, 6:44am

Hi,
I have the following data that is 3-col, tab separated and looks something like this:

inscription	1	1
ionosphere	0	0
magnate	0	1
majesty	1	0
meritocracy	0	0
monarchy	0	0
monkey	1	0
notepaper	1	1

The first column of the data is an ID, the second column of the data is a prediction score and the third column is the actual score.

I want to build a confusion matrix from this data. A line where columns 2 and 3 are 1 and 1 counts as "TP" (true positive); where they are 0 and 0, as "TN" (true negative); where column 2 is 1 and column 3 is 0, as "FP" (false positive); and where column 2 is 0 and column 3 is 1, as "FN" (false negative).

For the data above, the result would be as follows:

TP 2
TN 3
FN 1
FP 2

Is there a grep that can help me to achieve this result?
Thank you very much!

bakunin · February 19, 2014, 7:03am

No, there isn't: grep is for filtering lines according to some rules, usually a regexp. What grep can do is: return all lines which exhibit a certain pattern. What grep cannot do: summarize content.

Fortunately, there are other text-processing tools that can deliver what you want. In the following script, replace <t> with a literal tab. Note that the script is bare-bones: no effort is spent on runtime security, error detection, and so on.

#! /bin/ksh

typeset -i iTP=0
typeset -i iTN=0
typeset -i iFP=0
typeset -i iFN=0
typeset    chTitle=""
typeset -i iPred=0
typeset -i iReal=0

typeset    fIn="/path/to/your/input.file"

while IFS='<t>' read chTitle iPred iReal ; do
     if [ $iPred -eq 0 ] ; then
          if [ $iReal -eq 0 ] ; then
               (( iTN += 1 ))
          else
               (( iFN += 1 ))
          fi
     else
          if [ $iReal -eq 0 ] ; then
               (( iFP += 1 ))
          else
               (( iTP += 1 ))
          fi
     fi
done < "$fIn"

print - "True Positives : $iTP"
print - "True Negatives : $iTN"
print - "False Positives: $iFP"
print - "False Negatives: $iFN"

exit 0
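
The <t> placeholder above has to be typed as a real tab character, which is easy to get wrong in an editor. The following is a minimal sketch of the same counting loop for a plain POSIX shell, assuming the input holds only 0 and 1 in columns 2 and 3. The function name count_cells is hypothetical and not part of the script above; the tab is built with printf instead of being typed.

```shell
#!/bin/sh
# Sketch only: count_cells is a hypothetical helper, not part of the
# original script. It takes the path of a 3-column, tab-separated file.
count_cells() {
    tab=$(printf '\t')      # a literal tab, built without typing one
    tp=0 tn=0 fp=0 fn=0
    # -r keeps backslashes in the ID column from being interpreted
    while IFS="$tab" read -r title pred real; do
        # Concatenate prediction and actual value and classify the pair;
        # lines that match none of the four patterns are ignored.
        case "$pred$real" in
            11) tp=$((tp + 1)) ;;
            00) tn=$((tn + 1)) ;;
            10) fp=$((fp + 1)) ;;
            01) fn=$((fn + 1)) ;;
        esac
    done < "$1"
    printf 'TP %d\nTN %d\nFN %d\nFP %d\n' "$tp" "$tn" "$fn" "$fp"
}
```

It would be called as count_cells /path/to/your/input.file and prints the four counters in the order used in the question.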

I hope this helps.

bakunin

Lucas_0418 · February 19, 2014, 9:31am

If you don't mind running grep four times, this might help:

TP=$(grep -c '[[:blank:]]1[[:blank:]]\{1,\}1$' infile)
TN=$(grep -c '[[:blank:]]0[[:blank:]]\{1,\}0$' infile)
FN=$(grep -c '[[:blank:]]0[[:blank:]]\{1,\}1$' infile)
FP=$(grep -c '[[:blank:]]1[[:blank:]]\{1,\}0$' infile)
echo "TP $TP"; echo "TN $TN"; echo "FN $FN"; echo "FP $FP"
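
If reading the file four times is a concern, the pairs can also be tallied in one pass with standard tools. This is a different technique from the grep calls above, sketched under the assumption that the file is strictly tab-separated with three columns; tally_pairs is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: tally_pairs is a hypothetical helper.
# cut keeps columns 2 and 3 (tab is cut's default delimiter),
# sort groups identical pairs, and uniq -c prefixes each distinct
# pair with the number of times it occurs.
tally_pairs() {
    cut -f2,3 "$1" | sort | uniq -c
}
```

Each output line is a count followed by the prediction/actual pair, so "1 1" is TP, "0 0" is TN, "1 0" is FP and "0 1" is FN. A pair that never occurs produces no line at all rather than a zero.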

Yoda · February 19, 2014, 5:09pm

An awk approach:

awk -F'\t' '
        {
                TP += ( $2 == 1 && $3 == 1 ) ? 1 : 0
                TN += ( $2 == 0 && $3 == 0 ) ? 1 : 0
                FP += ( $2 == 1 && $3 == 0 ) ? 1 : 0
                FN += ( $2 == 0 && $3 == 1 ) ? 1 : 0
        }
        END {
                print "TP", TP
                print "TN", TN
                print "FP", FP
                print "FN", FN
        }
' file

ahamed101 · February 23, 2014, 9:16pm

Another way...

awk '{a[$2,$3]++} END { printf "TP:%d\nTN:%d\nFP:%d\nFN:%d\n",a[1,1],a[0,0],a[1,0],a[0,1] }' infile
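
The one-liner works because awk joins the two subscripts in a[$2,$3] into a single array key, so each of the four possible pairs gets its own counter. A minimal sketch of the same idea wrapped in a shell function follows; the name confusion_counts is hypothetical, and -F'\t' is added on the assumption that the ID column might contain spaces.

```shell
#!/bin/sh
# Sketch only: confusion_counts is a hypothetical wrapper name.
confusion_counts() {
    awk -F'\t' '
        # One counter per (prediction, actual) pair; awk builds the
        # key by joining $2 and $3 with its SUBSEP separator.
        { a[$2,$3]++ }
        END {
            # Pairs that never occurred are empty and print as 0 with %d.
            printf "TP %d\nTN %d\nFN %d\nFP %d\n", a[1,1], a[0,0], a[0,1], a[1,0]
        }
    ' "$1"
}
```

Run against the eight sample lines from the question, confusion_counts infile prints TP 2, TN 3, FN 1 and FP 2, one per line.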

--ahamed