How to compare the values of a column in the same file using awk?

utritala · February 14, 2014, 7:43am

Dear Unix experts,

I have a file in which I would like to compare the values in the second column for rows that share the same value in the first column. I want to keep rows only where the difference between consecutive values is greater than 50; otherwise, I would like to discard both values.

For example, my input file looks like this:

comp275_c0_seq2 73
comp275_c0_seq2 76
comp275_c0_seq2 85
comp275_c0_seq2 105
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328

Here, for the first four rows (comp275_c0_seq2), the differences in column 2 between rows two and one, three and two, and four and three are all less than 50, so I would like to discard those entries.
For the last three rows (comp569_c0_seq1), the differences between rows six and five (91) and rows seven and six (120) are greater than 50, so I would like to keep them. My desired output is therefore:

comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328

Any help you can provide is highly appreciated.

Many thanks

Don_Cragun · February 14, 2014, 7:47am

Is this a homework assignment?

What have you tried?

utritala · February 14, 2014, 7:57am

Sorry I didn't post my code earlier.

awk 'NR % 2 != 0 {x=$1; y=$2} NR % 2 == 0 {if ($2 - y > 50){print x,y}}' test1

where test1 is the input file I described above.

It only gives me

comp569_c0_seq1 117

as output.

Don_Cragun · February 14, 2014, 8:01am

I repeat: Is this a homework assignment?

utritala · February 14, 2014, 8:03am

No, it's not. I am trying to analyse a large dataset and am really new to shell programming.

Don_Cragun · February 14, 2014, 9:17am

If I understand your requirements correctly, you could try something like:

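awk '
# l1, l2: fields 1 and 2 of the previous line
# lp: 1 if the previous line is still a candidate for printing
{       if($1 != l1 || $2 - l2 > 50) {
                # New group, or more than 50 above the previous value:
                # print the previous line if it survived, keep this one for now.
                if(lp) print l1, l2
                lp = 1
        } else  lp = 0  # Too close: discard both the previous line and this one.
        l1 = $1
        l2 = $2
}
END {   if(lp) print l1, l2     # The last line has no successor to reject it.
}' input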

If you want to run this on a Solaris/SunOS system use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of the default /usr/bin/awk .
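
The script compares each line only with the line directly before it, so it relies on the input being grouped by field 1 and sorted in ascending order by field 2, as in the sample. If a file is not already in that order, a sketch along the following lines could sort it first. The function name filter_sorted and the min variable are illustrative and not part of the original script, and sorting will also reorder the groups by name.

```shell
# Hypothetical helper: sort by field 1, then numerically by field 2,
# and apply the same neighbour test with the threshold passed in as min.
filter_sorted() {
    LC_ALL=C sort -k1,1 -k2,2n "$@" |
    awk -v min=50 '
    {   if ($1 != l1 || $2 - l2 > min) {
            if (lp) print l1, l2
            lp = 1
        } else lp = 0
        l1 = $1
        l2 = $2
    }
    END { if (lp) print l1, l2 }'
}
```

It could then be called as filter_sorted input, or used at the end of a pipeline, since with no file arguments sort reads standard input.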

Your description and sample data didn't indicate what should happen if there is only one line for a given field 1 value. This script prints such lines, because no other line with the same field 1 value has a field 2 value within 50 of it. For example, with the input file:

comp275_c0_seq2 73
comp275_c0_seq2 76
comp275_c0_seq2 85
comp275_c0_seq2 105
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
added_set_1 1
added_set_1 100 
added_set_1 150 
added_set_1 200 
added_set_1 251
added_set_1 302 
added_set_1 322 
added_single 1

the output produced is:

comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
added_set_1 1
added_set_1 251
added_single 1
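
If lines whose field 1 value occurs only once should be discarded instead, one possible variation is sketched below. This is an assumption about the requirement, not something stated in the question. The function name filter_multi and the variable names are illustrative, and the input is still assumed to be grouped by field 1 and in ascending order by field 2.

```shell
# Hypothetical variation: like the script above, but a line is printed only
# if its group (same field 1) contains at least one other line.
filter_multi() {
    awk '
    # Print the pending line if no neighbour rejected it (ok) and it had
    # at least one neighbour in its own group (nbr).
    function emit() { if (have && ok && nbr) print key, val }
    {
        same = (have && $1 == key)        # same group as the pending line?
        far  = (same && $2 - val > 50)    # more than 50 above it?
        if (same && !far) ok = 0          # too close: reject the pending line
        if (same) nbr = 1                 # the pending line has a neighbour
        emit()
        ok   = (!same || far)             # is the current line still a candidate?
        nbr  = same
        key  = $1; val = $2; have = 1
    }
    END { emit() }' "$@"
}
```

With the 15-line input file above, this sketch should print the same lines as before except for the final added_single 1 line.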

utritala · February 17, 2014, 4:08am

Thank you for your help.

RavinderSingh13 · April 9, 2014, 3:51am

Hello,

The following may also help.

awk '{c = (f == $1) ? $2 - a : 0} {f=$1;a=$2} {if(c>50) print $0 OFS "("c")"}' get_differ_check121132

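For the sample input from the first post, the output is as follows (the number in parentheses is the difference from the previous row):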

comp569_c0_seq1 208 (91)
comp569_c0_seq1 328 (120)

NOTE: get_differ_check121132 is the input file name.
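
When checking a result by eye, it can help to see the difference for every row, not only for those above the threshold. The following is a minimal sketch along the same lines. The function name show_diffs and the variable names are illustrative, and a "-" marks the first row of each group, which has no previous row to compare with.

```shell
# Hypothetical helper: annotate every row with its difference from the
# previous row in the same group, or "-" when there is no such row.
show_diffs() {
    awk '{ d = (NR > 1 && $1 == prev_key) ? $2 - prev_val : "-"
           print $1, $2, "(" d ")"
           prev_key = $1; prev_val = $2 }' "$@"
}
```

It could be run as show_diffs get_differ_check121132 to list every row with its difference before deciding which rows to keep.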

Thanks,
R. Singh