Dear Unix experts,
I have a file in which I would like to compare the values in the second column whenever the first column is the same, and keep only the rows whose values differ by more than 50. If the difference is not more than 50, I would like to discard both values.
For example, my input file looks like this:
comp275_c0_seq2 73
comp275_c0_seq2 76
comp275_c0_seq2 85
comp275_c0_seq2 105
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
Here, for the first group, the differences in column 2 between rows 'two and one', 'three and two', and 'four and three' are all less than 50. So, I would like to discard those entries.
For the second group, the differences between rows 'six and five' and 'seven and six' are greater than 50, so I would like to keep them. So, my desired output will be:
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
Any help you can provide is highly appreciated.
Many thanks
Is this a homework assignment?
What have you tried?
Sorry I didn't post my code earlier.
awk 'NR % 2 != 0 {x=$1; y=$2} NR % 2 == 0 {if ($2 - y > 50){print x,y}}' test1
where test1 is the input file I described above.
It only gives me
comp569_c0_seq1 117
as output.
I repeat: Is this a homework assignment?
No, it's not. I am trying to analyse a large dataset and am really new to shell programming.
If I understand your requirements correctly, you could try something like:
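awk '
{   # New group, or more than 50 above the previous value?
    if($1 != l1 || $2 - l2 > 50) {
        # Print the previous line if it was also clear of its predecessor.
        if(lp) print l1, l2
        lp = 1
    } else lp = 0
    l1 = $1
    l2 = $2
}
END {   if(lp) print l1, l2
}' input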
If you want to run this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of the default /usr/bin/awk.
Your description and sample data didn't indicate what should happen if there is only one line for a given field 1 value. This script will print those lines because there is no other line with that field 1 value that has a value in field 2 that is within 50 points of it. For example, with the input file:
comp275_c0_seq2 73
comp275_c0_seq2 76
comp275_c0_seq2 85
comp275_c0_seq2 105
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
added_set_1 1
added_set_1 100
added_set_1 150
added_set_1 200
added_set_1 251
added_set_1 302
added_set_1 322
added_single 1
the output produced is:
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
added_set_1 1
added_set_1 251
added_single 1
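For anyone who wants to tune the cutoff, here is a minimal sketch of the same logic with the threshold passed in as a variable. The names keep_spaced, min, far, pending, and sample.txt are made up for illustration and are not part of the script above. Like that script, it assumes that lines with the same field 1 value are adjacent and that field 2 is ascending within each group.

```shell
# Sketch only: keep_spaced, min, far, pending and sample.txt are illustrative names.
keep_spaced() {
    # $1 = input file, $2 = minimum gap (defaults to 50 if omitted)
    awk -v min="${2:-50}" '
    {
        # far is 1 when this line starts a new group or is more than
        # min above the previous line of the same group.
        far = ($1 != key || $2 - val > min + 0)
        # The previous line is printed only if it was far from its own
        # predecessor (pending) and this line is far from it as well.
        if (far && pending) print key, val
        pending = far
        key = $1
        val = $2
    }
    END { if (pending) print key, val }' "$1"
}

# Recreate the seven-line sample from the first post.
cat > sample.txt <<'EOF'
comp275_c0_seq2 73
comp275_c0_seq2 76
comp275_c0_seq2 85
comp275_c0_seq2 105
comp569_c0_seq1 117
comp569_c0_seq1 208
comp569_c0_seq1 328
EOF

keep_spaced sample.txt 50
```

With the seven-line sample this should print the three comp569_c0_seq1 lines. Put another way, a line survives only when it is more than min away from both of its neighbours in the same group. On Solaris/SunOS the -v option needs nawk or one of the xpg awks.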
Hello,
The following may also help.
awk '{if(f == $1){c=$2 - a} else {c=0}} {f=$1;a=$2} {if(c>50) print $0 OFS "("c")"}' get_differ_check121132
Output will be as follows.
comp569_c0_seq1 208 (91)
comp569_c0_seq1 328 (120)
NOTE: get_differ_check121132 is the input file name.
Thanks,
R. Singh