awk : extracting unique lines based on columns

genehunter · April 30, 2010, 11:23pm

Hi,

snp.txt

CHR_A   SNP_A           BP_A_st         BP_A_End        CHR_B   BP_B            SNP_B           R2              p-SNP_A         p-SNP_B
5       rs1988728       74904317        74904318        5       74960646        rs1427924       0.377333        0.000740085     0.013930081
5       rs1988728       74904317        74904318        5       74960918        rs9293656       0.370860        0.000740085     0.00939958
1       rs268955        30166376        30166377        1       30286312        rs12145453      0.015673        0.000740425     0.008207172
1       rs268955        30166376        30166377        1       30289115        rs12142520      0.0120884       0.000740425     0.045320982
19      rs6510185       36070251        36070252        19      36263387        rs11673246      0.0105482       0.000740565     0.034650246
19      rs6510185       36070251        36070252        19      36115734        rs17571341      0.00406461      0.000740565     0.015351578
19      rs6510185       36070251        36070252        19      36267571        rs11880163      0.00040869      0.000740565     0.016354903
5       rs5744566       74866563        74866564        5       74913022        rs3213801       0.385063        0.000740641     0.018259986
5       rs5744566       74866563        74866564        5       74955165        rs6861279       0.380825        0.000740641     0.014054183

I want to keep only the first line for each value in col 2, so that col 2 is unique.
Desired output:

CHR_A   SNP_A           BP_A_st         BP_A_End        CHR_B   BP_B            SNP_B           R2              p-SNP_A         p-SNP_B
5       rs1988728       74904317        74904318        5       74960646        rs1427924       0.377333        0.000740085     0.013930081
1       rs268955        30166376        30166377        1       30286312        rs12145453      0.015673        0.000740425     0.008207172
19      rs6510185       36070251        36070252        19      36263387        rs11673246      0.0105482       0.000740565     0.034650246
5       rs5744566       74866563        74866564        5       74913022        rs3213801       0.385063        0.000740641     0.018259986

Would like to get a solution with awk or shell.
Thanks

itkamaraj · May 1, 2010, 12:00am

check whether it works for you

for i in `cat uniq.txt | awk '{print $2}' |sort| uniq`;do grep -m 1 $i uniq.txt;done | sort -r

pseudocoder · May 1, 2010, 1:18am

$ awk 'a !~ $2; {a=$2}' snp.txt

itkamaraj · May 1, 2010, 1:22am

Please explain the logic.

What does a !~ $2 do?

thanks
kamaraj

genehunter · May 1, 2010, 1:34am

I am sorry, but that did not work.
When I tried the example above, both commands gave me the correct output.
But when I applied them to my actual data file of 289727 lines:
itkamaraj's code gave 56446 lines
pseudocoder's code gave 57747 lines.

itkamaraj · May 1, 2010, 1:46am

Does my solution work?

genehunter · May 1, 2010, 1:50am

Please see my post above.

pseudocoder · May 1, 2010, 2:25am

Nope
I found it on the net some time ago; it simply emulates uniq, applied to the 2nd column.
Create a test file with the following content and play with it. Try $0, $1, and $2 and see the difference:

abc 000
abc 123
abc 123
sdf 234
fds 234
jkl 999
lkj 222
lkj 222
qwe 443
rew 323
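
To make the behaviour on this test file concrete, here is a small sketch (the scratch file in `$tmp` and the two extra rows at the end are made up for illustration). It counts the lines that survive for `$2`, `$1`, and `$0`. It also shows one caveat: `!~` is a regular-expression match, not a string comparison, so a value that is a substring of the previous value is dropped as well. A plain `!=` comparison does not have that problem.

```shell
# Hypothetical scratch file holding the test data listed above.
tmp=$(mktemp)
printf '%s\n' 'abc 000' 'abc 123' 'abc 123' 'sdf 234' 'fds 234' \
              'jkl 999' 'lkj 222' 'lkj 222' 'qwe 443' 'rew 323' > "$tmp"

# A line is printed only when the chosen field does not match
# the same field of the previous line.
n_col2=$(awk 'a !~ $2; {a=$2}' "$tmp" | awk 'END {print NR}')  # repeated 123, 234, 222 dropped
n_col1=$(awk 'a !~ $1; {a=$1}' "$tmp" | awk 'END {print NR}')  # repeated abc, abc, lkj dropped
n_line=$(awk 'a !~ $0; {a=$0}' "$tmp" | awk 'END {print NR}')  # only identical neighbours dropped
echo "col2=$n_col2 col1=$n_col1 line=$n_line"

# Caveat (made-up rows): rs1234 is a substring of rs12345, so the regex
# version treats the second row as a repeat; the != version keeps it.
regex_out=$(printf '%s\n' 'x rs12345' 'y rs1234' | awk 'a !~ $2; {a=$2}')
exact_out=$(printf '%s\n' 'x rs12345' 'y rs1234' | awk 'NR == 1 || $2 != a; {a=$2}')

rm -f "$tmp"
```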

genehunter,
how do you know that it didn't work? Just because our results differ?
I guess that without the original file it will be hard to find the error.
You could copy, e.g., the first 500 lines of the data file into a temp file and see whether the results still differ.
Maybe you can then attach that temp file to your post, so we can all examine it and try to find the error (if there is any ;))

malcomex999 · May 1, 2010, 2:33am

Try this...

awk '!arr[$2]++' infile
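
For readers unfamiliar with this idiom, a rough expansion may help. For a value not seen before, `arr[$2]` is empty (treated as 0), so `!arr[$2]` is true and the line is printed; the `++` then marks the value as seen. The sketch below uses made-up sample rows and a hypothetical array name `seen`; both commands should print the same lines.

```shell
# Made-up sample: the key in column 2 repeats, and not on adjacent lines.
sample=$(printf '%s\n' '5 rs100 A' '1 rs200 B' '5 rs100 C' '19 rs300 D' '1 rs200 E')

# Compact idiom from the post above.
short=$(printf '%s\n' "$sample" | awk '!arr[$2]++')

# Spelled-out equivalent: print the first line seen for each column-2 value.
long=$(printf '%s\n' "$sample" | awk '{ if (!($2 in seen)) print; seen[$2] = 1 }')

printf '%s\n' "$short"
```

Input order is preserved and no sorting is needed, at the cost of keeping one array entry per distinct key in memory.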

pseudocoder · May 1, 2010, 2:41am

That simply means, just like "my" suggestion, print if not already in array, right?

Scrutinizer · May 1, 2010, 3:13am

Solution #2 (pseudocoder's awk one-liner) implicitly assumes that identical values in column 2 are adjacent, which is possibly not the case in the real file. That would explain the higher line count. This can be checked by sorting first, e.g.:

sort -k2,2 snp.txt | awk 'a !~ $2; {a=$2}'
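
A small made-up example shows the effect. The rows below are hypothetical: column 2 holds three distinct IDs, but the repeats are scattered, so the adjacent-only filter removes nothing until the input is sorted.

```shell
# Hypothetical rows: three distinct IDs in column 2, repeats not adjacent.
rows=$(printf '%s\n' '5 rs100 A' '1 rs200 B' '5 rs100 C' \
                     '19 rs300 D' '1 rs200 E' '5 rs100 F')

# Adjacent-only comparison on unsorted input: nothing is filtered here.
n_unsorted=$(printf '%s\n' "$rows" | awk 'a !~ $2; {a=$2}' | awk 'END {print NR}')

# Same filter after sorting on column 2: repeats are now adjacent.
n_sorted=$(printf '%s\n' "$rows" | sort -k2,2 | awk 'a !~ $2; {a=$2}' | awk 'END {print NR}')

# Number of distinct column-2 values, for reference.
n_distinct=$(printf '%s\n' "$rows" | awk '{print $2}' | sort -u | awk 'END {print NR}')

echo "unsorted=$n_unsorted sorted=$n_sorted distinct=$n_distinct"
```

Note that sorting changes the line order and moves the header line along with the data, so on the real file the header would need to be handled separately.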

ahmad.diab · May 1, 2010, 8:50am

another solution:-

nawk '! _[$2]++' infile.txt

The above solution doesn't need a pipe or sed, and it works even if the rows with the same 2nd column are not adjacent.

BR

---------- Post updated at 15:50 ---------- Previous update was at 15:30 ----------

The same can be done in Perl (with -a, @F is zero-indexed, so the 2nd column is $F[1]):

perl -lane 'print if ! $h{$F[1]}++' snp.txt

genehunter · May 1, 2010, 1:43pm

Upon printing just col 2 from each result file and running it through sort and uniq, I found that both malcomex999's and pseudocoder's code give the same results.
I then looked at the first few lines of col 2 from the results (without uniq) and found that kamaraj's code shows duplicates.
ahmad.diab, your code gives the same results as malcomex999's and pseudocoder's. Thanks!

Final_Apr30 :~>head kamarajtestsnpuniq
SNP_A
rs996312
rs9942844
rs9942844
rs990327
rs988961
rs987824
rs976263
rs976263
rs976240
Final_Apr30 :~>head malcomxsnpuniq
SNP_A
rs7259854
rs2981575
rs11150978
rs11200014
rs2981579
rs1078806
rs1219648
rs6590505
rs6590504
Final_Apr30 :~>head pesudocodersnpuniq
SNP_A
rs7259854
rs2981575
rs11150978
rs11200014
rs2981579
rs1078806
rs1219648
rs6590505
rs6590504

line count
    56446 malcomx
    56446 malcomxsnpuniq
    57747 pesudocoder
    57747 pesudocodersnpuniq
    56446 kamarajtest
    24657 kamarajtestuniq
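
One way to check a result file for leftover duplicates is to list the column-2 values that occur more than once. The sketch below uses made-up rows and also illustrates one possible reason (an assumption, not confirmed in this thread) why a `grep` loop could return the same line twice: an unanchored pattern such as `rs123` also matches a line containing `rs1234`.

```shell
# Made-up data: one ID is a prefix of another.
data=$(mktemp)
printf '%s\n' '1 rs1234 x' '1 rs123 y' > "$data"

# Loop in the style of the earlier grep suggestion: substring matches count.
loop_out=$(for id in $(awk '{print $2}' "$data" | sort -u); do
               grep -m 1 "$id" "$data"
           done)

# Column-2 values that appear more than once in the loop output.
dups=$(printf '%s\n' "$loop_out" | awk '{print $2}' | sort | uniq -d)
echo "duplicated keys: $dups"

# Exact comparison on column 2 instead of a substring match.
exact_out=$(for id in $(awk '{print $2}' "$data" | sort -u); do
                awk -v id="$id" '$2 == id { print; exit }' "$data"
            done)
exact_dups=$(printf '%s\n' "$exact_out" | awk '{print $2}' | sort | uniq -d)

rm -f "$data"
```

An empty result from the `uniq -d` step means every column-2 value in that output is unique.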

Hope this helps clarify.
Thank you all for your help and patience.
~GH
"Stand Up to Cancer!"