Need awk to extract lines and sort

Hi,
My data looks like this.

 CHR                           SNP         BP   A1       TEST    NMISS         OR         STAT            P
   0                 SNP_A-8282315          0    2        ADD     1530      1.074       0.7707       0.4409
   0                 SNP_A-8282315          0    2       COV1     1530  1.771e+04        4.764    1.898e-06
   0                 SNP_A-8282315          0    2       COV2     1530  1.513e+04        4.645    3.402e-06
   0                 SNP_A-8282315          0    2       COV3     1530      14.16        1.306       0.1915
   0                 SNP_A-8282315          0    2       COV4     1530      1.264       0.1139       0.9093
   0                 SNP_A-8282315          0    2       COV5     1530      2.389       0.4268       0.6695
   0                 SNP_A-8338258          0    4        ADD     1528     0.9498      -0.6824        0.495
   0                 SNP_A-8338258          0    4       COV1     1528  1.846e+04        4.777    1.783e-06
   0                 SNP_A-8338258          0    4       COV2     1528  1.374e+04          4.6    4.224e-06
   0                 SNP_A-8338258          0    4       COV3     1528      14.82         1.33       0.1836
   0                 SNP_A-8338258          0    4       COV4     1528      1.251       0.1087       0.9134
   0                 SNP_A-8338258          0    4       COV5     1528      2.431       0.4354       0.6633

.
.

  1. I want to extract only lines with "ADD" in col. 5 (TEST).
  2. I want to sort ascending on Col 9 (P). This column as shown above has both regular and scientific formats.
  3. How can I keep the headers of the file intact in the outfile?

What I tried:

gawk '{if(NR>1 && $5="ADD") {print $0}}' infile>outfile

Everything in col5 is changed to ADD. :frowning:

That's because you assigned the value "ADD" to the fifth field ($5). You will have to compare the fifth field ($5) with "ADD" instead.

=> Operator for assignment is "=". It changes the value.
=> Operator for comparison is "==". It doesn't change the value.

$
$ cat f1
 CHR                           SNP         BP   A1       TEST    NMISS         OR         STAT            P
   0                 SNP_A-8282315          0    2        ADD     1530      1.074       0.7707       0.4409
   0                 SNP_A-8282315          0    2       COV1     1530  1.771e+04        4.764    1.898e-06
   0                 SNP_A-8282315          0    2       COV2     1530  1.513e+04        4.645    3.402e-06
   0                 SNP_A-8282315          0    2       COV3     1530      14.16        1.306       0.1915
   0                 SNP_A-8282315          0    2       COV4     1530      1.264       0.1139       0.9093
   0                 SNP_A-8282315          0    2       COV5     1530      2.389       0.4268       0.6695
   0                 SNP_A-8338258          0    4        ADD     1528     0.9498      -0.6824        0.495
   0                 SNP_A-8338258          0    4       COV1     1528  1.846e+04        4.777    1.783e-06
   0                 SNP_A-8338258          0    4       COV2     1528  1.374e+04          4.6    4.224e-06
   0                 SNP_A-8338258          0    4       COV3     1528      14.82         1.33       0.1836
   0                 SNP_A-8338258          0    4       COV4     1528      1.251       0.1087       0.9134
   0                 SNP_A-8338258          0    4       COV5     1528      2.431       0.4354       0.6633
$
$ gawk '{if(NR>1 && $5=="ADD") {print $0}}' f1
   0                 SNP_A-8282315          0    2        ADD     1530      1.074       0.7707       0.4409
   0                 SNP_A-8338258          0    4        ADD     1528     0.9498      -0.6824        0.495
$
$

Rookie programmer mistake. :wink:

In fact, since printing the line is the default action of awk and its variants, you could slim down your one-liner to:

gawk '$5=="ADD"' <filename>

tyler_durden

Hey! Thanks for pointing it out.
I corrected it and worked!
Any ideas on how to sort that mixed data type col9.?
I also want to keep that header intact. But cant figure why NR>1 doesnt work it to keep it.
Thanks

Hi,

Try this....

awk '{if(NR==1) {print $0}};{if(NR>1 && $5=="ADD") {print $0}}' test_forum | sort -nk9

And a bit shorter version:

awk 'NR==1 || $5=="ADD"' filename | sort -nk9

tyler_durden

OP requires to sort on both regular and scientific format. The g modifier would then be more appropriate.

 ... | sort -k9,9g

Hi,
That worked!
Thanks to durden_tyler, pravin27 and ripat.
Once again, thanks for helping me out. I am not a programmer, but a biologist, kind of masquerading as a bioinformatician; purely due to necessity rather than choice.
This forum has been of so much use to me.

awk 'NR==1 || $5=="ADD"' 

Can you please tell me what the two pipes do?
Is that the (OR) operator?

These are not pipes but the logical operator for OR. In pseudo code this would mean:

if line number equal 1 OR 5th field equal "ADD" ...

Yep,

NR==1

means the first record

$5=="ADD

Means if field five equals "ADD"

awk '$5=="ADD"' a.txt | sort -n +8

Genehunter, i might be wrong but when i see your input file, it seems you are getting the data from the database (tables). If so, then all your requirements can be done in the query which fetches the data from the tables itself...

Actually they are not from a db. These are output files from genetic analysis. But thaks for that thought. Perhaps I should try a database to get around some of these quirks.:confused: but awk has been awesome...! :slight_smile: