Hi,
My data looks like this.
CHR SNP BP A1 TEST NMISS OR STAT P
0 SNP_A-8282315 0 2 ADD 1530 1.074 0.7707 0.4409
0 SNP_A-8282315 0 2 COV1 1530 1.771e+04 4.764 1.898e-06
0 SNP_A-8282315 0 2 COV2 1530 1.513e+04 4.645 3.402e-06
0 SNP_A-8282315 0 2 COV3 1530 14.16 1.306 0.1915
0 SNP_A-8282315 0 2 COV4 1530 1.264 0.1139 0.9093
0 SNP_A-8282315 0 2 COV5 1530 2.389 0.4268 0.6695
0 SNP_A-8338258 0 4 ADD 1528 0.9498 -0.6824 0.495
0 SNP_A-8338258 0 4 COV1 1528 1.846e+04 4.777 1.783e-06
0 SNP_A-8338258 0 4 COV2 1528 1.374e+04 4.6 4.224e-06
0 SNP_A-8338258 0 4 COV3 1528 14.82 1.33 0.1836
0 SNP_A-8338258 0 4 COV4 1528 1.251 0.1087 0.9134
0 SNP_A-8338258 0 4 COV5 1528 2.431 0.4354 0.6633
.
.
I want to extract only lines with "ADD" in col. 5 (TEST).
I want to sort ascending on Col 9 (P). This column as shown above has both regular and scientific formats.
How can I keep the headers of the file intact in the outfile?
What I tried:
gawk '{if(NR>1 && $5="ADD") {print $0}}' infile>outfile
Everything in col5 is changed to ADD.
That's because you assigned the value "ADD" to the fifth field ($5). You will have to compare the fifth field ($5) with "ADD" instead.
=> Operator for assignment is "=". It changes the value.
=> Operator for comparison is "==". It doesn't change the value.
$
$ cat f1
CHR SNP BP A1 TEST NMISS OR STAT P
0 SNP_A-8282315 0 2 ADD 1530 1.074 0.7707 0.4409
0 SNP_A-8282315 0 2 COV1 1530 1.771e+04 4.764 1.898e-06
0 SNP_A-8282315 0 2 COV2 1530 1.513e+04 4.645 3.402e-06
0 SNP_A-8282315 0 2 COV3 1530 14.16 1.306 0.1915
0 SNP_A-8282315 0 2 COV4 1530 1.264 0.1139 0.9093
0 SNP_A-8282315 0 2 COV5 1530 2.389 0.4268 0.6695
0 SNP_A-8338258 0 4 ADD 1528 0.9498 -0.6824 0.495
0 SNP_A-8338258 0 4 COV1 1528 1.846e+04 4.777 1.783e-06
0 SNP_A-8338258 0 4 COV2 1528 1.374e+04 4.6 4.224e-06
0 SNP_A-8338258 0 4 COV3 1528 14.82 1.33 0.1836
0 SNP_A-8338258 0 4 COV4 1528 1.251 0.1087 0.9134
0 SNP_A-8338258 0 4 COV5 1528 2.431 0.4354 0.6633
$
$ gawk '{if(NR>1 && $5=="ADD") {print $0}}' f1
0 SNP_A-8282315 0 2 ADD 1530 1.074 0.7707 0.4409
0 SNP_A-8338258 0 4 ADD 1528 0.9498 -0.6824 0.495
$
$
Rookie programmer mistake.
In fact, since printing the line is the default action of awk and its variants, you could slim down your one-liner to:
gawk '$5=="ADD"' <filename>
tyler_durden
Hey! Thanks for pointing it out.
I corrected it and worked!
Any ideas on how to sort that mixed data type col9.?
I also want to keep that header intact. But cant figure why NR>1 doesnt work it to keep it.
Thanks
Hi,
Try this....
awk '{if(NR==1) {print $0}};{if(NR>1 && $5=="ADD") {print $0}}' test_forum | sort -nk9
And a bit shorter version:
awk 'NR==1 || $5=="ADD"' filename | sort -nk9
tyler_durden
ripat
October 11, 2009, 3:19am
6
OP requires to sort on both regular and scientific format. The g modifier would then be more appropriate.
... | sort -k9,9g
Hi,
That worked!
Thanks to durden_tyler, pravin27 and ripat.
Once again, thanks for helping me out. I am not a programmer, but a biologist, kind of masquerading as a bioinformatician; purely due to necessity rather than choice.
This forum has been of so much use to me.
awk 'NR==1 || $5=="ADD"'
Can you please tell me what the two pipes do?
Is that the (OR) operator?
ripat
October 11, 2009, 1:55pm
8
genehunter:
Hi,
That worked!
Thanks to durden_tyler, pravin27 and ripat.
Once again, thanks for helping me out. I am not a programmer, but a biologist, kind of masquerading as a bioinformatician; purely due to necessity rather than choice.
This forum has been of so much use to me.
awk 'NR==1 || $5=="ADD"'
Can you please tell me what the two pipes do?
Is that the (OR) operator?
These are not pipes but the logical operator for OR. In pseudo code this would mean:
if line number equal 1 OR 5th field equal "ADD" ...
Yep,
NR==1
means the first record
$5=="ADD
Means if field five equals "ADD"
awk '$5=="ADD"' a.txt | sort -n +8
Genehunter, i might be wrong but when i see your input file, it seems you are getting the data from the database (tables). If so, then all your requirements can be done in the query which fetches the data from the tables itself...
mac4rfree:
Genehunter, i might be wrong but when i see your input file, it seems you are getting the data from the database (tables). If so, then all your requirements can be done in the query which fetches the data from the tables itself...
Actually they are not from a db. These are output files from genetic analysis. But thaks for that thought. Perhaps I should try a database to get around some of these quirks. but awk has been awesome...!