Hello all, I am not a programmer, but I require a little help with a project I am doing. I did read several posts and looks like awk or python may help me, though I know very little about using them. Here is my question: I have first file with 6 column.
CHR SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346
My second file is very similar to the first one, but it may or may not have the same Column 2(SNP). I suspect that the columns Col 3 and 4 (A1 & A2) may be different as well.
What I require is to get an output file with columns 1,2,3, 4, 5, 6 from the first file and the corresponding line that matches column 2 (SNP) of the FIRST file with the columns 2,3,4,5,6 (SNP, A1, A2, MAF, NCHROBS) from SECOND file at positions 7,8,9,10,11,2. The output file will hence have 11 columns; the first 6 from file 1.txt and the matching last five from file 2.txt
CHR SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
$ cat f1
CHR SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346
$ cat f2
CHR SNP A1 A2 MAF NCHROBS
1 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1273 4546
1 rs6603781 G A 0.0249 4546
Try this:
$ awk 'NR==FNR{k[$2]=sprintf(" %s %s %s %s %s",$2,$3,$4,$5,$6);next}{print $1,$2,$3,$4,$5,$6 k[$2]}' f2 f1
CHR SNP A1 A2 MAF NCHROBS SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
$
$ cat file1
CHR SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346
$
$
$ cat file2
CHR SNP A1 A2 MAF NCHROBS
0 SMP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5349
1 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346
$
$ ##
$ perl -lne 'chomp; if ($.>1) {if($ARGV eq "file1"){$x{substr($_,3)}=substr($_,3)}
> else {print $_,$x{substr($_,3)}}}' file1 file2
CHR SNP A1 A2 MAF NCHROBS
0 SMP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5349
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346 rs6603781 A G 0.1149 5346
$
$
Thanks a lot Ripat and tyler_durden.
The awk command worked like a charm.
Can you help me with one more little detail.
Here is my sample file from the previous step:
1 rs4075116 G A 0.2857 546 rs4075116 C T 0.2646 2732
1 rs11260595 T G 0.02451 612 rs11260595 A C 0.02668 2774
1 rs6604968 C T 0.1672 616 rs6604968 G A 0.137 2810
1 rs11260554 A C 0.09547 618 rs11260554 T G 0.1153 2810
1 rs6603781 G A 0.1234 608 rs6603781 A G 0.1196 2810
I want the awk command to read col 3 and then look for that value in Col 8 and 9 on the same row . If it does not find the value in Col 8 and 9, then write the value of Col 2 to the output file output.txt
I am trying to learn the NR==FNR thingy.. until I grasp that.. kindly help.
This is what I came up with, but not sure if it is correct!
# cat f1
CHR SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362
1 rs2980300 T C 0.1773 5346
1 rs6603781 A G 0.1149 5346
# cat f2
CHR SNP A1 A2 MAF NCHROBS
1 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1273 4546
1 rs6603781 G A 0.0249 4546
# # awk 'NR==FNR{$1=$1;a[$2]=$0;b[$2]=$3;next}b[$2]==$3||b[$2]==$4{print $2 > "nonambig.txt"}$1!~"[A-Z]"{$1=a[$2];print}' f1 f2
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
# cat nonambig.txt
SNP
rs2980319
rs2980300
rs6603781