Comparing two files

Hello all, I am not a programmer, but I require a little help with a project I am doing. I did read several posts and looks like awk or python may help me, though I know very little about using them. Here is my question: I have first file with 6 column.

CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346

My second file is very similar to the first one, but it may or may not have the same Column 2(SNP). I suspect that the columns Col 3 and 4 (A1 & A2) may be different as well.

What I require is to get an output file with columns 1,2,3, 4, 5, 6 from the first file and the corresponding line that matches column 2 (SNP) of the FIRST file with the columns 2,3,4,5,6 (SNP, A1, A2, MAF, NCHROBS) from SECOND file at positions 7,8,9,10,11,2. The output file will hence have 11 columns; the first 6 from file 1.txt and the matching last five from file 2.txt

CHR             SNP   A1   A2          MAF  NCHROBS
       0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362    rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1773     5346    rs2980300    T    C       0.1273     4546
   1       rs6603781    A    G       0.1149     5346    rs6603781    G    A       0.0249     4546

Thank you for reading

Hi,

Assuming your file1 and file2 are as follows:

$ cat f1
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346

$ cat f2
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1273     4546
   1       rs6603781    G    A       0.0249     4546

Try this:

$ awk 'NR==FNR{k[$2]=sprintf(" %s %s %s %s %s",$2,$3,$4,$5,$6);next}{print $1,$2,$3,$4,$5,$6 k[$2]}' f2 f1
CHR SNP A1 A2 MAF NCHROBS SNP A1 A2 MAF NCHROBS
0 SNP_A-8414268 A G 0.1522 5354
1 rs12565286 C G 0.04139 5340
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
$ 
$ cat file1
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SNP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5340
  1       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346
$ 
$ 
$ cat file2
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SMP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5349
  1       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346
$ 
$ ##
$ perl -lne 'chomp; if ($.>1) {if($ARGV eq "file1"){$x{substr($_,3)}=substr($_,3)}
>            else {print $_,$x{substr($_,3)}}}' file1 file2
CHR             SNP   A1   A2          MAF  NCHROBS
  0   SMP_A-8414268    A    G       0.1522     5354
  1      rs12565286    C    G      0.04139     5349
  1       rs2980319    A    T       0.1503     5362       rs2980319    A    T       0.1503     5362
  1       rs2980300    T    C       0.1773     5346       rs2980300    T    C       0.1773     5346
  1       rs6603781    A    G       0.1149     5346       rs6603781    A    G       0.1149     5346
$ 
$ 

tyler_durden

Thanks a lot Ripat and tyler_durden.
The awk command worked like a charm.
Can you help me with one more little detail.

Here is my sample file from the previous step:

1	rs4075116	G	A	0.2857	546	rs4075116	C	T	0.2646	2732
1	rs11260595	T	G	0.02451	612	rs11260595	A	C	0.02668	2774
1	rs6604968	C	T	0.1672	616	rs6604968	G	A	0.137	2810
1	rs11260554	A	C	0.09547	618	rs11260554	T	G	0.1153	2810
1	rs6603781	G	A	0.1234	608	rs6603781	A	G	0.1196	2810

I want the awk command to read col 3 and then look for that value in Col 8 and 9 on the same row . If it does not find the value in Col 8 and 9, then write the value of Col 2 to the output file output.txt

I am trying to learn the NR==FNR thingy.. until I grasp that.. kindly help.
This is what I came up with, but not sure if it is correct! :frowning:

awk '{if(NR>1 && $3 !=$8 && $3!=$9){print  $2}}' All_matchingsnps_in_bothdatasets.txt >nonambig.txt

Base on original data sample:

# cat f1
CHR             SNP   A1   A2          MAF  NCHROBS
    0   SNP_A-8414268    A    G       0.1522     5354
   1      rs12565286    C    G      0.04139     5340
   1       rs2980319    A    T       0.1503     5362
   1       rs2980300    T    C       0.1773     5346
   1       rs6603781    A    G       0.1149     5346
# cat f2
CHR             SNP   A1   A2          MAF  NCHROBS
   1       rs2980319    A    T       0.1503     4252
   1       rs2980300    T    C       0.1273     4546
   1       rs6603781    G    A       0.0249     4546
# # awk 'NR==FNR{$1=$1;a[$2]=$0;b[$2]=$3;next}b[$2]==$3||b[$2]==$4{print $2 > "nonambig.txt"}$1!~"[A-Z]"{$1=a[$2];print}' f1 f2
1 rs2980319 A T 0.1503 5362 rs2980319 A T 0.1503 4252
1 rs2980300 T C 0.1773 5346 rs2980300 T C 0.1273 4546
1 rs6603781 A G 0.1149 5346 rs6603781 G A 0.0249 4546
# cat nonambig.txt
SNP
rs2980319
rs2980300
rs6603781

Can you kindly help by interpreting this code, that will help me understand it. It would be very helpful and I would appreciate it very much.

 awk 'NR==FNR{$1=$1;a[$2]=$0;b[$2]=$3;next}b[$2]==$3||b[$2]==$4{print $2 > "nonambig.txt"}$1!~"[A-Z]"{$1=a[$2];print}' f1 f2