Comparing two files using four fields

Dear All,
I want to compare File1 and File2 (space-separated) using four fields (columns 1, 2, 4 and 5).
Logic: if columns 1 and 2 of File1 and File2 match exactly, and columns 4 and 5 of File2 both hold the same character, which is one of the characters in column 4 or 5 of File1, then those lines of File1 and File2 are concatenated and written to the output.
File1:

s2/80 20 . A T 86 N=2 F=5;U=4
s2/20 10 . G T 90 N=2 F=5;U=4 
s2/90 60 . C G 30 N=2 F=5;U=4
s2/40 70 . A G 80 N=2 F=5;U=4

File2:

s2/90 60 . G G 97 N=2 F=5;U=4 
s2/80 20 . A A 20 N=2 F=5;U=4 
s2/15 11 . A A 22 N=2 F=5;U=4 
s2/90 21 . C C 82 N=2 F=5;U=4 
s2/20 10 . G G 99 N=2 F=5;U=4
s2/40 70 . A G 70 N=2 F=5;U=4
s2/80 10 . T G 11 N=2 F=5;U=4 
s2/90 60 . G T 55 N=2 F=5;U=4

Expected Output:

s2/80 20 . A T 86 N=2 F=5;U=4 s2/80 20 . A A 20 N=2 F=5;U=4 
s2/20 10 . G T 90 N=2 F=5;U=4 s2/20 10 . G G 99 N=2 F=5;U=4 
s2/90 60 . C G 30 N=2 F=5;U=4 s2/90 60 . G G 97 N=2 F=5;U=4

I am new to this field and would appreciate your help.

  • Why is there no output for s2/40 70?
  • What does this mean:

Because columns 4 and 5 contain A G in both File1 and File2. If File1 has A G in columns 4 and 5, I want to select only the File2 lines that have A A or G G there. In general: if File1 has "X" in column 4 and "Y" in column 5, I want to select only the File2 lines that have "X" "X" or "Y" "Y" in columns 4 and 5.

  • s2/40 70 . A G 80 N=2 F=5;U=4

Try:

awk 'NR==FNR{A[$1,$2,$4]=$0; A[$1,$2,$5]=$0; next} $4==$5 && ($1,$2,$4) in A {print A[$1,$2,$4] ";" $0}' file1 file2

It works. Thank you very much.

The output has a semicolon where the two lines are concatenated. How do I avoid this ";"?

s2/80  20 . A  T 86  N=2 F=5;U=4;s2/80 20 . A A 20 N=2 F=5;U=4  
s2/20 10 .  G  T 90  N=2 F=5;U=4;s2/20 10 . G G 99 N=2 F=5;U=4  
s2/90 60 .  C  G 30  N=2 F=5;U=4;s2/90 60 . G G 97 N=2 F=5;U=4 

I want the output file to look like this:

s2/80 20 . A T 86 N=2 F=5;U=4 s2/80 20 . A A 20 N=2 F=5;U=4  
s2/20 10 . G T 90 N=2 F=5;U=4 s2/20 10 . G G 99 N=2 F=5;U=4  
s2/90 60 . C G 30 N=2 F=5;U=4 s2/90 60 . G G 97 N=2 F=5;U=4

NamS, if you look at the awk command suggested by Scrutinizer, you will see this in the print statement:

";"

Replace it with the following if you want a space instead:

" "
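
For reference, here is a sketch of what the whole command might look like with that change, spread over several lines and commented. The wrapper name join_matches is only an illustrative assumption and does not come from the thread; the awk program itself is Scrutinizer's command with ";" swapped for " ".

```shell
# join_matches is a hypothetical wrapper name used for illustration only.
# Usage: join_matches file1 file2
join_matches() {
  awk '
    # First file: store each line under two keys,
    # (col1, col2, col4) and (col1, col2, col5).
    NR == FNR { A[$1,$2,$4] = $0; A[$1,$2,$5] = $0; next }

    # Second file: keep a line only if col4 equals col5 and
    # (col1, col2, col4) was stored from the first file.
    # Print the stored first-file line, a space, then this line.
    $4 == $5 && ($1,$2,$4) in A { print A[$1,$2,$4] " " $0 }
  ' "$1" "$2"
}
```

Calling join_matches file1 file2 > output should then give space-joined lines. Note that the lines come out in the order they appear in file2, not file1, because printing happens while file2 is being read.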

Thanks mjf.