Match Strings between two files, print portions of each file together when matched ([g]awk)

I have two files and want to use the strings from $1 of file 1 (file1.txt) as search criteria to find matches in $2 of file 2 (file2.txt). When a match is found, I want to output the entire line of file 2 (file2.txt) followed by fields $2-$11 of file 1 (file1.txt). I can find the matches, but I cannot get the output from both files to print correctly. The data looks like this:

file1.txt

  1 >CR=        2 -1 -1 -1  5  0 -1 -1  3  2
  2 H           0 -1 -1 -1 -1 -1 -1 -1 -1 -1
  3 >JC         2 -1 -1 -1  1  0 -1 -1  1  2
  4 
  5 >CR         6 -1 -1 -1 -1 -1 -1 -1 -1 -1

file2.txt

  1  2 >CR=                 2 VWB
  2  2 >JC                  2 GBR
  3                           
  4  6 >CR                  6 D

Desired Output:

1 2 >CR=                 2 VWB    2 -1 -1 -1  5  0 -1 -1  3  2
2 2 >JC                  2 GBR    2 -1 -1 -1  1  0 -1 -1  1  2
3                          
4 6 >CR                  6 D      6 -1 -1 -1 -1 -1 -1 -1 -1 -1

I can find the matches between the two files and their respective fields easily enough, and print the matching lines of file2.txt, with:

awk 'NR==FNR{a[$1];next}$2 in a{print $0}' file1.txt file2.txt

Yet when it comes to printing the desired fields from the first file after the lines of the second file, I run aground.

Thanks in advance for your help. If I may, could the respondent also offer a word or two of explanation? As a beginner, I often run into difficulty when attempting to use arrays in awk.

Hello jvoot,

Could you please try the following and let me know if it helps?

awk 'FNR==NR{q=$2;$1=$2="";A[q]=$0;next} ($3 in A){print $0,A[$3]}'  Input_file1   Input_file2

Thanks,
R. Singh

Thanks so much for this R. Singh. Unfortunately, it didn't seem to do the trick. For example, line 4 of 'Desired Output' should be:

4 6 >CR                  6 D      6 -1 -1 -1 -1 -1 -1 -1 -1 -1

But in the code you offered, it is:

6 >CR                  6 D   -1 -1 -1 -1 -1 -1 -1 -1 -1

Looking at $5 of the output from your code, which is supposed to be $2 of file1.txt, I see values that do not occur anywhere in $2 of file1.txt.

Thanks so much for the attempt though!

Hello jvoot,

My code produces exactly the output you are expecting; see the following.

awk 'FNR==NR{q=$2;$1=$2="";A[q]=$0;next} ($3 in A){print $0,A[$3]}' file1  file2
1  2 >CR=                 2 VWB   2 -1 -1 -1 5 0 -1 -1 3 2
2  2 >JC                  2 GBR   2 -1 -1 -1 1 0 -1 -1 1 2
3
4  6 >CR                  6 D   6 -1 -1 -1 -1 -1 -1 -1 -1 -1
 

Thanks,
R. Singh

Interesting. I just ran it again, and it still doesn't seem to be working. The four lines of output you posted are correct, but the four that I get are off. I'm going to have to do some more investigating.

Here is my output for

awk 'FNR==NR{q=$2;$1=$2="";A[q]=$0;next} ($3 in A){print $0,A[$3]}' file1 file2

 2 >CR=                 2 VWB   -1 -1 -1 1 0 -1 -1 1 0
 2 >JC                  2 GBR   -1 -1 -1 1 0 -1 -1 1 0
                            
 6 >CR                  6 D   -1 -1 -1 -1 -1 -1 -1 -1 -1

Ah, R. Singh! I think I figured it out! In your code:

awk 'FNR==NR{q=$2;$1=$2="";A[q]=$0;next} ($3 in A){print $0,A[$3]}' file1 file2

Your code sets both fields $1 and $2 to null (""), so I think it is just repeating one of the fields when it prints. If only $1 of file1 is set to null ($1=""), then it works as you say for the lines of output here.

awk 'FNR==NR{q=$2;$1="";A[q]=$0;next} ($3 in A){print $0,A[$3]}' file1 file2

 2 >CR=                 2 VWB  2 -1 -1 -1 1 0 -1 -1 1 0
 2 >JC                  2 GBR  2 -1 -1 -1 1 0 -1 -1 1 0
                           
 6 >CR                  6 D  6 -1 -1 -1 -1 -1 -1 -1 -1 -1

However, even at that, as I scan through the rest of the output there are other lines not showing the correct values.

A small correction to RavinderSingh13's proposal yields

awk '
FNR==NR         {q    = $1
                 $1   = $2 = ""
                 A[q] = $0
                 next
                }
                {print $0, A[$2]
                }
'  file1 file2
  1  2 >CR=                 2 VWB   0 -1 -1 -1 -1 -1 -1 -1 -1 -1
  2  2 >JC                  2 GBR   0 -1 -1 -1 -1 -1 -1 -1 -1 -1
  3                            
  4  6 >CR                  6 D 

which is as close to the requested output as you can get. Please note that the desired output given in post#1 does NOT satisfy the specification as there's NO match of file2's line 4's 6 in file1!
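
A side note for anyone adapting the command above, offered as a minimal sketch rather than as part of the fix: it prints A[$2] for every line of file2 without first testing whether $2 is an index of A. In awk, merely referencing a missing element creates it with an empty value, whereas the "in" operator only tests for membership. The array contents and keys below are made up for illustration.

```shell
#!/bin/sh
# Illustration only: the keys "x" and "y" are hypothetical.
out=$(awk 'BEGIN {
    A["x"] = 1
    print ("y" in A)      # membership test; does not create A["y"]
    v = A["y"]            # plain reference; creates A["y"] with an empty value
    print ("y" in A)      # the element exists now
}')
printf '%s\n' "$out"
```

The first test prints 0 and the second prints 1. For this task the difference is harmless, since unmatched lines simply get an empty string appended, but it matters when the array is later looped over or its size is checked.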

I see what was wrong. In my example data I foolishly included the line numbers from my file. Once I adjusted fields for this error it worked flawlessly.
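
The thread does not show the final command, so the following is only a sketch of what the adjusted version might look like. It assumes the sample data from post #1 with the leading line numbers stripped, so that the key is $1 in file1 and $2 in file2, as the original specification says. The temporary file paths, the variable names (key, rest, suffix) and the three-space separator are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical test data: the rows from post #1 without the stray line numbers.
tmp=$(mktemp -d)
cat > "$tmp/file1.txt" <<'EOF'
>CR=        2 -1 -1 -1  5  0 -1 -1  3  2
H           0 -1 -1 -1 -1 -1 -1 -1 -1 -1
>JC         2 -1 -1 -1  1  0 -1 -1  1  2

>CR         6 -1 -1 -1 -1 -1 -1 -1 -1 -1
EOF
cat > "$tmp/file2.txt" <<'EOF'
2 >CR=                 2 VWB
2 >JC                  2 GBR

6 >CR                  6 D
EOF

out=$(awk '
NR == FNR {                     # true only while the first file is read
    if (NF) {                   # ignore blank lines so "" never becomes a key
        key = $1                # save the key before the field is blanked
        $1 = ""                 # awk rebuilds $0 as " f2 f3 ..." (single spaces)
        rest[key] = substr($0, 2)   # drop the leading space left by the empty $1
    }
    next
}
{                               # second file: print every line
    suffix = ($2 in rest) ? "   " rest[$2] : ""
    print $0 suffix             # append the stored fields only on a match
}' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Because assigning to a field makes awk rebuild the line, the appended file1 fields come out separated by single spaces rather than in their original column alignment; the file2 line itself is never modified and keeps its spacing.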

Thank you so much RudiC and R. Singh.

If it is not too much trouble, would someone be kind enough to explain this code for me? I'm following it fairly well, but I got confused where variable "q" is assigned the value of $2 and then used with array A.

Hello jvoot,

Could you please go through the following explanation and let me know if it helps?

awk 'FNR==NR{                  ### TRUE only while the first file (file1) is being read.
                q=$2;          ### Save $2 (second field) of the current line in variable q.
                $1=$2="";      ### Empty the first and second fields; awk rebuilds $0 with those fields blank.
                A[q]=$0;       ### Store the rebuilt line in array A under index q.
                next           ### next skips the remaining rules and moves on to the next input line.
            }
    ($3 in A){                 ### For file2 lines: test whether $3 exists as an index of array A.
                print $0,A[$3] ### Print the current line followed by the value stored in A[$3].
             }
    ' file1  file2             ### The input files: file1 is read first, then file2.
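
To make the role of q concrete, here is a minimal sketch that runs the same three assignments on a single hypothetical input line (shortened from the first row of file1.txt as posted, line number included) and then dumps the array. The printf format and the loop variable k are additions for illustration only.

```shell
#!/bin/sh
# One sample line in the shape of file1.txt from post #1 (shortened).
out=$(printf '%s\n' '  1 >CR=        2 -1 -1' | awk '{
    q = $2                 # copy of the key, taken before the field is blanked
    $1 = $2 = ""           # blank both fields; awk rebuilds $0 using single spaces
    A[q] = $0              # q is the index; the rebuilt line is the stored value
    for (k in A) printf "key=[%s] value=[%s]\n", k, A[k]
}')
printf '%s\n' "$out"
```

This prints key=[>CR=] value=[  2 -1 -1]. The variable q is needed because $2 is already empty by the time the line is stored, and the two leading spaces in the value are the separators left behind by the two blanked fields.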
 

Thanks,
R. Singh
