Matching fields between two files, repeated records

jvoot · January 8, 2018, 6:26pm

In two previous posts (here) and (here), I received help from forum members comparing multiple fields across two files and selectively printing portions of each as output based upon would-be matches using [g]awk. I had been fairly comfortable populating awk arrays with fields and using awk's special "in" statement to find records either matching or non-matching, but had difficulty getting these one-liners transformed into shell scripts and also I encountered an impasse when I realized that awk's behavior is to give precedence to the last line number in an array when one of its elements is repeated (AWK Man page; I hope I stated that correctly).

In order to disambiguate repeated records in arrays I attempted inserting an additional field into a file composed of a discrete numbers. However, I learned in a response to a previous post (thank you Scrutinizer) that awk's pre-increment feature can adequately handle these repeats and thus the need to introduce an additional, discrete, disambiguating field is superfluous.

I am now back to square one trying to put this all together. I apologize in advance for the bevy of similar posts, but I'm hoping that my groping in the dark will prove useful to other users.

I have two files such as:

File1

PS012,007 M>MR                0   2 -1 -1 -1  1 -1   -1 -1  0  2     1   2   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 H                   0   7 -1 -1 -1 -1 -1   -1  3  1  2    -1   7   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 D                  -1   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 MRJ>                0   3 -1 -1 -1  1 -1   -1 -1  0  0     2   3   2   2      -1      -1      -1    0  502     0
PS012,007 M>MR                0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   2   2      -1      -1      -1    0  521     0
PS012,007 HW=                 0   8 -1 -1 -1 -1 -1   -1 -1  1  2    -1   8   8   2      -1      -1      -1    0  502     0
PS012,007 DKJ                 0   1  0 17  1  1 -1   62 -1  0  0     3  13  13   2      -1      -1      -1    0  521     0
PS012,007 S>M                 0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 GBJ                 0   1  0 17  1  1 -1   62 -1  0  0     3  13   2  -1      -1      -1      -1    0  502     0
PS012,007 D                  -1   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  519     0
PS012,007 BXR                 0   1  0 17  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0
PS012,007 B                   0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 >R<                 0   2 -1 -1 -1  1 -1   -1 -1  0  1     3   2   5   2      -1      -1      -1    0  505     0
PS012,007 W                   0   6 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  509     0
PS012,007 DKJ                 0   1  6 18  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0
PS012,007 XD                 -1   2 -1 -1 -1  1 -1   -1 -1  0  0     2   2   2   1      -1      -1      -1    0  506     0
PS012,007 B                   0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 CB<                -1   2 -1 -1 -1  1 -1   -1 -1  0  0     3   2   5   2      -1      -1      -1    0  506     0

File2

PS012,007 >MRH       M>MR
PS012,007 ?          H
PS012,007 ?          D
PS012,007 JHWH       MRJ>
PS012,007 >MRH       M>MR
PS012,007 VHR        ?
PS012,007 ?          HW=
PS012,007 ?          DKJ
PS012,007 KSP        S>M
PS012,007 ?          GBJ
PS012,007 ?          D
PS012,007 YRP        BXR
PS012,007 B          B
PS012,007 <LJL       >R<
PS012,007 L          ?
PS012,007 H          ?
PS012,007 >RY        ?
PS012,007 ?          W
PS012,007 ZQQ        DKJ
PS012,007 CB<=       ?
PS012,007 ?          XD
PS012,007 ?          B
PS012,007 ?          CB<

I would like to accomplish the following: When $1 and $3 of File2 match $1 and $2 of File1, then print $1-$3 of File2 followed by $3-$23 of File1, and if there is a "?" in $3 of File2, then write $1-$3 followed by 21x "-" (dash) after as $4-$24 in the output. However, in my data files there are places where records are repeated in $3 of File2, such as DKJ in the lines

PS012,007 ?          DKJ
PS012,007 ZQQ        DKJ

in the sample provided for File2. The only feature different between $1 and $3 in these two lines is the order in which the appear in the file.

The output should look something like this:

Desired Output

PS012,007 >MRH       M>MR      0   2 -1 -1 -1  1 -1   -1 -1  0  2     1   2   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 ?          H         0   7 -1 -1 -1 -1 -1   -1  3  1  2    -1   7   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 ?          D        -1   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 JHWH       MRJ>      0   3 -1 -1 -1  1 -1   -1 -1  0  0     2   3   2   2      -1      -1      -1    0  502     0
PS012,007 >MRH       M>MR      0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   2   2      -1      -1      -1    0  521     0
PS012,007 VHR        ?         -   -  -  -  -  -  -    -  -  -  -     -   -   -   -       -       -       -    -    -     -
PS012,007 ?          HW=       0   8 -1 -1 -1 -1 -1   -1 -1  1  2    -1   8   8   2      -1      -1      -1    0  502     0
PS012,007 ?          DKJ       0   1  0 17  1  1 -1   62 -1  0  0     3  13  13   2      -1      -1      -1    0  521     0
PS012,007 KSP        S>M       0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 ?          GBJ       0   1  0 17  1  1 -1   62 -1  0  0     3  13   2  -1      -1      -1      -1    0  502     0
PS012,007 ?          D        -1   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  519     0
PS012,007 YRP        BXR       0   1  0 17  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0  
PS012,007 B          B         0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 <LJL       >R<       0   2 -1 -1 -1  1 -1   -1 -1  0  1     3   2   5   2      -1      -1      -1    0  505     0 
PS012,007 L          ?         -   -  -  -  -  -  -    -  -  -  -     -   -   -   -       -       -       -    -    -     -
PS012,007 H          ?         -   -  -  -  -  -  -    -  -  -  -     -   -   -   -       -       -       -    -    -     -
PS012,007 >RY        ?         -   -  -  -  -  -  -    -  -  -  -     -   -   -   -       -       -       -    -    -     -
PS012,007 ?          W         0   6 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  509     0
PS012,007 ZQQ        DKJ       0   1  6 18  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0
PS012,007 CB<=       ?         -   -  -  -  -  -  -    -  -  -  -     -   -   -   -       -       -       -    -    -     -
PS012,007 ?          XD       -1   2 -1 -1 -1  1 -1   -1 -1  0  0     2   2   2   1      -1      -1      -1    0  506     0
PS012,007 ?          B         0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1  
PS012,007 ?          CB<      -1   2 -1 -1 -1  1 -1   -1 -1  0  0     3   2   5   2      -1      -1      -1    0  506     0

The code that I am currently attempting as the basis for what I ultimately hope will be a shell script is the following one-liner:

awk 'NR==FNR {A[$1,$2,++C1[$1,$2]]=$0; next}$3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A{print $0,A[$1,$3,C2[$1,$3]]}' File1 File2

The output of this line of code is:

PS012,007 JHWH       MRJ>        PS012,007 MRJ>                0   3 -1 -1 -1  1 -1   -1 -1  0  0     2   3   2   2      -1      -1      -1    0  502     0
PS012,007 >MRH       M>MR        PS012,007 M>MR                0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   2   2      -1      -1      -1    0  521     0
PS012,007 ?          HW=         PS012,007 HW=                 0   8 -1 -1 -1 -1 -1   -1 -1  1  2    -1   8   8   2      -1      -1      -1    0  502     0
PS012,007 ?          DKJ         PS012,007 DKJ                 0   1  0 17  1  1 -1   62 -1  0  0     3  13  13   2      -1      -1      -1    0  521     0
PS012,007 KSP        S>M         PS012,007 S>M                 0   2 -1 -1 -1  1 -1   -1 -1  0  2     3   2   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 ?          GBJ         PS012,007 GBJ                 0   1  0 17  1  1 -1   62 -1  0  0     3  13   2  -1      -1      -1      -1    0  502     0
PS012,007 ?          D           PS012,007 D                  -1   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  519     0
PS012,007 YRP        BXR         PS012,007 BXR                 0   1  0 17  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0
PS012,007 B          B           PS012,007 B                   0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 <LJL       >R<         PS012,007 >R<                 0   2 -1 -1 -1  1 -1   -1 -1  0  1     3   2   5   2      -1      -1      -1    0  505     0
PS012,007 ?          W           PS012,007 W                   0   6 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   6   6  -1      -1      -1      -1    0  509     0
PS012,007 ZQQ        DKJ         PS012,007 DKJ                 0   1  6 18  1  1 -1   62 -1  0  0     2   1   1  -1      -1      -1      -1    0  521     0
PS012,007 ?          XD          PS012,007 XD                 -1   2 -1 -1 -1  1 -1   -1 -1  0  0     2   2   2   1      -1      -1      -1    0  506     0
PS012,007 ?          B           PS012,007 B                   0   5 -1 -1 -1 -1 -1   -1 -1 -1 -1    -1   5   0  -1      -1      -1      -1   -1   -1    -1
PS012,007 ?          CB<         PS012,007 CB<                -1   2 -1 -1 -1  1 -1   -1 -1  0  0     3   2   5   2      -1      -1      -1    0  506     0

This gets me close to what I am after, however, it appears that $1 and $2 of File1 (as $4 and $5 in output) is intervening between $4 and $6 when it should not. Issue (1) is that I cannot adjust my print command correctly to get rid of those fields so as to match the "desired output."

The second two issues I am experiencing are related. I cannot seem to get this one-liner to work as a script. Secondly, and most obviously, there are no "dashes" present for I did not supply an "if/else if" statement. To do so, I have attempted the following TestScript1.awk and TestScript2.awk as follows:

TestScript1.awk

FNR==NR {
	q=$1 SUBSEP $2 SUBSEP ++C1[$1,$2]
	A[q]=$0
	next
}
{
if($3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A)
	print $0,A[$1,$3,C2[$1,$3]]
else if($3 ~/\?/)
	print $0, " - - - - - - - - - - - - - - - - - - - - - "
}

TestScript2.awk

FNR==NR {
	A[$1,$2,++C1[$1,$2]]=$0
	next
}
{
if($3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A)
	print $0,A[$1,$3,C2[$1,$3]]
else if($3 ~/\?/)
	print $0, " - - - - - - - - - - - - - - - - - - - - - "
}

My hope is to learn inductively by working backwards from a representative, correct code so as to save the Desired Output as a file and then repeat the search and print method with $1 and $2 of File2 (rather than $1 and $3) with a different File1.

Thank you all so much and hopefully this results in codes and methods than can be beneficial to others as well.

RudiC · January 9, 2018, 8:22am

Except for the formatting, how far would this get you:

awk '
NR == FNR       {q = $1 "," $2
                 $1 = $2 = ""
                 T[q "," ++C[q]] = $0
                 next
                }
                {q = $1 "," $3
                 X = q "," ++D[q]
                 printf "%s\t",  $0
                 if (X in T)    print T[X]
                 if ($3 == "?") print "  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -"
                }
' file1 file2

Please note, that

does not quite accurately describe the situation - array elements are just overwritten when indices are encountered another time(s), and awk does this as any other programming language.

jvoot · January 9, 2018, 1:47pm

rudic:

Except for the formatting, how far would this get you:
awk '
NR == FNR       {q = $1 "," $2
   $1 = $2 = ""
   T[q "," ++C[q]] = $0
   next
   }
   {q = $1 "," $3
   X = q "," ++D[q]
   printf "%s\t",  $0
   if (X in T)    print T[X]
   if ($3 == "?") print "  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -"
   }
' file1 file2
Please note, that does not quite accurately describe the situation - array elements are just overwritten when indices are encountered another time(s), and awk does this as any other programming language.

Thank you so much for this RudiC. After a quick test, initial indications are that that did the trick and I can run it through a printf to take care of the formatting.

I have not seen before the "," in your variable and then "T" array. I suspect this is giving an extra coordinate to the array to make it multi-dimensional but will do more research in AWK man. I really appreciate your help.

RudiC · January 10, 2018, 4:18am

In fact, the "," overrides the SUBSEP character in the array index creation (as can do any character, e.g. <TAB>). I use this sometimes if I print out the array with inidices for debug purposes.