In two previous posts (here) and (here), I received help from forum members comparing multiple fields across two files and selectively printing portions of each as output based upon would-be matches using [g]awk. I had been fairly comfortable populating awk arrays with fields and using awk's special "in" statement to find records either matching or non-matching, but had difficulty getting these one-liners transformed into shell scripts and also I encountered an impasse when I realized that awk's behavior is to give precedence to the last line number in an array when one of its elements is repeated (AWK Man page; I hope I stated that correctly).
In order to disambiguate repeated records in arrays I attempted inserting an additional field into a file composed of a discrete numbers. However, I learned in a response to a previous post (thank you Scrutinizer) that awk's pre-increment feature can adequately handle these repeats and thus the need to introduce an additional, discrete, disambiguating field is superfluous.
I am now back to square one trying to put this all together. I apologize in advance for the bevy of similar posts, but I'm hoping that my groping in the dark will prove useful to other users.
I have two files such as:
File1
PS012,007 M>MR 0 2 -1 -1 -1 1 -1 -1 -1 0 2 1 2 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 H 0 7 -1 -1 -1 -1 -1 -1 3 1 2 -1 7 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 D -1 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 MRJ> 0 3 -1 -1 -1 1 -1 -1 -1 0 0 2 3 2 2 -1 -1 -1 0 502 0
PS012,007 M>MR 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 2 2 -1 -1 -1 0 521 0
PS012,007 HW= 0 8 -1 -1 -1 -1 -1 -1 -1 1 2 -1 8 8 2 -1 -1 -1 0 502 0
PS012,007 DKJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 13 2 -1 -1 -1 0 521 0
PS012,007 S>M 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 GBJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 2 -1 -1 -1 -1 0 502 0
PS012,007 D -1 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 519 0
PS012,007 BXR 0 1 0 17 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 >R< 0 2 -1 -1 -1 1 -1 -1 -1 0 1 3 2 5 2 -1 -1 -1 0 505 0
PS012,007 W 0 6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 509 0
PS012,007 DKJ 0 1 6 18 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 XD -1 2 -1 -1 -1 1 -1 -1 -1 0 0 2 2 2 1 -1 -1 -1 0 506 0
PS012,007 B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 CB< -1 2 -1 -1 -1 1 -1 -1 -1 0 0 3 2 5 2 -1 -1 -1 0 506 0
File2
PS012,007 >MRH M>MR
PS012,007 ? H
PS012,007 ? D
PS012,007 JHWH MRJ>
PS012,007 >MRH M>MR
PS012,007 VHR ?
PS012,007 ? HW=
PS012,007 ? DKJ
PS012,007 KSP S>M
PS012,007 ? GBJ
PS012,007 ? D
PS012,007 YRP BXR
PS012,007 B B
PS012,007 <LJL >R<
PS012,007 L ?
PS012,007 H ?
PS012,007 >RY ?
PS012,007 ? W
PS012,007 ZQQ DKJ
PS012,007 CB<= ?
PS012,007 ? XD
PS012,007 ? B
PS012,007 ? CB<
I would like to accomplish the following: When $1 and $3 of File2 match $1 and $2 of File1, then print $1-$3 of File2 followed by $3-$23 of File1, and if there is a "?" in $3 of File2, then write $1-$3 followed by 21x "-" (dash) after as $4-$24 in the output. However, in my data files there are places where records are repeated in $3 of File2, such as DKJ in the lines
PS012,007 ? DKJ
PS012,007 ZQQ DKJ
in the sample provided for File2. The only feature different between $1 and $3 in these two lines is the order in which the appear in the file.
The output should look something like this:
Desired Output
PS012,007 >MRH M>MR 0 2 -1 -1 -1 1 -1 -1 -1 0 2 1 2 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? H 0 7 -1 -1 -1 -1 -1 -1 3 1 2 -1 7 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? D -1 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 JHWH MRJ> 0 3 -1 -1 -1 1 -1 -1 -1 0 0 2 3 2 2 -1 -1 -1 0 502 0
PS012,007 >MRH M>MR 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 2 2 -1 -1 -1 0 521 0
PS012,007 VHR ? - - - - - - - - - - - - - - - - - - - - -
PS012,007 ? HW= 0 8 -1 -1 -1 -1 -1 -1 -1 1 2 -1 8 8 2 -1 -1 -1 0 502 0
PS012,007 ? DKJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 13 2 -1 -1 -1 0 521 0
PS012,007 KSP S>M 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? GBJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 2 -1 -1 -1 -1 0 502 0
PS012,007 ? D -1 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 519 0
PS012,007 YRP BXR 0 1 0 17 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 B B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 <LJL >R< 0 2 -1 -1 -1 1 -1 -1 -1 0 1 3 2 5 2 -1 -1 -1 0 505 0
PS012,007 L ? - - - - - - - - - - - - - - - - - - - - -
PS012,007 H ? - - - - - - - - - - - - - - - - - - - - -
PS012,007 >RY ? - - - - - - - - - - - - - - - - - - - - -
PS012,007 ? W 0 6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 509 0
PS012,007 ZQQ DKJ 0 1 6 18 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 CB<= ? - - - - - - - - - - - - - - - - - - - - -
PS012,007 ? XD -1 2 -1 -1 -1 1 -1 -1 -1 0 0 2 2 2 1 -1 -1 -1 0 506 0
PS012,007 ? B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? CB< -1 2 -1 -1 -1 1 -1 -1 -1 0 0 3 2 5 2 -1 -1 -1 0 506 0
The code that I am currently attempting as the basis for what I ultimately hope will be a shell script is the following one-liner:
awk 'NR==FNR {A[$1,$2,++C1[$1,$2]]=$0; next}$3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A{print $0,A[$1,$3,C2[$1,$3]]}' File1 File2
The output of this line of code is:
PS012,007 JHWH MRJ> PS012,007 MRJ> 0 3 -1 -1 -1 1 -1 -1 -1 0 0 2 3 2 2 -1 -1 -1 0 502 0
PS012,007 >MRH M>MR PS012,007 M>MR 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 2 2 -1 -1 -1 0 521 0
PS012,007 ? HW= PS012,007 HW= 0 8 -1 -1 -1 -1 -1 -1 -1 1 2 -1 8 8 2 -1 -1 -1 0 502 0
PS012,007 ? DKJ PS012,007 DKJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 13 2 -1 -1 -1 0 521 0
PS012,007 KSP S>M PS012,007 S>M 0 2 -1 -1 -1 1 -1 -1 -1 0 2 3 2 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? GBJ PS012,007 GBJ 0 1 0 17 1 1 -1 62 -1 0 0 3 13 2 -1 -1 -1 -1 0 502 0
PS012,007 ? D PS012,007 D -1 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 519 0
PS012,007 YRP BXR PS012,007 BXR 0 1 0 17 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 B B PS012,007 B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 <LJL >R< PS012,007 >R< 0 2 -1 -1 -1 1 -1 -1 -1 0 1 3 2 5 2 -1 -1 -1 0 505 0
PS012,007 ? W PS012,007 W 0 6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 6 6 -1 -1 -1 -1 0 509 0
PS012,007 ZQQ DKJ PS012,007 DKJ 0 1 6 18 1 1 -1 62 -1 0 0 2 1 1 -1 -1 -1 -1 0 521 0
PS012,007 ? XD PS012,007 XD -1 2 -1 -1 -1 1 -1 -1 -1 0 0 2 2 2 1 -1 -1 -1 0 506 0
PS012,007 ? B PS012,007 B 0 5 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 0 -1 -1 -1 -1 -1 -1 -1
PS012,007 ? CB< PS012,007 CB< -1 2 -1 -1 -1 1 -1 -1 -1 0 0 3 2 5 2 -1 -1 -1 0 506 0
This gets me close to what I am after, however, it appears that $1 and $2 of File1 (as $4 and $5 in output) is intervening between $4 and $6 when it should not. Issue (1) is that I cannot adjust my print command correctly to get rid of those fields so as to match the "desired output."
The second two issues I am experiencing are related. I cannot seem to get this one-liner to work as a script. Secondly, and most obviously, there are no "dashes" present for I did not supply an "if/else if" statement. To do so, I have attempted the following TestScript1.awk and TestScript2.awk as follows:
TestScript1.awk
FNR==NR {
q=$1 SUBSEP $2 SUBSEP ++C1[$1,$2]
A[q]=$0
next
}
{
if($3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A)
print $0,A[$1,$3,C2[$1,$3]]
else if($3 ~/\?/)
print $0, " - - - - - - - - - - - - - - - - - - - - - "
}
TestScript2.awk
FNR==NR {
A[$1,$2,++C1[$1,$2]]=$0
next
}
{
if($3 !~/\?/ && ($1,$3,++C2[$1,$3]) in A)
print $0,A[$1,$3,C2[$1,$3]]
else if($3 ~/\?/)
print $0, " - - - - - - - - - - - - - - - - - - - - - "
}
My hope is to learn inductively by working backwards from a representative, correct code so as to save the Desired Output as a file and then repeat the search and print method with $1 and $2 of File2 (rather than $1 and $3) with a different File1.
Thank you all so much and hopefully this results in codes and methods than can be beneficial to others as well.