Comparing two files and inserting new fields

GoldenFire · November 17, 2010, 6:52pm

Hi all,

I searched the forum and tried to learn from the similar posts. However, I am new and I need to get help on this. I hope an expert kindly help me to sort this out.

I need to compare field 1 and 2 of the first file with the same fields of the second file and if both fields matches insert the field 9 and 10 of the second file to the position 1 and 2 of the first file.

1st file:

F1      F2      F3      F4      F5      F6      F7      F8      F9      F10     F11     F12     F13     F14
180193  99999   15960   1       18      19      3       16      11      0       54      0.01    99999   99999
180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
999999  13721   12548   1       20      9       -12     7       -1      0       66      0       0       7
180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
999999  15264   11446   1       22      10      -9      9       1       0       64      99999   99999   99999
170265  99999   15960   1       26      17      3       12      8       0       54      0.01    99999   8

2nd file:

F1      F2       F3              F4             F5      F6      F7      F8      F9      F10     F11     
150546  99999   DFKMDNBL        MFDNDVHFD       25      MH      2       2       90260   258794  1296                    
152602  99999   GFMMBDFD        DFGDGDBGB       65      RF      3       6       30268   259761  907                     
160940  99999   DFGHDGTH        BBVCSDRRG       98      WD      5       5       65923   244552  720                     
165230  99999   HHDDHRTT        GTTHDTGBH       32      AS      4       6       25430   246695  1265                    
170265  99999   RTVDVRRE        EEWFCSDFF       65      CD      9       5       26980   265986  1069                    
180193  99999   VVDBFYHK        NCMKSOSUF       25      YG      1       8       65971   245695  1089                    
184021  99999   DVGNWEPE        POSUGBNCB       98      FF      7       3       15482   256589  1315            
189750  99999   DFGGGHPL        FJFFDKSJSQ      65      DR      5       3       45681   236659  1329            
999999  13721   FREREGHH        CVFKCJUPK       35      PW      2       3       54261   210546  1122                    
999999  15264   GTUKPBCS        HGFJFJZASS      14      PK      2       5       22976   236598  1225

I'd like to have this:

F1      F2      F3      F4      F5      F6      F7      F8      F9      F10     F11     F12     F13     F14     F15     F16
65971   245695  180193  99999   15960   1       18      19      3       16      11      0       54      0.01    99999   99999   
65971   245695  180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7       
54261   210546  999999  13721   12548   1       20      9       -12     7       -1      0       66      0       0       7       
65971   245695  180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7       
22976   236598  999999  15264   11446   1       22      10      -9      9       1       0       64      99999   99999   99999   
26980   265986  170265  99999   15960   1       26      17      3       12      8       0       54      0.01    99999   8

Thanks in advance.

Chubler_XL · November 17, 2010, 9:06pm

Here you go

(With Fn headings in input files and output)

awk 'NR == FNR { A[$1"|"$2]=$9; B[$1"|"$2]=$10; OFS="\t" ; next}
 FNR == 1 { printf "F1\tF2\tF3\tF4\tF5\tF6\tF7\tF8\tF9\tF10\tF11\tF12\tF13\tF14\tF15\tF16\n" }
 FNR > 1 { print A[$1"|"$2], B[$1"|"$2], $0; } ' fileB fileA

(Without Fn headings)

awk 'NR == FNR { A[$1"|"$2]=$9; B[$1"|"$2]=$10; OFS="\t" ; next}
 { print A[$1"|"$2], B[$1"|"$2], $0; } ' fileB fileA

rdcwayx · November 17, 2010, 9:29pm

@Chubler_XL

Your output is right, because FileA (1st file) 's $1 and $2 are all in FileB (2nd file). If there are some different, you still print without $9,$10.

I change one number in 1st file.

F1      F2      F3      F4      F5      F6      F7      F8      F9      F10     F11     F12     F13     F14     F15     F16
65971   245695  180193  99999   15960   1       18      19      3       16      11      0       54      0.01    99999   99999
                180191  99999   51255   1       19      16      3       5       10      0       55      0       0       7
54261   210546  999999  13721   12548   1       20      9       -12     7       -1      0       66      0       0       7
65971   245695  180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
22976   236598  999999  15264   11446   1       22      10      -9      9       1       0       64      99999   99999   99999
26980   265986  170265  99999   15960   1       26      17      3       12      8       0       54      0.01    99999   8

Here is my code with headings.

awk 'NR==FNR {A[$1 FS $2]=$9 OFS $10; next}
FNR==1 {for (i=1;i<=NF+2;i++) printf "F"i OFS ;printf ORS}
FNR >1&&($1 FS $2 in A) { print A[$1 FS $2], $0; } '  OFS="\t"  FileB FileA

Chubler_XL · November 17, 2010, 9:41pm

Good point, but in my defense the original request wasn't specific about what should be output when a match wasn't found. It's probably slightly safer to put blank values for 9 and 10 rather than hiding the whole row.

BTW, I also considered a for loop to print the headings, but I suspect the real data has no headings or data specific headings that the OP renamed for ease of reference.

Great idea to use 9 OFS 10 as the hash array data.

GoldenFire · November 17, 2010, 10:07pm

Chubler_XL and rdcwayx ,
I can't thank you enough for your time and help.

My data has no headings and I just put them to make my code clear.
I tried

awk 'NR == FNR { A[$1"|"$2]=$9; B[$1"|"$2]=$10; OFS="\t" ; next}
 { print A[$1"|"$2], B[$1"|"$2], $0; } ' fileB fileA

the fields that added to my fileA aren't correct. Something must be wrong with it. I test your codes with headings and let you know if it works.
Thanks again!

Chubler_XL · November 17, 2010, 10:21pm

Be careful you put the files on the awk command line in the correct order (FileB = your 2nd file) (fielA = your 1st file).

Here is a transcript of my test (with slight enhancement - Use FS instead of "|" in array index - thanks rdcwayx).
I kept the two arrays as this ensures the output has the same number of fields when lookup of fields1 and 2 don't find any data.

$ cat fileA
180193  99999   15960   1       18      19      3       16      11      0       54      0.01    99999   99999
180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
999999  13721   12548   1       20      9       -12     7       -1      0       66      0       0       7
180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
999999  15264   11446   1       22      10      -9      9       1       0       64      99999   99999   99999
170265  99999   15960   1       26      17      3       12      8       0       54      0.01    99999   8
$ cat fileB
150546  99999   DFKMDNBL        MFDNDVHFD       25      MH      2       2       90260   258794  1296                    
152602  99999   GFMMBDFD        DFGDGDBGB       65      RF      3       6       30268   259761  907                     
160940  99999   DFGHDGTH        BBVCSDRRG       98      WD      5       5       65923   244552  720                     
165230  99999   HHDDHRTT        GTTHDTGBH       32      AS      4       6       25430   246695  1265                    
170265  99999   RTVDVRRE        EEWFCSDFF       65      CD      9       5       26980   265986  1069                    
180193  99999   VVDBFYHK        NCMKSOSUF       25      YG      1       8       65971   245695  1089                    
184021  99999   DVGNWEPE        POSUGBNCB       98      FF      7       3       15482   256589  1315            
189750  99999   DFGGGHPL        FJFFDKSJSQ      65      DR      5       3       45681   236659  1329            
999999  13721   FREREGHH        CVFKCJUPK       35      PW      2       3       54261   210546  1122                    
999999  15264   GTUKPBCS        HGFJFJZASS      14      PK      2       5       22976   236598  1225
$ awk 'NR == FNR { A[$1 FS $2]=$9; B[$1 FS $2]=$10; OFS="\t" ; next}
   { print A[$1 FS $2], B[$1 FS $2], $0; } ' fileB fileA
65971   245695  180193  99999   15960   1       18      19      3       16      11      0       54      0.01    99999   99999
65971   245695  180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
54261   210546  999999  13721   12548   1       20      9       -12     7       -1      0       66      0       0       7
65971   245695  180193  99999   51255   1       19      16      3       5       10      0       55      0       0       7
22976   236598  999999  15264   11446   1       22      10      -9      9       1       0       64      99999   99999   99999
26980   265986  170265  99999   15960   1       26      17      3       12      8       0       54      0.01    99999

rdcwayx · November 17, 2010, 10:37pm

If there is no headings, it will be more simple:

awk 'NR==FNR {A[$1 FS $2]=$9 OFS $10; next}
$1 FS $2 in A { print A[$1 FS $2], $0; } '  OFS="\t"  FileB FileA

GoldenFire · November 17, 2010, 10:57pm

Thank you very much Chubler XL . I tested it again and I found that your code is working flawlessly and the problem is with my data. Actually, both files are space-delimited but the spacing between the feilds are not the same. That's why in some cases the other feilds are printing insead of feild 9 and 10.
Is there a way to overcome this problem?

---------- Post updated at 09:57 PM ---------- Previous update was at 09:43 PM ----------

Thank you very much rdcwayx.
Your code is just awesome! It read and printed my huge file in a second!

However, as the spacing between fields are not regular, wrong fields are printed for some rows. I think I should do something for my irregular fileA first.

Chubler_XL · November 17, 2010, 11:13pm

Awk should handle a different number of spaces between each field (as long as you have at least 1 space).

You may need to run another script over you input files to get the columns seperated, something like

awk '{print substring($0,1,12)" "substring($0,13,8)" "...etc...; }'

GoldenFire · November 17, 2010, 11:30pm

Many thanks for your kind help.
I edited the files and both codes perfectly worked.