Compare and Merge files

Hi All,
I have two different files as shown below separated by a "|". I need to compare the first column from both the files and if they match merge both the columns.

File 1

"S00172012"|"CHRONIC RENAL FAILURE"|""|"I"
"S00159962"|"SUBENDO INFRC-INIT EPISD"|""|"I"
"S00255303"|"BENIGN NEOPLASM LG BOWEL"|""|"I"
"S00012361"|"DIABETES MELLITUS"|""|"I"
"S00052174"|"ALZHEIMER'S DISEASE"|""|"I"
"S00035358"|"SPRAIN OF KNEE & LEG NEC"|""|"I"
"S00058846"|"DIAB EYE MANIF TYPE I"|""|"I"
"S00050977"|"ALCOHOLIC GASTRITIS"|""|"I"
"S00005471"|"FX OLECRAN PROC ULNA-CL"|""|"I"
"S00005471"|"FX OLECRAN PROC ULNA-CL"|""|"I"
"S00026563"|"CONVULSIONS"|""|"I"
"S00008617"|"MALIGN NEOPL BREAST NOS"|""|"I"
"S00005471"|"FX OLECRAN PROC ULNA-CL"|""|"I"

File 2

"S00172012"|"00000274AWJO"|"904395"|"Y"|"20101001"|"420.90"|"262"|"584.9"|"576.1"|"428.0"|"287.5"|"427.31"|"562.11"|"574.51"|"2200"|"20110318"|""|"20110318"|"DRG"|"2200"|"Acute Care Inpatient R&B"
"S00159962"|"00000264B0NN"|"900003"|"Y"|"20100921"|"441.4"|"V45.89"|""|""|""|""|""|""|""|"6120"|"20110401"|""|"20110401"|"APC"|"6120"|"Radiology-MF POS"
"S00255303"|"00000264B0NN"|"900003"|"Y"|"20100921"|"441.4"|"V45.89"|""|""|""|""|""|""|""|"6160"|"20110401"|""|"20110401"|"APC"|"6160"|"Hospital Pharmacy-Drug-MF POS"
"S00012361"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"1520"|"20110401"|""|"20110401"|"APC"|"1520"|"Observation Room-Fac"
"S00052174"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"5135"|"20110401"|""|"20110401"|"APC"|"5135"|"Prev: Smoking Cessation"
"S00002340"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"5840"|"20110401"|""|"20110401"|"APC"|"5840"|"Laboratory & Pathology-MF POS"
"S00002340"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"6600"|"20110401"|""|"20110401"|"APC"|"6600"|"Venipuncture-Fac-MF POS"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"1920"|"20110416"|""|"20110416"|""|"1920"|"Laboratory & Pathology-HF POS"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"2120"|"20110416"|""|"20110416"|""|"2120"|"Physical Therapy Oth Prov"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"2200"|"20110415"|""|"20110415"|"DRG"|"2200"|"Acute Care Inpatient R&B"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"2360"|"20110416"|""|"20110416"|""|"2360"|"Radiology-HF POS"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"2400"|"20110416"|""|"20110416"|""|"2400"|"Hospital Pharmacy-Drug-Inpt"
"S00002364"|"00000035ARXB"|"903700"|"Y"|"20100204"|"802.4"|"812.01"|"294.8"|"427.31"|"802.0"|"412"|"250.00"|"401.9"|"272.4"|"3999"|"20110416"|""|"20110416"|""|"3999"|"Default Med Def-HF Risk"
"S00003163"|"00000133ARSG"|"900008"|"Y"|"20100513"|"780.4"|"530.81"|"723.1"|""|""|""|""|""|""|"1840"|"20110311"|""|"20110311"|"APC"|"1840"|"Emergency Room-Fac"
"S00003163"|"00000133ARSG"|"900008"|"Y"|"20100513"|"780.4"|"530.81"|"723.1"|""|""|""|""|""|""|"1920"|"20110311"|""|"20110311"|"APC"|"1920"|"Laboratory & Pathology-HF POS"
"S00003163"|"00000133ARSG"|"900008"|"Y"|"20100513"|"780.4"|"530.81"|"723.1"|""|""|""|""|""|""|"1980"|"20110311"|""|"20110311"|"APC"|"1980"|"EKG - ER"
"S00003163"|"00000133ARSG"|"900008"|"Y"|"20100513"|"780.4"|"530.81"|"723.1"|""|""|""|""|""|""|"2360"|"20110311"|""|"20110311"|"APC"|"2360"|"Radiology-HF POS"
"S00003163"|"00000133ARSG"|"900008"|"Y"|"20100513"|"780.4"|"530.81"|"723.1"|""|""|""|""|""|""|"2920"|"20110311"|""|"20110311"|"APC"|"2920"|"Venipuncture-Fac-HF POS"
"S00003552"|"00000281ASID"|"900003"|"Y"|"20101008"|"275.42"|"252.00"|""|""|""|""|""|""|""|"6120"|"20110415"|""|"20110415"|"APC"|"6120"|"Radiology-MF POS"
"S00003842"|"00000068B9QM"|"904395"|"Y"|"20100309"|"820.21"|"584.9"|"276.2"|"263.9"|"428.0"|"416.9"|"427.32"|"276.51"|"427.31"|"2200"|"20110422"|""|"20110422"|"DRG"|"2200"|"Acute Care Inpatient R&B"
"S00003968"|"00000188B0TP"|"900008"|"Y"|"20100707"|"787.20"|"427.31"|"401.9"|"414.01"|""|""|""|""|""|"1840"|"20110325"|""|"20110325"|"APC"|"1840"|"Emergency Room-Fac"
"S00003968"|"00000188B0TP"|"900008"|"Y"|"20100707"|"787.20"|"427.31"|"401.9"|"414.01"|""|""|""|""|""|"1840"|"20110415"|""|"20110415"|"APC"|"1840"|"Emergency Room-Fac"
"S00003968"|"00000188B0TP"|"900008"|"Y"|"20100707"|"787.20"|"427.31"|"401.9"|"414.01"|""|""|""|""|""|"1920"|"20110325"|""|"20110325"|"APC"|"1920"|"Laboratory & Pathology-HF POS"

Desired output :

"S00172012"|"CHRONIC RENAL FAILURE"|""|"I"|"00000274AWJO"|"904395"|"Y"|"20101001"|"420.90"|"262"|"584.9"|"576.1"|"428.0"|"287.5"|"427.31"|"562.11"|"574.51"|"2200"|"20110318"|""|"20110318"|"DRG"|"2200"|"Acute Care Inpatient R&B" 
"S00159962"|"SUBENDO INFRC-INIT EPISD"|""|"I"|"00000264B0NN"|"900003"|"Y"|"20100921"|"441.4"|"V45.89"|""|""|""|""|""|""|""|"6120"|"20110401"|""|"20110401"|"APC"|"6120"|"Radiology-MF POS"
"S00255303"|"BENIGN NEOPLASM LG BOWEL"|""|"I"|"00000264B0NN"|"900003"|"Y"|"20100921"|"441.4"|"V45.89"|""|""|""|""|""|""|""|"6160"|"20110401"|""|"20110401"|"APC"|"6160"|"Hospital Pharmacy-Drug-MF POS"
"S00012361"|"DIABETES MELLITUS"|""|"I"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"1520"|"20110401"|""|"20110401"|"APC"|"1520"|"Observation Room-Fac"
"S00052174"|"ALZHEIMER'S DISEASE"|""|"I"|"00000253AOUF"|"900003"|"Y"|"20100910"|"786.50"|"496"|"562.10"|"V12.72"|"600.00"|"285.9"|"564.00"|""|""|"5135"|"20110401"|""|"20110401"|"APC"|"5135"|"Prev: Smoking Cessation"

I tried this using the following awk code but no luck.

Tried code :

awk -F, 'NR==FNR { a[$2]=$0; next  }
           a[$2] { print $0, a[$2] }
        ' file2 file1 > newfile

Any help will be appreciated!!

In what way did it not work?

It created an empty output file

(rk3388) => awk -F, 'NR==FNR { a[$2]=$0; next  }
>            a[$2] { print $0, a[$2] }
>         ' file2 file1 > newfile
(rk3388) => more newfile
(rk3388) =>

On third look, I see a bug in that:

-F,

That tells it to use , as the separator.

Your file needs -F'|' instead.

1 Like

I have code in c, cpp, and python that does this, but I have never tried it in something like awk. Of course, my cpp code is like 300 lines and the interperter version is probably one line. I would be happy to post the src code if you want to compile it and give it a try. You specify the two files and the field to merge on (index). It first counts the number of lines in both files to make sure they are the same (throws exception if they are not), then it compares line by line to make sure the the merge field is the same for both files. If the fields match, the lines are merged for output. The index col is only included once in the output, meaning it's skipped in the output of the second file.

LMHmedchem

Thanks LMHmedchem But I need to do this in awk or sed.

Also found another way of doing it , incase it is helpful for others.

awk -v input_file=file1 'BEGIN { FS="|"; while((getline<input_file)>0){sub1=substr($0,1,11); subArr[sub1]=""}}{ if ($1 in subArr) print $0;}' file2  > file3

.

1 Like

I used a hard language because I needed some specific exception handling and a variety of checks and such. I also have a version of this where one file is always bigger than the other and the code searches the bigger file for the matching row. I am sure there are ways to do this kind of thing with stream interpreters, but that's well past what I know how to do.

Thanks for the post, it is always nice to see a different way. It's terribly funny that these three lines do, in essence, what it takes me 300 lines to do in cpp, plus your version will work on most any os without needing to be recompiled.

LMHmedchem