Awk or Perl - to selectively merge two files.

gull04 · July 4, 2012, 7:06am

I have two files, these have to be selectively merged into two other files. In addition there will require to be a edit to the last field, where the date format is changed.

The first file is a csv file with around 300k lines the data is now nearly 20 years old and I have been asked to move this to a more up to date format, it could probably be done using awk or perl and I'm quite happy to admit that it is probably beyond my limited scripting ability.

N1,GULFSTREAM AEROSPACE,G-IV,1071,FEDERAL AVIATION ADMINISTRATION,10/26/92
N1A,GOODYEAR,GZ-20A,4117,GOODYEAR TIRE & RUBBER CO,2/24/92
N1AA,CESSNA,414A,414A0482,RAGSDALE INVESTMENTS INC,4/17/91
N1AB,BRITISH AEROSPACE,BAE 125 SERIES 1000,NA1000,ALEXANDER & BALDWIN INC,6/30/92
N1AC,EAA SPORT BIPLANE,HOME BUILT,3138,EAA AVIATION FOUNDATION INC,3/31/92

The second file is a lookup file with around 75K lines, the example below has been chosen to fit.

3870219,GOODYEAR,GZ-20
3870220,GOODYEAR,GZ-20A
3980110,GULFSTREAM AEROSPACE,G1159B
3980115,GULFSTREAM AEROSPACE,G-IV
3980116,GULFSTREAM AEROSPACE,G-V
3980117,GULFSTREAM AEROSPACE,G-IV SP
2072414,CESSNA,172F
2073414,CESSNA,210D
2075907,CESSNA,414A
2075908,CESSNA,414
05605PE,CREATURA JOHN A,EAA SPORT BI P2
33601K9,EAA SPORT BIPLANE,HOME BUILT
33602TR,HARTZELL,EAA SPORT BIPLANE P2
3360339,ARLAND WARREN/DAVE,EAA SPORT BIPLANE

Each line in file one, should have an exact match on two fields in file two - however some experimentation has shown that this is not the case. So in order to see how bad the data actually is I also would need a file where there is no match.

So the expected output would be for likes with a match, field2 matches field2 and field3 matches field3. Replace field2 and field3 in file1 with field1 from file2

N1,3980115,1071,FEDERAL AVIATION ADMINISTRATION,1992-10-26
N1A,3870220,4117,GOODYEAR TIRE & RUBBER CO,1992-02-24
N1AA,2075907,414A0482,RAGSDALE INVESTMENTS INC,1991-04-17
N1AC,33601K9,3138,EAA AVIATION FOUNDATION INC,1992-03-31

Lines that have no match should just be output to an other file, without any modification.

N1AB,BRITISH AEROSPACE,BAE 125 SERIES 1000,NA1000,ALEXANDER & BALDWIN INC,6/30/92

If there is anyone out there who feels like a stab at this I'd be gratefull for any help or even just a pointer.

Regards

Dave

bartus11 · July 4, 2012, 7:21am

This should give you matched data in "match" and data without match in "no_match". It doesn't modify date field.

awk -F, -vOFS="," 'NR==FNR{a[$2","$3]=$1;next}$2","$3 in a{$2=a[$2","$3];$3=$4;$4=$5;$5=$6;NF=5;print;next}{print $0 > "no_match"}' lookup.csv data.csv > match

elixir_sinari · July 4, 2012, 8:03am

Slightly tweaked version of bartus11's script (which assumes the number of fields in the data file to be 6):

awk -F, 'BEGIN{OFS=FS} NR==FNR{a[$2$3]=$1;next} $2$3 in a{$2=a[$2$3];for(i=3;i<NF;i++) $i=$(i+1);NF-=1;print > "match";next} {print $0 > "no_match"}' lookup.csv data.csv

Scrutinizer · July 4, 2012, 8:36am

Another slight tweak:

awk -F, '{n=$2 "," $3} NR==FNR{A[n]=$1; next} !(n in A){print > "nomatch"; next} sub(n,A[n])' file2 file1 > match

gull04 · July 4, 2012, 8:56am

Hi Guys,

Thanks for the replies, I have just tried the first two. The output from the awk that bartus gave me is as follows;

[~/data/incomming]
(12:05:23)-(davem)-(1319)-> cat no_match
3870219,GOODYEAR,GZ-20
3980110,GULFSTREAM AEROSPACE,G1159B
3980116,GULFSTREAM AEROSPACE,G-V
3980117,GULFSTREAM AEROSPACE,G-IV SP
2072414,CESSNA,172F
2073414,CESSNA,210D
2075908,CESSNA,414
05605PE,CREATURA JOHN A,EAA SPORT BI P2
33602TR,HARTZELL,EAA SPORT BIPLANE P2
3360339,ARLAND WARREN/DAVE,EAA SPORT BIPLANE

[~/data/incomming]
(12:05:26)-(davem)-(1319)-> cat match
3870220,N1A,,,
3980115,N1,,,
2075907,N1AA,,,
33601K9,N1AC,,,

From elixir the output is exactly as I wanted it to look - like;

[~/data/incomming]
(13:02:09)-(davem)-(1325)-> cat match
N1,3980115,1071,FEDERAL AVIATION ADMINISTRATION,10/26/92
N1A,3870220,4117,GOODYEAR TIRE & RUBBER CO,2/24/92
N1AA,2075907,414A0482,RAGSDALE INVESTMENTS INC,4/17/91
N1AC,33601K9,3138,EAA AVIATION FOUNDATION INC,3/31/92

[~/data/incomming]
(13:02:15)-(davem)-(1326)-> cat no_match
N1AB,BRITISH AEROSPACE,BAE 125 SERIES 1000,NA1000,ALEXANDER & BALDWIN INC,6/30/92

Have yet to checkout your bit of awk scruitinizer, but I will. Thanks again to you all - you probably saved me a lot of grief.

Regards

Dave

Scrutinizer · July 4, 2012, 9:00am

With Bartus' suggestion, I think you would need to switch the order of your input files..

gull04 · July 4, 2012, 9:01am

Hi Bartus,

Just had a Doh moment - transposed the file names in the command - it works spot on - just going to bang head on wall for a few minutes.

Regards

Dave

gull04 · July 6, 2012, 7:27am

Hi Folks,

Just a final update on this for anyone that is interested, the code below is as used to correct the layout of the date field from the old format to the required one - just incase anyone should find it useful.

$> nawk -F, '
> BEGIN { OFS = "," }
>
> $6 ~ "/"  {
>         split($6, a, "/")
>         $6 = sprintf("19%s-%s-%s",
>                 sprintf("%02d", a[3]),
>                 sprintf("%02d", a[1]),
>                 sprintf("%02d", a[2]) )
>
>         print
> }' < test_file
N1,GULFSTREAM AEROSPACE,G-IV,1071,FEDERAL AVIATION ADMINISTRATION,1992-10-26
N1A,GOODYEAR,GZ-20A,4117,GOODYEAR TIRE & RUBBER CO,1992-02-24
N1AA,CESSNA,414A,414A0482,RAGSDALE INVESTMENTS INC,1991-04-17
N1AC,EAA SPORT BIPLANE,HOME BUILT,3138,EAA AVIATION FOUNDATION INC,1992-03-31

$> cat te*
N1,GULFSTREAM AEROSPACE,G-IV,1071,FEDERAL AVIATION ADMINISTRATION,10/26/92
N1A,GOODYEAR,GZ-20A,4117,GOODYEAR TIRE & RUBBER CO,2/24/92
N1AA,CESSNA,414A,414A0482,RAGSDALE INVESTMENTS INC,4/17/91
N1AC,EAA SPORT BIPLANE,HOME BUILT,3138,EAA AVIATION FOUNDATION INC,3/31/92

Again guys thanks for the help.

Regards

Dave