How to match in 2 files and generate 3rd file?

reddyr · July 17, 2014, 2:32pm

Hello,

I have two tables (the first file is colon-separated, the second comma-separated), shown below.

Please note that the matching key (a kind of primary key) is a number and is NOT unique. It is the 2nd column in table1 and the 4th column in table2.

# cat table1
vgbpjdata1:80
vgbpjdata2:50
vgbpjdata3:50
vgbpjdata4:80

# cat table2
vpath50,c54t4d2,52428800,50
vpath51,c40t4d3,83886080,80
vpath52,c40t4d0,83886080,80
vpath56,c36t4d1,52428800,50

# cat output_file
MY_CMD vgbpjdata1 /dev/dsk/c40t4d3
MY_CMD vgbpjdata2 /dev/dsk/c54t4d2
MY_CMD vgbpjdata3 /dev/dsk/c36t4d1
MY_CMD vgbpjdata4 /dev/dsk/c40t4d0

Please help, thanks!

Chubler_XL · July 17, 2014, 3:05pm

You could use awk like this:

awk '
FS==","{ k[$4]=($4 in k? k[$4]"," : "") "/dev/dsk/" $2; next}
$2 in k {
  dev=k[$2]
  if (sub(",.*",x,dev)) sub(dev",",x,k[$2])
  else delete k[$2]
  print "MY_CMD",$1,dev
} ' FS=, table2 FS=: table1

reddyr · July 17, 2014, 3:17pm

Brilliant, worked great. Thanks a million.

RudiC · July 17, 2014, 3:39pm

Depending on the awk version, this may not work. With mawk on my Linux system, the result is:

MY_CMD vgbpjdata1 
MY_CMD vgbpjdata2 
MY_CMD vgbpjdata3 /dev/dsk/c54t4d2
MY_CMD vgbpjdata4 /dev/dsk/c40t4d3

Also, how can you tell which data set will be selected if you don't have a unique key to refer to it? Should it be by order of appearance (as in Chubler_XL's proposal), or is that sheer coincidence?

Chubler_XL · July 17, 2014, 4:17pm

Thanks RudiC, I was cringing a bit when I wrote that assignment, but it seemed to work OK.

This should be a more portable variation:

awk '
FS==","{ if($4 in k) k[$4]=k[$4]","; k[$4]=k[$4] "/dev/dsk/" $2; next}
$2 in k {
  dev=k[$2]
  if (sub(",.*",x,dev)) sub(dev",",x,k[$2])
  else delete k[$2]
  print "MY_CMD",$1,dev
} ' FS=, table2 FS=: table1

or even this:

awk '
  FS==","{ m[$4,++k[$4]]="/dev/dsk/" $2; next}
  # u[key] counts the devices already handed out, so they come out in order of appearance
  ++u[$2]<=k[$2] {print "MY_CMD",$1, m[$2,u[$2]]} ' FS=, table2 FS=: table1
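
The counter-based idea can be spelled out with comments. This is only a sketch: it assumes devices should be handed out in the order they appear in table2, and the array names q, head and tail are illustrative, not part of the original post. It recreates the sample files from the first post so it can be run as is.

```shell
# Recreate the sample data from the first post.
printf '%s\n' vgbpjdata1:80 vgbpjdata2:50 vgbpjdata3:50 vgbpjdata4:80 > table1
printf '%s\n' vpath50,c54t4d2,52428800,50 vpath51,c40t4d3,83886080,80 \
              vpath52,c40t4d0,83886080,80 vpath56,c36t4d1,52428800,50 > table2

awk '
# table2 (comma-separated): append each device to the queue for its key.
FS == "," { q[$4, ++tail[$4]] = "/dev/dsk/" $2; next }
# table1 (colon-separated): take the oldest unused device for this key, if any is left.
++head[$2] <= tail[$2] + 0 { print "MY_CMD", $1, q[$2, head[$2]] }
' FS=, table2 FS=: table1 > output_file

cat output_file
```

Rows in table1 whose key has no remaining device in table2 are silently skipped; whether that is the desired behaviour is an assumption here.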

reddyr · July 17, 2014, 5:27pm

Thanks Chubler_XL and RudiC

It is NOT coincidence but the order of appearance (as shown in my sample output).

All three of Chubler_XL's solutions worked for me. I'm using the standard awk on HP-UX. I tested with a few samples and all worked fine. Could you please confirm which is the best and most likely to work on "all" samples?

Thanks again!

Chubler_XL · July 17, 2014, 5:38pm

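I'd avoid solution #1: the standard does not define the order in which the two sides of an assignment are evaluated, so some awk implementations, and possibly future updates to awk, could cause it to fail. If the left-hand side k[$4] is evaluated first, the element already exists by the time ($4 in k) is tested, and every list starts with a stray comma. Both #2 and #3 are fine; it's your preference, whichever you find easier to understand.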
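
To see which way a particular awk behaves, a small probe along these lines may help. It is only a sketch: the array name k and the two messages are made up for illustration, and the output will differ between implementations, which is exactly the point.

```shell
# Probe whether this awk creates k["a"] (left side) before it tests ("a" in k) (right side).
order=$(awk 'BEGIN {
  k["a"] = (("a" in k) ? "left side first" : "right side first")
  print k["a"]
}')
echo "$order"
```

If this prints "left side first", solution #1 would build lists with a leading comma on that awk, and the variations that test ($4 in k) in a separate statement are the safer choice.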