building output file from multiple input files

ppucci · August 3, 2009, 9:37pm

Hi there,

I am trying to figure out a way to combine multiple sources with different data on a single file, and I am trying to find the best way to do it.

I have multiple files, let's say A, B, C and D. A has a field in common with B, B has a field in common with C, and C has a field in common with D.

I want the output file to have all records from file A, and to pull the fields from the other sources, for example

 
file A
2
3
 
file B (field 1 is common to A field 1)
1 10
2 20
3 30
4 40
 
file C (field 1 is common to B field 2)
10 abc
20 def
30 ghi
40 jkl
 
file D (field 1 is common to C field 2)
abc Cat
def Bird
ghi Dog
xyz Fish

The desired output file would contain

 
2 20 def Bird
3 30 ghi Dog

As I am new to this, I was first thing about running a mix of nested while read's and greps and echoing the output to a file. Is there any cleaner way to do this? Each file can be really big, and I think nesting while read's can make it slow.

Any tips will be highly appreciated!

Franklin52 · August 4, 2009, 3:32am

It's complicated to do this in one awk command with many arrays, it's more clarifying to split this in more steps:

awk 'NR==FNR{a[$1];next}
$1 in a
' fileA fileB > temp

awk 'NR==FNR{a[$2]=$0;next}
$1 in a {print a[$1], $2}
' temp fileC > temp1

awk 'NR==FNR{a[$3]=$0;next}
$1 in a {print a[$1], $2}
' temp1 fileD

rm temp temp1

ppucci · August 4, 2009, 6:56am

Thank you for your idea! Really it is not necessary to be a one-liner... just need something simple.

So far I was trying something like this:

while read field1; do
    B2='cat fileB | nawk -v fi=$field1 '{ if ( $1 == fi ) print $2 }''
    C2='cat fileC | nawk -v fi=$B2 ' { if ( $2 == fi ) print $1 }''
    D2='cat fileD | nawk -v fi=$C2 ' { if ( $2 == fi ) print $1 }''
    echo "$field1 $B2 $C2 $D2" >> output
done < fileA

(may not be actual commands, typing from memory)

but something was broken... and I had to sleep, so I'll go back at it again today. Anyway, this is not very good as it will parse all files several times, one for each record in fileA.

Regards!