Find common Strings in two large files

kanthrajgowda · December 20, 2010, 9:40am

Hi,
I have a text file in the format

DB2: [NodeID=1]
DB2: [NodeID=2]
WB: [NodeID=3]
WB: [NodeID=3]
WB: [NodeID=3]
WB: [NodeID=3]

and a second text file of the format

Time=00:00:00.473 [NodeID=3]
Time=00:00:00.436 [NodeID=3]
Time=00:00:00.016 [NodeID=2]
Time=00:00:00.027 [NodeID=1]
Time=00:00:00.471 [NodeID=3]
Time=00:00:00.436 [NodeID=3]

The last string in both text files is of the form NodeID=*

I want to combine the lines from the two files where that last string matches,
something like

DB2: [NodeID=1] Time=00:00:00.027 [NodeID=1]

Could you please suggest how to do this?

NOTE: the actual size of the text files runs into GBs...

Thanks in advance.

anurag.singh · December 20, 2010, 10:00am

 
awk 'NR==FNR{a[$2]=$0;next;}{print $0,a[$2]}' secondFile firstFile

kanthrajgowda · December 20, 2010, 10:49am

Anurag,
Thanks for the quick response. As I am a beginner in shell, I need help understanding the following

awk 'NR==FNR{a[$2]=$0;next;}{print $0,a[$2]}' secondFile firstFile

In the above code

What do {a[$2]=$0;next;}, $2 and $0 stand for?

So that I can modify your script to make it work...

Thanks

anurag.singh · December 20, 2010, 11:14am

Here is the explanation of the above command (go through any awk tutorial to get a basic idea of how awk works).
awk processes the input file line by line. In each line, $0 represents the whole line, $1 the 1st field, $2 the 2nd field and so on. The default delimiter is whitespace (spaces/tabs).

echo "abc def ghi" | awk '{print $0}'

prints the whole line

abc def ghi

echo "abc def ghi" | awk '{print $1}'

prints the 1st field

abc

echo "abc def ghi" | awk '{print $2}'

prints the 2nd field

def

When more than one file is given to awk,

 
NR==FNR

is true only while awk reads the 1st file (provided that file is not empty). FNR is the record number within the current file; NR is the total number of records awk has read so far across all files (a cumulative count).
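To see the two counters side by side, here is a small sketch. The file names one.txt and two.txt and their contents are made up for the illustration; they are not from the question.

```shell
# Two small throwaway files (hypothetical names and contents).
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\ne\n' > "$tmpdir/two.txt"

# For every record print: file name, NR, FNR and the line itself.
# The subshell changes directory so FILENAME stays short.
nr_demo=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR, $0}' one.txt two.txt)
printf '%s\n' "$nr_demo"

rm -r "$tmpdir"
```

NR and FNR are equal on the two lines of one.txt. From the first line of two.txt onwards FNR restarts at 1 while NR keeps counting (3, 4, 5), so NR==FNR is false there.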

 
NR==FNR{a[$2]=$0;next;}

This block executes only for the 1st file on the command line, which is secondFile here (since NR==FNR is true only for that file). It stores the whole record in the array a, indexed by the 2nd field, i.e. [NodeID=1], [NodeID=2] or [NodeID=3]. The next command skips the remaining blocks and moves on to the next input line.
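A quick way to check what ends up in the array is to fill it the same way and dump it at the end. This is only a sketch for inspection, using the second file from the question; the variable name lookup is made up.

```shell
tmpdir=$(mktemp -d)

# secondFile exactly as posted in the question.
cat > "$tmpdir/secondFile" <<'EOF'
Time=00:00:00.473 [NodeID=3]
Time=00:00:00.436 [NodeID=3]
Time=00:00:00.016 [NodeID=2]
Time=00:00:00.027 [NodeID=1]
Time=00:00:00.471 [NodeID=3]
Time=00:00:00.436 [NodeID=3]
EOF

# Fill the array the same way the NR==FNR block does, then print every
# key with its stored line. "for (k in a)" visits keys in no particular
# order, hence the sort.
lookup=$(awk '{a[$2]=$0} END{for (k in a) print k, "->", a[k]}' "$tmpdir/secondFile" | LC_ALL=C sort)
printf '%s\n' "$lookup"

rm -r "$tmpdir"
```

There is one entry per distinct NodeID. Each assignment overwrites the previous one for the same key, so [NodeID=3] keeps only the last of its four lines. It also means the array grows with the number of distinct NodeIDs rather than with the size of the file.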

{print $0,a[$2]}

This block executes for the 2nd file ONLY (firstFile here).
Here $0 is the whole current record in the 2nd file and $2 is its 2nd field (i.e. the NodeID). a[$2] is the line that was stored for that NodeID while processing the 1st file; if nothing was stored for it, a[$2] is an empty string.
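Putting it together on the sample data from the question, as a sketch. The extra line XX: [NodeID=9] is made up to show what happens to a line with no partner, and the second awk command is only a possible variant (an assumption about what is wanted), not part of the original answer.

```shell
tmpdir=$(mktemp -d)

# firstFile: the six lines from the question, plus one made-up line
# (NodeID=9) that has no partner in secondFile.
cat > "$tmpdir/firstFile" <<'EOF'
DB2: [NodeID=1]
DB2: [NodeID=2]
WB: [NodeID=3]
WB: [NodeID=3]
WB: [NodeID=3]
WB: [NodeID=3]
XX: [NodeID=9]
EOF

# secondFile: the six lines from the question, unchanged.
cat > "$tmpdir/secondFile" <<'EOF'
Time=00:00:00.473 [NodeID=3]
Time=00:00:00.436 [NodeID=3]
Time=00:00:00.016 [NodeID=2]
Time=00:00:00.027 [NodeID=1]
Time=00:00:00.471 [NodeID=3]
Time=00:00:00.436 [NodeID=3]
EOF

# The command from the thread: every line of firstFile is printed,
# unmatched ones followed by an empty string.
joined=$(awk 'NR==FNR{a[$2]=$0;next;}{print $0,a[$2]}' "$tmpdir/secondFile" "$tmpdir/firstFile")

# Variant: key on the last field ($NF) and print only the lines whose
# key was seen in secondFile.
matched=$(awk 'NR==FNR{a[$NF]=$0;next} ($NF in a){print $0,a[$NF]}' "$tmpdir/secondFile" "$tmpdir/firstFile")

printf '%s\n' "$matched"
rm -r "$tmpdir"
```

The first printed line is DB2: [NodeID=1] Time=00:00:00.027 [NodeID=1], as asked for in the question. All four WB lines are paired with Time=00:00:00.436, because only the last NodeID=3 line of secondFile is kept in the array. The original command prints 7 lines including the unmatched XX line; the variant prints 6.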

kanthrajgowda · December 21, 2010, 3:38am

Anurag,
Thanks for all the support. The scripts are now able to deliver. Thanks a TON!