Comparing 2 files with awk and updating 2nd file

file1: (unique files)

1    /pub/atomicbk/catalog/catalog.gif 693
2    /pub/atomicbk/catalog/home.gif 813
3    /pub/atomicbk/catalog/logo2.gif 12871
4    /pub/atomicbk/catalog/sleazbk.html 18338

file2: (duplicate filenames allowed)

28/Aug/1995:00:00:38 1 /pub/atomicbk/catalog/home.gif 813
28/Aug/1995:00:00:39 1 /pub/atomicbk/catalog/catalog.gif 693
28/Aug/1995:00:00:40 1 /pub/atomicbk/catalog/logo2.gif 12871
28/Aug/1995:00:00:41 1 /pub/atomicbk/catalog/logo2.gif 12871
28/Aug/1995:00:00:42 1 /pub/atomicbk/catalog/sleazbk.html 18338
28/Aug/1995:00:00:43 1 /pub/atomicbk/catalog/catalog.gif 693

I have 2 files. file1 contains unique files, the 1st field being the FileID and the 2nd is the FileName. File2 contains the timestamp, operation type, FileName, and FileSize respectively.

Basically, what I need to do is to match the filenames of each file. If they match, I need to create a new column in file2 that stores the FileID (taken from 1st col from file1).

Basically, the resulting file2 should be this (new column is in the front):

2 28/Aug/1995:00:00:38 1 /pub/atomicbk/catalog/home.gif 813
1 28/Aug/1995:00:00:39 1 /pub/atomicbk/catalog/catalog.gif 693
3 28/Aug/1995:00:00:40 1 /pub/atomicbk/catalog/logo2.gif 12871
3 28/Aug/1995:00:00:41 1 /pub/atomicbk/catalog/logo2.gif 12871
4 28/Aug/1995:00:00:42 1 /pub/atomicbk/catalog/sleazbk.html 18338
1 28/Aug/1995:00:00:43 1 /pub/atomicbk/catalog/catalog.gif 693

I will be running this on very large files (upwards of 900,000-1,000,000 lines in file2) and (around 5000 lines in file1). So I need it to run as fast as possible. I've been struggling with this one, so I hope someone can help.

Thank you in advance!

I was thinking that maybe I could sort file2 by the 3rd column (cat file2 | sort -k 3) first. Then compare col2 in file1 with col3 in file2. Here is the pseudocode that I haven't been able to implement in awk.

Sort col3 in file2 in ascending order (cat file2 | sort -k 3)

while (i <= lastline_file2) {
if (file2.col3 == file1.col2)
     file2.newcol = file1.col1
     i++ //increment line in file2
else
     j++ //increment line in file1
}

Sort file2 back by timestamp
//Assuming the new column created in file2 was at the beginning

cat file2 | sort -k 2

Maybe there is a better way, but I'm not that great at awk yet.

Thanks!

Hi

awk 'NR==FNR{a[$2]=$1;next;}{if ($3 in a) print a[$3], $0;}'  file1 file2

Guru.

1 Like

Wow. 1 line. What can I say? That's awesome. Thanks for your help!