Awk+Grep: input file needs to match a column and print the entire line

poliver · March 13, 2009, 4:22pm

I have been struggling with this for a few days and cannot get it to work with a simple awk+grep script (or any other approach).

For example, I have an input file, inputfile1.txt:

cat inputfile1.txt

218299910417
1172051195
1172070231
1172073514
1183135117
1183135118
1183135119
1281440202

I need to match these numbers against two specific columns of another file, for example columns $3 and $4, using the pipe as the delimiter:

cat inputfile2.txt

AAAAA|DISTHOR1_U2|6981258207|218299910417|END
BBBBB|DISTHOR1_U2|6981118022|6981259131|END
FARFAR|DISTHOR1_U2|6981119404|1172070231|END
CCCCC|DISTHOR1_U2|1172073514|6981258793|END
BBBBB|DISTHOR1_U2|698515487|489498131|END

The expected result is an output file containing the lines of the second file whose third or fourth column matches an entry in the first file. In this case, the output file would be:

cat outputfile1.txt

AAAAA|DISTHOR1_U2|6981258207|218299910417|END
FARFAR|DISTHOR1_U2|6981119404|1172070231|END
CCCCC|DISTHOR1_U2|1172073514|6981258793|END

I was able to do this with the following command, but it searches the whole line rather than a specific column:

grep -f inputfile1.txt inputfile2.txt > outputfile1.txt

But this command takes over an hour, because inputfile1.txt has over 1600 records and inputfile2.txt has over one million records of 190 characters each, divided into 43 columns.
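
The difference between matching anywhere on the line and matching only columns 3 and 4 can be sketched in Python as follows. This is an illustration only, not the actual script: the function names are made up, the keys are a subset of the sample above, and the last sample line is hypothetical, added to show a key that appears outside the two columns.

```python
# Keys as they would appear in inputfile1.txt (subset of the sample above).
keys = ["218299910417", "1172070231", "1172073514"]

lines = [
    "AAAAA|DISTHOR1_U2|6981258207|218299910417|END",
    "BBBBB|DISTHOR1_U2|6981118022|6981259131|END",
    "FARFAR|DISTHOR1_U2|6981119404|1172070231|END",
    "CCCCC|DISTHOR1_U2|1172073514|6981258793|END",
    # Hypothetical line: a key appears inside column 1, not in columns 3 or 4.
    "X1172070231|DISTHOR1_U2|698515487|489498131|END",
]

def match_anywhere(line, keys):
    """Roughly what `grep -f` does here: a key matches anywhere in the line."""
    return any(key in line for key in keys)

def match_columns(line, key_set, columns=(2, 3), sep="|"):
    """Match only when one of the chosen fields equals a key exactly."""
    fields = line.split(sep)
    return any(i < len(fields) and fields[i] in key_set for i in columns)

key_set = set(keys)
anywhere = [l for l in lines if match_anywhere(l, keys)]
by_column = [l for l in lines if match_columns(l, key_set)]

print(len(anywhere))   # 4: includes the hypothetical line
print(len(by_column))  # 3: only lines with a key in column 3 or 4
```

In this sketch the whole-line check does one substring search per key per line, while the column check does two set lookups per line regardless of how many keys there are.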

Can someone help me with this?

Thanks

vgersh99 · March 13, 2009, 4:40pm

nawk -F'|' -v OFS='|' 'FNR==NR {f1[$0]; next} $3 in f1 || $4 in f1' inputfile1.txt inputfile2.txt

chihung · March 14, 2009, 12:00am

If performance is key, you may want to use Python

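#!/usr/bin/env python3

import sys

if len(sys.argv) != 3:
        print("Usage: %s <input1> <input2>" % sys.argv[0])
        sys.exit(1)

inputfile1 = sys.argv[1]
inputfile2 = sys.argv[2]

# Store the keys in a set so each lookup is a hash probe, not a list scan.
# Blank lines in the key file are ignored.
with open(inputfile1) as f:
        keys = {k.strip() for k in f if k.strip()}

# Print every line whose 3rd or 4th pipe-delimited field is a key.
with open(inputfile2) as f:
        for raw in f:
                line = raw.strip()
                fields = line.split("|")
                # Skip short lines instead of raising IndexError
                if len(fields) >= 4 and (fields[2] in keys or fields[3] in keys):
                        print(line)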

Please post the run time for your data set. I would also like to compare it with the AWK version (or nawk, the new AWK on Solaris).
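
As a rough way to see how much the lookup container can matter in a script like this, here is a small timing sketch. It rests on assumptions: the data is synthetic (random 10-digit numbers), only the 1600-key figure comes from the original post, and the batch of candidate values is kept far below a million so it runs quickly. The helper name count_hits is made up for the example.

```python
import random
import timeit

random.seed(0)  # fixed seed so the synthetic data is reproducible

# Synthetic stand-ins: 1600 keys, as in the original post, and a small
# batch of candidate field values (assumption: far fewer than a million).
keys_list = [str(random.randrange(10**9, 10**10)) for _ in range(1600)]
keys_set = set(keys_list)
candidates = [str(random.randrange(10**9, 10**10)) for _ in range(2000)]
candidates += keys_list[:100]  # make sure some lookups succeed

def count_hits(container):
    """Count how many candidate values are present in the container."""
    return sum(1 for value in candidates if value in container)

# Time the same membership tests against a list and against a set.
list_time = timeit.timeit(lambda: count_hits(keys_list), number=3)
set_time = timeit.timeit(lambda: count_hits(keys_set), number=3)

print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

A membership test on a list compares against elements one by one, while a set uses a hash lookup, so the gap is expected to widen as the key file grows. Actual numbers depend on the machine and the real data.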

poliver · March 16, 2009, 9:09am

Thank you, vgersh99, your tip works fine. The script used to take more than an hour to run, and now it takes less than 5 minutes.