Search substring in a column of file

Hi all,

I have 2 files, the first one containing a list of ids and the second one is a master file. I want to search each id from the first file from the 5th col in the second file. The 5th column in master file has values separated by ';', if not a single value is present.

Each id must occur just once as a sub-string of $5 in the master file (which is a 20GB file).
I realize that I can use

grep -m 1 -f idfile masterfile

but it takes awfully long and it will give me all other strings that contain the id.
I can also do

awk -F"\t" ' { if ($5 ~ /one_value/) print $0} ' masterfile

but I do not know how to pass a file instead of just a single value in awk.Also it does not solve the problem of returning all strings that contain the id instead of just 1 that I want.

Can there be a smarter and faster way to do this? use join maybe?

idfile

1
2
3

masterfile

a b c d 12;34;1 e
d e f g 67;2;567 h
i j k l 3;123 m
n o p q 321;231 r
s t u v 1223 x

output

1 a b c d 12;34;1 e
2 d e f g 67;2;567 h
3 i j k l 3;123 m

Why are you setting -F"\t" in your awk script when there aren't any tabs in your input files?

The following seems to produce the output you want for your sample input:

awk '
NR == FNR {
        id[$1]
        next
}
{       n = split($5, x, /;/)
        for(i = 1; i <= n; i++)
                if(x in id)
                        print x, $0
}' idfile masterfile

If you want to try this on a Solaris/SunOS system, change awk to nawk , /usr/xpg4/bin/awk , or /usr/xpg6/bin/awk .

2 Likes

Or this string trick

awk '
NR == FNR {
        id[";"$1";"]
        next
}
{      x = ";"$5";"
        for (i in id)
                if (index(x,i))
                        print i, $0
}' idfile masterfile

Would be interesting if this is faster than Don's one. Or if index(x,i) is faster than x ~ i ...

1 Like