Issue when using egrep to extract too many strings

Dear all,
I have data like the example below (400,000 rows), and I want to extract the rows containing certain strings using the code below. It works as long as there are not too many strings (for example, fewer than 5,000), but I have 90,000 strings to extract. With the egrep command below, I get this error:

error:

 /usr/bin/egrep: Argument list too long

data example:

 ILMN_167228 9.523 1.599 8.518
 ILMN_168228 8.823 2.599 8.518
 ILMN_169228 8.023 3.599 8.518
 ILMN_1751228 8.423 4.599 8.518
 ILMN_7751228 8.323 5.599 8.518
 ILMN_1881228 8.223 8.599 8.518

...

code example:

 egrep '(ILMN_2258774|ILMN_1700477|...|ILMN_1805992)' test1>test2

I get this error because I have too many strings (n=80,000) to extract.

Does anyone know how to fix this, or another way to handle it? Thank you.

You could try putting those strings in a file, like so:

ILMN_2258774
ILMN_1700477
...
ILMN_1805992

Then you can extract like so:

grep -f stringfile test1>test2
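
If the patterns currently live in one long |-separated list, as in the original egrep command, one way to build that string file is to translate the separators into newlines. This is just a sketch, assuming a hypothetical file patterns.txt that holds the original (...|...) list:

# hypothetical: patterns.txt holds the (ILMN_...|ILMN_...) list from the egrep command
tr -d '()' < patterns.txt | tr '|' '\n' > stringfile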

For accuracy it would be better to use anchoring, by putting a single space after each of the strings (ILMN_ is unique enough that there does not need to be a ^ in front), to avoid possible false positives from substring matches, unless all strings have the same length:

ILMN_2258774 
ILMN_1700477 
...
ILMN_1805992 
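
Alternatively, if your grep supports these options (GNU and BSD grep do), -F treats the patterns as fixed strings rather than regular expressions, and -w matches whole words only. Used with the plain string file (no trailing spaces), this avoids the substring false positives without editing the file, and is typically much faster:

grep -Fwf stringfile test1>test2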

--
On Solaris use /usr/xpg4/bin/grep


There might also be a performance problem, because grep may compare the patterns one after the other, like a loop would, for each input line; anchoring makes each comparison only a little faster.
If you have a plain string-list file without RE wildcards and without spaces, while your main file is space-separated and your strings should match the first field, then a hash lookup is much faster. With awk:

awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2
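
Written out with comments, that one-liner does the following (functionally identical):

awk '
  NR==FNR { A[$1]; next }   # first file (stringfile): store each string as an array key
  ($1 in A)                 # second file (test1): print lines whose first field is a key
' stringfile test1>test2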

Thank you guys. I tried the code below and it works.

awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2