Dear all,
I have data like the sample below (400,000 rows) and I want to extract the rows that contain certain strings, using the code below. It works when there are not too many strings (for example fewer than 5,000), but I have 90,000 strings to extract, and the egrep code below gives this error:
error:
/usr/bin/egrep: Argument list too long
data example:
ILMN_167228 9.523 1.599 8.518
ILMN_168228 8.823 2.599 8.518
ILMN_169228 8.023 3.599 8.518
ILMN_1751228 8.423 4.599 8.518
ILMN_7751228 8.323 5.599 8.518
ILMN_1881228 8.223 8.599 8.518
...
code example:
egrep '(ILMN_2258774|ILMN_1700477|...|ILMN_1805992)' test1>test2
I got the error because I have too many strings (n=80,000) to extract:
/usr/bin/egrep: Argument list too long
Does anyone know how to fix it, or another way to handle this? Thank you.
You could try putting those strings in a file, like so:
ILMN_2258774
ILMN_1700477
...
ILMN_1805992
Then you can extract like so:
grep -f stringfile test1>test2
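A minimal, self-contained sketch of the pattern-file approach, using a few demo rows (in practice the stringfile would hold all 90,000 IDs, one per line):

```shell
# Tiny demo data -- filenames match the thread's example.
printf 'ILMN_167228 9.523 1.599 8.518\nILMN_168228 8.823 2.599 8.518\nILMN_169228 8.023 3.599 8.518\n' > test1
printf 'ILMN_167228\nILMN_169228\n' > stringfile

# -f reads one pattern per line from the file, so the patterns never
# appear on the command line and "Argument list too long" cannot occur.
grep -f stringfile test1 > test2
cat test2
```

Only the two listed IDs survive in test2; the number of patterns in the file is limited only by memory, not by the kernel's argument-length limit.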
For accuracy it would be better to use anchoring, by putting a single space after each of the strings (ILMN_ is unique enough that there does not need to be a ^ in front), to avoid possible false positives from substring matches, unless all strings have the same length:
ILMN_2258774
ILMN_1700477
...
ILMN_1805992
(each pattern line above ends with a trailing space)
--
On Solaris use /usr/xpg4/bin/grep
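One possible refinement, not from the thread: if the string list contains no regex metacharacters, the trailing space can be appended with sed and the whole set matched as fixed strings with grep -F, which is typically faster than regex matching for large pattern sets. A sketch with made-up demo data:

```shell
# Demo input: the second row's ID is a superstring of the first row's,
# to show why the trailing-space anchor matters.
printf 'ILMN_167228 9.523 1.599 8.518\nILMN_1672280 1.000 2.000 3.000\n' > test1
printf 'ILMN_167228\n' > stringfile

# Append one space to every pattern so each string only matches a
# complete first field (assumes the data file is space-separated).
sed 's/$/ /' stringfile > stringfile.anchored

# -F treats the patterns as fixed strings, not regular expressions.
grep -F -f stringfile.anchored test1 > test2
cat test2
```

Without the anchor, the pattern ILMN_167228 would also match the ILMN_1672280 row as a substring.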
There might also be a performance problem, because grep may do the comparisons one after the other, like a loop would, for each input line. Anchoring makes each comparison only a little faster.
If you have a plain string-list file without regex wildcards and without spaces, while your main file is space separated and your strings should match the first field, then a hash lookup is much faster. With awk:
awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2
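To illustrate why this works: while awk reads the first file (NR==FNR), it stores each ID as a key of array A; for the second file it prints a line only when its first field is a key of A, a constant-time hash lookup per line. A runnable demo with a few made-up rows:

```shell
# Demo files -- names follow the thread's example.
printf 'ILMN_167228\nILMN_169228\n' > stringfile
printf 'ILMN_167228 9.523 1.599 8.518\nILMN_168228 8.823 2.599 8.518\nILMN_169228 8.023 3.599 8.518\n' > test1

# First file: store IDs as array keys. Second file: print matching rows.
awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1 > test2
cat test2
```

Because $1 must equal a key exactly, substring false positives cannot occur, and each of the 400,000 data lines costs one lookup instead of up to 90,000 comparisons.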
Thank you guys. I tried the code below and it works.
awk 'NR==FNR {A[$1]; next} ($1 in A)' stringfile test1>test2