Grep multiple strings in multiple files

xshang · November 5, 2012, 12:01pm

Hi, everyone!

I have a file with multiple strings.
file1

ATQRGNE
ASQGVKFTE
ASSQYRDRGGLET
SPEQGARSDE
ASSRDFTDT
ASSYSGGYE
ASSYTRLWNTGE
ASQGHNTD
PSLGGGNQPQH
SLDRDSYNEQF

I want to grep for each string in hundreds of files in the same directory. Further, I want to find the strings that occur in more than 50 files.

Can I use grep inside awk to do that? I tried, but failed to pass each string to grep.

Thanks in advance!

Yoda · November 5, 2012, 12:11pm

I don't recommend this approach, but I hope it works:-

grep -f search_file * | awk -F":" ' { print $NF } ' | sort | uniq -c | awk ' { if($1>=50) print; } '

Note: search_file is the file with multiple strings that you mentioned.

pamu · November 5, 2012, 12:19pm

try something like this..

It may take some time.

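# All_file_list holds one file name per line; search_file holds the strings.
while read -r line
do
    c=0
    while read -r file
    do
        # -q prints nothing; count the file only if it contains the string
        grep -q -- "$line" "$file" && c=$((c + 1))
    done < All_file_list
    if [[ "$c" -gt 50 ]]
    then
        echo "$line found in $c files"
    fi
done < search_file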

xshang · November 5, 2012, 12:22pm

bipinajith:

I don't recommend this approach, but I hope it works:-
grep -f search_file * | awk -F":" ' { print $NF } ' | sort | uniq -c | awk ' { if($1>=50) print; } '
Note: search_file is the file with multiple strings that you mentioned.

Thanks!

Yes, this approach might not be efficient. There are millions of strings in my file and hundreds of files to be searched. Do you have any other smart suggestions?

Thank you very much.

Yoda · November 5, 2012, 12:30pm

I am not sure if I can recommend another smart approach, but I believe my approach will run much faster than pamu's script since he/she is using looping structures.

Did you try running both and check whether you are getting the desired results?

xshang · November 5, 2012, 12:31pm

Yes, I'm working on that. Thank you. I will respond later.

vgersh99 · November 5, 2012, 12:33pm

grep -f search_file * | awk -F: '{a[$2]++} END {for (i in a) if (a[i]>=50) print i}'

xshang · November 5, 2012, 1:01pm

Nice, Thank you!

---------- Post updated at 02:01 PM ---------- Previous update was at 01:44 PM ----------

It works! vgersh99 gives a similar solution. Both of your solutions run faster.

However, what I want exactly is each string together with the number of files it appears in.
I'm studying your code and trying to figure out how I can get what I want.
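
One possible way to get each string together with the number of files it appears in is sketched below. It is not taken from any of the replies above and rests on some assumptions: the search strings are literal text (so `grep -F` is appropriate), none of them contains a colon, and your grep supports `-o` (GNU and BSD grep do, but it is not required by POSIX). The scratch directory, the sample files `f1` to `f3`, and the threshold variable `min` are made up for the demonstration; with real data you would set `min=50` and run the pipeline in your own directory.

```shell
#!/bin/sh
# Hypothetical demo data: a search list and three small data files.
dir=$(mktemp -d)
printf '%s\n' ATQRGNE ASQGVKFTE ASSRDFTDT > "$dir/search_file"
mkdir "$dir/data"
printf '%s\n' 'xxATQRGNExx' 'ATQRGNE again' > "$dir/data/f1"
printf '%s\n' 'ASQGVKFTE' 'ATQRGNE' > "$dir/data/f2"
printf '%s\n' 'nothing here' > "$dir/data/f3"

min=2   # report strings found in at least this many files (50 in the thread)

result=$(
  cd "$dir/data" &&
  # -o prints only the matched string, -F treats patterns as fixed strings,
  # -f reads them from the search list. The extra /dev/null argument makes
  # grep prefix every match with "filename:" even if only one file exists.
  grep -oFf ../search_file -- * /dev/null |
  awk -F: -v min="$min" '
    # Count each file/string pair once, however often the string repeats.
    !seen[$0]++ { files[$NF]++ }
    END { for (s in files) if (files[s] >= min) print s, files[s] }
  ' | sort
)
printf '%s\n' "$result"

rm -rf "$dir"
```

One caveat: with `-o`, grep reports non-overlapping matches, so if one search string is contained in another, the shorter one may be undercounted on lines where both occur. An empty line in the search list would also match everything, so it is worth removing blank lines first.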