I have a file with about 50k keywords, and I need to scan about 3k files to identify which keywords appear in which files, i.e. an output listing each keyword against the filenames that contain it.
I have written a shell script which takes each of the 3k files and greps for each keyword in turn; once it reaches the end of the keyword list it moves on to the next file (sketched below).
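For reference, the per-keyword loop looks roughly like this (a minimal sketch of the approach described above; keywords.txt and data/*.txt are placeholder names for the actual keyword file and data files):

    #!/bin/ksh
    # One grep invocation per keyword per file:
    # 50,000 keywords x 3,000 files = 150,000,000 passes in total.
    for file in data/*.txt; do
        while read -r keyword; do
            # -q suppresses output; report the pair when the keyword is found
            grep -q -- "$keyword" "$file" && print "$keyword: $file"
        done < keywords.txt
    done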
I would like to know whether there is a more efficient way of carrying out this operation.
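One commonly suggested way to cut the pass count, assuming the keywords are fixed strings rather than regular expressions and that GNU grep is available: grep -f reads the entire keyword list once, and -F matches the patterns as fixed strings, so each data file is scanned in a single pass. A minimal sketch using the same placeholder names:

    #!/bin/ksh
    # One grep invocation per file: 3,000 passes instead of 150,000,000.
    # -o prints each matching keyword; sort -u collapses repeated hits.
    for file in data/*.txt; do
        grep -Fo -f keywords.txt -- "$file" | sort -u |
        while read -r keyword; do
            print "$keyword: $file"
        done
    done

Add -w to the grep if the keywords must match only as whole words; without it a short keyword will also match inside longer words.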
What Operating System and version do you have? -- Linux .....2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386 GNU/Linux
What Shell do you prefer? -- /bin/ksh
How big is the keywords file? -- 50k keywords (each keyword up to 30-40 characters)
How big is the total of the 3k data files? -- Each file has about 300-400 lines
Are these all normal unix text files with a reasonable record size? -- All text files
You appear to be attempting 150,000,000 serial file passes (50,000 x 3,000) -- Yes
Is this a one-off or something which will be run again and again? -- Not a one-off. This will be done regularly
Do you have a full-works database engine such as Oracle? -- If it is found to be more efficient, we can get a database engine such as Oracle
This is too complex to consider without some detailed knowledge of the record structure and what you are trying to achieve. Trying to build a relational database out of flat files and shell scripts is asking for trouble. This really needs a Systems Analyst.