I have a file with about 50k keywords, and I need to scan about 3k files to identify which keywords appear in which files, i.e. an output listing each keyword against the filenames that contain it.
I have written a shell script which takes each of the 3k files and greps for each keyword in turn; once it reaches the end of the keyword list it moves on to the next file (sketched below).
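For reference, the per-keyword loop looks roughly like this (a minimal sketch of the approach described above; keywords.txt and data/*.txt are placeholder names for the actual keyword file and data files):

    #!/bin/ksh
    # One grep invocation per keyword per file:
    # 50,000 keywords x 3,000 files = 150,000,000 passes in total.
    for file in data/*.txt; do
        while read -r keyword; do
            # -q suppresses output; report the pair when the keyword is found
            grep -q -- "$keyword" "$file" && print "$keyword: $file"
        done < keywords.txt
    done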
I would like to know whether there is a more efficient way of carrying out this operation.
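One commonly suggested way to cut the pass count, assuming the keywords are fixed strings rather than regular expressions and that GNU grep is available: grep -f reads the entire keyword list once, and -F matches the patterns as fixed strings, so each data file is scanned in a single pass. A minimal sketch using the same placeholder names:

    #!/bin/ksh
    # One grep invocation per file: 3,000 passes instead of 150,000,000.
    # -o prints each matching keyword; sort -u collapses repeated hits.
    for file in data/*.txt; do
        grep -Fo -f keywords.txt -- "$file" | sort -u |
        while read -r keyword; do
            print "$keyword: $file"
        done
    done

Add -w to the grep if the keywords must match only as whole words; without it a short keyword will also match inside longer words.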
What Operating System and version do you have? -- Linux .....2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386 GNU/Linux
What Shell do you prefer? -- /bin/ksh
How big is the keywords file? -- 50k keywords (each keyword up to 30-40 characters)
How big is the total of the 3k data files? -- Each file has about 300-400 lines
Are these all normal unix text files with a reasonable record size? -- All text files
You appear to be attempting 150,000,000 serial file passes (50,000 x 3,000) -- Yes
Is this a one-off or something which will be run again and again? -- Not a one-off. This will be done regularly
Do you have a full-works database engine such as Oracle? -- If it is found to be more efficient, we can get a database engine such as Oracle
This is too complex to consider without some detailed knowledge of the record structure and what you are trying to achieve. Trying to build a relational database out of flat files and shell scripts is asking for trouble. This really needs a Systems Analyst.