Performing fast search operations with a bash script

Hi,

Here is a tough requirement to be served by a bash script.

I want to perform 300,000 * 10,000 searches.

i.e. I have 10,000 doc files and 300,000 html files in the file system. I want to check which of the doc files are referred to in any of the html files (e.g. <a href="abc.doc">abc</a>). Finally, I want to remove all the doc files which are not referenced from any of the html files.

Approach -1 :-
Initially I tried nested loops: the outer loop over the list of html files, and the inner loop over the list of doc files. Inside the inner loop, I checked (with the fgrep command) whether one doc file is referenced in one html file.
# html_list :- list of all html files (one path per line)
# doc_file_list :- list of all doc files (name and path, tab-separated)
# tmp_doc_file_list :- temporary list of doc files still unreferenced
while read -r l_line_outer
do
    : > tmp_doc_file_list    # start each pass with an empty temp list
    while read -r l_alias_name_file l_alias_path_file
    do
        # Keep the doc file only if it is NOT referenced in this html file
        if ! fgrep -q "$l_alias_name_file" "$l_line_outer"
        then
            printf "%s\t%s\n" "$l_alias_name_file" "$l_alias_path_file" >> tmp_doc_file_list
        fi
    done < doc_file_list
    mv tmp_doc_file_list doc_file_list
done < html_list

This approach gave the correct output, but it took far too long to perform this huge number of searches.

Approach -2 :-
Then we switched to a different approach, launching many fgrep processes in parallel.

  1. Outer loop over "doc_file_list" and inner loop over "html_list".
  2. Under a single process (inside the inner loop), I searched with fgrep for the existence of one doc file in 30 html files at once.
  3. I launched 10 such processes in parallel (by putting & at the end).

The sample code is as follows.
........
.........
while read l_line_outer
do
.......
< Logic to advance the outer loop in steps of 10 lines, i.e. the first pass starts from the 1st doc file and the next pass starts from the 11th. >
.......
while read l_line_inner
do
< Logic to advance the inner loop in steps of 30 lines, i.e. the first pass starts from the 1st html file and the next pass starts from the 31st. >
........
# Loop to launch multiple fgrep processes in parallel
for ((i=1; i<=10; i++))
do
( fgrep -s -m 1 <file{i}> <html1> <html2> <html3> ... <html30> > /dev/null ; echo $? >> thread_status{i} ) &
done
....
done < html_list
.....
<Logic to prepare the doc_file_list for the next pass and to manage the parallel processes>
.....
done < doc_file_list
......
.....
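
For reference, a simplified self-contained version of that skeleton is below (a sketch only: unreferenced_docs, html_chunk_* and doc_names are placeholder names, and doc_names is assumed to hold one doc file name per line, e.g. produced with cut -f1 doc_file_list > doc_names).

split -l 30 html_list html_chunk_          # groups of 30 html files per fgrep call

search_doc() {                             # $1 = one doc file name
    for chunk in html_chunk_*
    do
        # One fgrep scans 30 html files; -l stops reading each file at its
        # first match, so any output means the doc is referenced somewhere.
        if xargs fgrep -l -- "$1" < "$chunk" | grep -q .
        then
            return 0
        fi
    done
    printf '%s\n' "$1" >> unreferenced_docs    # never referenced in any html
}

# Launch 10 searches at a time and wait for each batch to finish.
i=0
while read -r doc
do
    search_doc "$doc" &
    i=$((i + 1))
    if [ "$i" -eq 10 ]; then wait; i=0; fi
done < doc_names
wait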

However, this approach is also not working:
a) I get the correct output on a small number of files/folders.
b) While performing 300,000 * 10,000 searches, my shell script gets deadlocked somewhere and the execution halts.
c) Even if I manage the deadlocking (process management) to some extent, it will still take a very long time to finish such a huge search.

Is there any alternative approach for making this search faster, so that it can be finished within 2-3 days?

Please help.

Thanks and Regards,

Jitendriya Dash.

Use a database!

Create a table "Table1" with columns "HTML" and "DOC", populated with one row for each document that appears in an html file.
Create another table "Table2" with a unique list of all DOCs.

select DOC from Table2 where DOC not in (select distinct DOC from Table1)

This will give a list of all unused DOCs.
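
As a concrete sketch, with sqlite3 for example (any database will do; refs.db is a placeholder name, and doc_names stands for your one-doc-name-per-line list):

sqlite3 refs.db "CREATE TABLE Table1 (HTML TEXT, DOC TEXT); CREATE TABLE Table2 (DOC TEXT);"
# Table2 can be bulk-loaded straight from the existing one-name-per-line file
sqlite3 refs.db ".import doc_names Table2"
# ... populate Table1 with one (HTML, DOC) row per <a href=...> reference ...
sqlite3 refs.db "SELECT DOC FROM Table2 WHERE DOC NOT IN (SELECT DISTINCT DOC FROM Table1);"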

Thanks for the input.

Using a database is a good idea; however, placing the contents of all the html files into the DB (as a BLOB/CLOB field), or placing only the <a href> lines that reference any documents into the DB, is a big task.

(i.e. how do I insert all these lines into the DB? For example, if one html file has 100 <a href> lines, how do I load all those lines into the DB for 300,000 html files?)

Can it be done quickly with a linux command?

Actually, the linux server has an 8-core processor. Is there any other way to speed up the search/grep operations and the loops by assigning the tasks to multiple cores?

Please give your inputs.

Thanks and Regards,

Jitendriya Dash.

Concatenate all the contents of the html files into one big file. Then run the searches for the filenames in parallel over that file, and on non-matches print the pattern (use grep -m 1 if using GNU grep, to avoid scanning the whole file once a match is found). If you have enough memory for the concatenated html (so it can be cached), it should be pretty fast.
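
A sketch of that approach (GNU grep and xargs assumed; all_html.txt, doc_names and unreferenced_docs are placeholder names, doc_names holding one doc file name per line, and file paths without whitespace):

# One big concatenated file of all the html contents
xargs cat < html_list > all_html.txt

search_one() {                              # $1 = one doc file name
    # -F = fixed-string match, -m 1 = stop at the first match;
    # print the name only when it is not found anywhere.
    grep -F -m 1 -q -- "$1" all_html.txt || printf '%s\n' "$1"
}
export -f search_one

# Run 8 searches in parallel to keep all 8 cores busy
xargs -a doc_names -I{} -P 8 bash -c 'search_one "$@"' _ {} > unreferenced_docs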