Performing fast search operations with a bash script

Hi,

Here is a tough requirement to be served by a bash script.

I want to perform 300,000 * 10,000 searches.

i.e. I have 10,000 doc files and 300,000 html files in the file system. I want to check which of the doc files are referred to in any of the html files (e.g. <a href="abc.doc">abc</a>). Finally, I want to remove all the doc files which are not referenced from any of the html files.

Approach -1 :-
Initially I tried nested loops: the outer loop over the list of html files, and the inner loop over the list of doc files. Inside the inner loop, I checked (with the fgrep command) whether one doc file is referenced in one html file.
# html_list :- list of all html files (one path per line)
# doc_file_list :- list of all doc files (name and path, tab-separated)
# tmp_doc_file_list :- temporary list of doc files still unreferenced
while read -r l_line_outer
do
    : > tmp_doc_file_list    # start each pass with an empty temp list
    while read -r l_alias_name_file l_alias_path_file
    do
        # Keep the doc file only if it is NOT referenced in this html file
        if ! fgrep -q "$l_alias_name_file" "$l_line_outer"
        then
            printf "%s\t%s\n" "$l_alias_name_file" "$l_alias_path_file" >> tmp_doc_file_list
        fi
    done < doc_file_list
    mv tmp_doc_file_list doc_file_list
done < html_list

This approach gave the correct output, but it took far too long to perform this huge number of searches.

Approach -2 :-
Then we switched to a different approach, launching many fgrep processes in parallel.

  1. Outer loop over "doc_file_list" and inner loop over "html_list".
  2. Under a single process (inside the inner loop), I searched with fgrep for the existence of one doc file in 30 html files at once.
  3. I launched 10 such processes in parallel (by putting & at the end).

The sample code is as follows.
........
.........
while read l_line_outer
do
.......
< Logic to advance the outer loop in steps of 10 lines, i.e. the first pass starts from the 1st doc file and the next pass starts from the 11th. >
.......
while read l_line_inner
do
< Logic to advance the inner loop in steps of 30 lines, i.e. the first pass starts from the 1st html file and the next pass starts from the 31st. >
........
# Loop to launch multiple fgrep processes in parallel
for ((i=1; i<=10; i++))
do
( fgrep -s -m 1 <file{i}> <html1> <html2> <html3> ... <html30> > /dev/null ; echo $? >> thread_status{i} ) &
done
....
done < html_list
.....
<Logic to prepare the doc_file_list for the next pass and to manage the parallel processes>
.....
done < doc_file_list
......
.....
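
For reference, a simplified self-contained version of that skeleton is below (a sketch only: unreferenced_docs, html_chunk_* and doc_names are placeholder names, and doc_names is assumed to hold one doc file name per line, e.g. produced with cut -f1 doc_file_list > doc_names).

split -l 30 html_list html_chunk_          # groups of 30 html files per fgrep call

search_doc() {                             # $1 = one doc file name
    for chunk in html_chunk_*
    do
        # One fgrep scans 30 html files; -l stops reading each file at its
        # first match, so any output means the doc is referenced somewhere.
        if xargs fgrep -l -- "$1" < "$chunk" | grep -q .
        then
            return 0
        fi
    done
    printf '%s\n' "$1" >> unreferenced_docs    # never referenced in any html
}

# Launch 10 searches at a time and wait for each batch to finish.
i=0
while read -r doc
do
    search_doc "$doc" &
    i=$((i + 1))
    if [ "$i" -eq 10 ]; then wait; i=0; fi
done < doc_names
wait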

However, this approach is also not working:
a) I get the correct output on a small number of files/folders.
b) While performing 300,000 * 10,000 searches, my shell script gets deadlocked somewhere and the execution halts.
c) Even if I manage the deadlocking (process management) to some extent, it will still take a very long time to finish such a huge search.

Is there any alternative approach for making this search faster, so that it can be finished within 2-3 days?

Please help.

Thanks and Regards,

Jitendriya Dash.

Use a database!

Create a table "Table1" with columns "HTML" and "DOC", populated with one row for each document that appears in an html file.
Create another table "Table2" with a unique list of all DOCs.

select DOC from Table2 where DOC not in (select distinct DOC from Table1)

This will give a list of all unused DOCs.
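
As a concrete sketch, with sqlite3 for example (any database will do; refs.db is a placeholder name, and doc_names stands for your one-doc-name-per-line list):

sqlite3 refs.db "CREATE TABLE Table1 (HTML TEXT, DOC TEXT); CREATE TABLE Table2 (DOC TEXT);"
# Table2 can be bulk-loaded straight from the existing one-name-per-line file
sqlite3 refs.db ".import doc_names Table2"
# ... populate Table1 with one (HTML, DOC) row per <a href=...> reference ...
sqlite3 refs.db "SELECT DOC FROM Table2 WHERE DOC NOT IN (SELECT DISTINCT DOC FROM Table1);"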

Thanks for the input.

Using a database is a good idea; however, placing the contents of all the html files into the DB (as a BLOB/CLOB field), or placing only the <a href> lines that reference any documents into the DB, is a big task.

(i.e. how do I insert all these lines into the DB? For example, if one html file has 100 <a href> lines, how do I load all those lines into the DB for 300,000 html files?)

Can it be done quickly with a linux command?

Actually, the linux server has an 8-core processor. Is there any other way to speed up the search/grep operations and the loops by assigning the tasks to multiple cores?

Please give your inputs.

Thanks and Regards,

Jitendriya Dash.

Concatenate all the contents of the html files into one big file. Then run the searches for the filenames in parallel over that file, and on non-matches print the pattern (use grep -m 1 if using GNU grep, to avoid scanning the whole file once a match is found). If you have enough memory for the concatenated html (so it can be cached), it should be pretty fast.
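
A sketch of that approach (GNU grep and xargs assumed; all_html.txt, doc_names and unreferenced_docs are placeholder names, doc_names holding one doc file name per line, and file paths without whitespace):

# One big concatenated file of all the html contents
xargs cat < html_list > all_html.txt

search_one() {                              # $1 = one doc file name
    # -F = fixed-string match, -m 1 = stop at the first match;
    # print the name only when it is not found anywhere.
    grep -F -m 1 -q -- "$1" all_html.txt || printf '%s\n' "$1"
}
export -f search_one

# Run 8 searches in parallel to keep all 8 cores busy
xargs -a doc_names -I{} -P 8 bash -c 'search_one "$@"' _ {} > unreferenced_docs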