I have around 300 files (*.rdf, *.fmb, *.pll, *.ctl, *.sh, *.sql, *.prog), all of them large.
Around 8000 keywords (stored in the file $keywordfile) need to be searched for inside those files.
If a keyword is found in a file, I have to insert the filename, extension, category, keyword, and occurrence count into a database.
I have implemented the following code, but it takes around 10-12 hours to complete.
Could you please suggest how I can change it so that it runs faster?
I am using Solaris.
/usr/xpg4/bin/find $tmpdir -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog" \) | while read filename
do
    while read keyword
    do
        matchCount=`/usr/xpg4/bin/grep -F -i -x "$keyword" "$filename" | wc -l`
        if [ "$matchCount" -ne 0 ]; then
            out3=`echo "$filename" | awk -F\. '{print $NF}'`
            bfilename=`basename "$filename"`
            case $out3 in
                'rdf') catagoery="REPORT";;
                'fmb') catagoery="FORM";;
                'sql') catagoery="SQL FILE";;
                'pll') catagoery="Library File";;
                'ctl') catagoery="Control File";;
                'sh') catagoery="Shell script";;
                *) catagoery="OTHER";;
            esac
            echo "bfilename,keyword,matchCount,out3,catagoery are:- $bfilename,$keyword,$matchCount,$out3,$catagoery"
            sqlplus -s $usrname/$password@$dbSID <<-SQL >> spot_fsearch.log
INSERT INTO AA_DETAIL (FILE_NAME,DEP_OBJECT_NAME,OCCURANCE,FILE_TYPE,PROGRAM_TYPE) values ('$bfilename','$keyword',$matchCount,'$out3','$catagoery');
UPDATE BB_DETAIL SET (DEP_OBJECT_TYPE,MODULE_SHORT_NAME,APPLICATION,OBJECT_STATUS,OBJ_ADDN_INFO) = (SELECT OBJECT_TYPE,MODULE_SHORT_NAME,APPLICATION,OBJECT_STATUS,OBJ_ADDN_INFO FROM CG_COMPARATIVE_MATRIX_TAB WHERE upper(OBJECT_NAME)=upper('$keyword') AND ROWNUM<2) WHERE upper(DEP_OBJECT_NAME) = upper('$keyword');
UPDATE CC_CUSTOM_FILES_SUMMARY SET IMPACTED_BY_UPGRADE='$out2' WHERE FILE_NAME='$bfilename';
quit;
SQL
        fi
    done < $keywordfile
done
Searching 8000 keywords in 300 large files is quite a job, but the program you show can be optimized considerably:
a) Don't reopen and reread the keyword file line by line for every file matching your pattern.
b) Don't start a grep process for every single keyword/file combination (300 x 8000 = 2.4 million greps!)
c) Don't pipe every one of those greps through wc -l (another 2.4 million processes)
d) Don't run sqlplus, including a fresh login, for every single keyword/file combination; collect the results into a file and do the inserts and updates afterwards.
This is untested and far from complete; you will need to experiment. It replaces your two while loops: it reads all the keywords, then scans all the files found by your find command, and prints output you can capture into a file, sqlload into your DB in one go, and then run the inserts and updates against:
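Point d) alone eliminates thousands of sqlplus logins. A rough sketch of the idea (the append_insert helper and the /tmp path are invented for illustration): append each generated statement to a single SQL file during the scan, then connect once at the end:

```shell
# Collect generated SQL into one file instead of starting
# sqlplus for every keyword/file combination.
sqlfile=/tmp/spot_fsearch.sql
: > "$sqlfile"

# Called inside the match loop: appends a statement, never connects.
append_insert() {
    bfilename=$1 keyword=$2 matchCount=$3 out3=$4 catagoery=$5
    echo "INSERT INTO AA_DETAIL (FILE_NAME,DEP_OBJECT_NAME,OCCURANCE,FILE_TYPE,PROGRAM_TYPE) values ('$bfilename','$keyword',$matchCount,'$out3','$catagoery');" >> "$sqlfile"
}

append_insert demo.sql SOME_KEYWORD 3 sql "SQL FILE"
append_insert demo.sh OTHER_KEYWORD 1 sh "Shell script"

# After all loops finish: one login, one batch.
# sqlplus -s $usrname/$password@$dbSID @"$sqlfile"
cat "$sqlfile"
```

One connection instead of up to 2.4 million; loading a flat data file with sqlldr would be faster still.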
awk 'BEGIN {CAT["rdf"]="REPORT"
CAT["fmb"]="FORM"
CAT["sql"]="SQL FILE"
CAT["pll"]="Library File"
CAT["ctl"]="Control File"
CAT["sh"]= "Shell script"
}
FNR == NR {KY[$0]; next}                # read in all the keywords
FNR == 1 && NR > FNR {                  # a new input file begins
if (FN) {EXT = FN; sub (/.*\./,"",EXT)  # flush the previous file: get its extension
for (i in MCNT) {                       # for all matched keywords,
print FN, i, MCNT[i], EXT, CAT[EXT]     # print out the counts
delete MCNT[i]                          # and reset them for the new file
}
}
FN = FILENAME                           # retain FILENAME for the next loop
}
{for (i in KY) if ($0 ~ i) MCNT[i]++}   # count matching keywords in each line
END {EXT = FN; sub (/.*\./,"",EXT)      # same as above for the last file
for (i in MCNT)
print FN, i, MCNT[i], EXT, CAT[EXT]
}
' $keywordfile $(find $tmpdir -type f -name ....) # may blast your ARG_MAX
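The FNR == NR two-file idiom is easy to verify on toy data before pointing it at 300 files (file names invented for the demo; index() is used here for literal matching instead of ~):

```shell
# First file: keywords, one per line. Remaining files: data.
printf 'foo\nbar\n' > /tmp/kw_demo.txt
printf 'a foo line\nnothing\nfoo and bar\n' > /tmp/data_demo.txt

# FNR==NR holds only while reading the first file, so its lines
# become keyword array entries; the later files get scanned.
awk 'FNR == NR {KY[$0]; next}
     {for (i in KY) if (index($0, i)) MCNT[i]++}
     END {for (i in MCNT) print i, MCNT[i]}' \
    /tmp/kw_demo.txt /tmp/data_demo.txt | sort
# prints:
# bar 1
# foo 2
```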
I ran the code, which gave me an error near the find command:
syntax error at line 25: `(' unexpected
So I backquoted the find command and ran it as below.
keywordfile="keyword.txt"
/usr/xpg4/bin/awk 'BEGIN {CAT["rdf"]="REPORT"
CAT["fmb"]="FORM"
CAT["sql"]="SQL FILE"
CAT["pll"]="Library File"
CAT["ctl"]="Control File"
CAT["sh"]= "Shell script"
}
FNR == NR {KY[$0]; next} # read in all the keywords
FNR == 1 && FN {EXT = FN; sub (/.*\./,".", EXT) # if new file, obtain the extension
for (i in MCNT) # for all matches,
print FN, i, MCNT, EXT, CAT[EXT] # print out the old values
FN = FILENAME # retain FILENAME for next loop
}
{for (i in KY) if ($0 ~ i) MCNT++} # find matching keywords in each line
END {EXT = FN; sub (/.*\./,".", EXT) # same as above for last file
for (i in MCNT)
print FN, i, MCNT, EXT, CAT[EXT]
}
' $keywordfile `/usr/xpg4/bin/find /usr/tmp/SB -type f -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" -o -name "*.ctl" -o -name "*.sh" -o -name "*.sql" -o -name "*.prog"`
But it gives me the error below:
/usr/xpg4/bin/awk: line 16 (NR=7758): /DR$PV_ENTY_ATTR_TEXTS_U2$R/: unknown regex error
I checked the keyword file and some of the keywords contain a $ symbol, so the regex match breaks on them.
Also, some filenames contain spaces.
Please let me know what modifications I should make here.
As I said: you need to experiment. Try printing the lines with matches. Try smaller files.
Why don't you create a, say, 10-keyword file and work on a subset of two or three sample files that contain a known set of keywords?
The error message you post points to the END section, i.e. the problem is within the last file. That can be good news, as all the earlier files passed!
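For the keywords containing $, one possible remedy (my suggestion, untested on your real data) is to match literally with index() rather than with ~, which interprets the keyword as a regular expression:

```shell
# A keyword full of regex metacharacters, and a file containing it.
printf 'DR$PV_ENTY_ATTR_TEXTS_U2$R\n' > /tmp/kw2.txt
printf 'line with DR$PV_ENTY_ATTR_TEXTS_U2$R inside\nplain\n' > /tmp/data2.txt

# index($0, i) looks for the keyword as a plain substring,
# so "$" cannot trigger a regex error.
awk 'FNR == NR {KY[$0]; next}
     {for (i in KY) if (index($0, i)) MCNT[i]++}
     END {for (i in MCNT) print MCNT[i], i}' /tmp/kw2.txt /tmp/data2.txt
# prints: 1 DR$PV_ENTY_ATTR_TEXTS_U2$R
```

Note that index() is a substring match, unlike your grep -x whole-line match; for an exact line comparison use $0 == i instead.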
If you do more than just ask us to do your work for you, you'll find us more willing to lend assistance.
If Rudi's code did not work for you, what did you do to try to remedy the shortcomings? If nothing, don't expect much from us. I can only speak for myself (but I suspect others share the sentiment) when I say that I prefer to help those that help themselves.
I suggest running grep with -f $keywordfile inputfiles...
That starts grep far less often, and opens each input file only once.
The post-processing is a bit more awkward:
PATH=/usr/xpg4/bin:${PATH}
export PATH
find $tmpdir -type f \( -name "*.rdf" -o -name "*.fmb" -o -name "*.pll" -o -name "*.ctl" -o -name "*.sh" \
-o -name "*.sql" -o -name "*.prog" \) -exec grep -F -i -x -f $keywordfile /dev/null {} + |
# the /dev/null guarantees >=2 arguments so grep always returns filename:matchword
# fold matched keywords to lowercase and remove duplicates and add matchcount
awk -F":" '{k2=tolower(substr($0,length($1)+1))} {c[$1 k2]++} END {for (i in c) print c[i] FS i}' |
while IFS=":" read matchCount filename keyword
do
    out3=`echo "$filename" | awk -F\. '{print $NF}'`
    bfilename=`basename "$filename"`
    case $out3 in
        'rdf') catagoery="REPORT";;
        'fmb') catagoery="FORM";;
        'sql') catagoery="SQL FILE";;
        'pll') catagoery="Library File";;
        'ctl') catagoery="Control File";;
        'sh') catagoery="Shell script";;
        *) catagoery="OTHER";;
    esac
    echo "bfilename,keyword,matchCount,out3,catagoery are:- $bfilename,$keyword,$matchCount,$out3,$catagoery"
    # SQL stuff follows
done
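To convince yourself the grep/awk counting stage behaves, here is a toy run (paths invented for the demo; add the /usr/xpg4/bin prefix on Solaris):

```shell
# Two keywords, and a sample file where "alpha" appears twice
# in different cases and "beta" not at all.
mkdir -p /tmp/fsearch_demo
printf 'ALPHA\nbeta\n' > /tmp/fsearch_demo/kw.txt
printf 'alpha\nALPHA\nother\n' > /tmp/fsearch_demo/a.sql

# grep prints filename:matchedline; awk lowercases the matched
# line, folds duplicates, and prepends the count.
grep -F -i -x -f /tmp/fsearch_demo/kw.txt /dev/null /tmp/fsearch_demo/a.sql |
awk -F":" '{k2=tolower(substr($0,length($1)+1))} {c[$1 k2]++}
           END {for (i in c) print c[i] FS i}'
# prints: 2:/tmp/fsearch_demo/a.sql:alpha
```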