I'm searching for the most effective way to do the following task, so if someone can provide either a working solution with sed or a totally different but more effective one than what I've got so far, please go ahead!
The debugme directory has 3 subdirectories, and each of them contains one .txt file with about 48 entries.
time (
for FILE1 in `find debugme -name "*.txt"` ;do
for FILE2 in `awk '{print $1}' "$FILE1" | grep -i '^\([0-9]\+\)$'` ;do
#for FILE2 in `sed -n 's/^\([0-9]\+\) [a-zA-Z0-9]\+.*$/\1/p' "$FILE1"` ;do
CHECK=`grep "$FILE2" "debug.files"`
if [ "$CHECK" = "" ]; then
## ADD MISSING ENTRY BLABLA
echo "$FILE2 was missing, added!"
fi
done
done
)
Avg result:
real 0m0.174s
user 0m0.052s
sys 0m0.128s
time (
for FILE1 in `find debugme -name "*.txt"` ;do
FILE2() {
S=`grep "$1" debug.files`
if [ "$S" = "" ] ; then
## ADD MISSING ENTRY BLABLA
echo "$1 was missing, added!"
fi
}
while read cola colb ; do
FILE2 "$cola"
done < "$FILE1"
done
)
Avg result:
real 0m0.269s
user 0m0.064s
sys 0m0.228s
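For comparison, the per-entry grep in the loops above can at least short-circuit. A minimal sketch (the demo data below is made up, not your real files): grep -q exits at the first match, and its exit status replaces the command substitution plus the test(1) call:

```shell
# made-up demo data mimicking the layout described above
mkdir -p debugme/sub1
printf '301 alpha\n302 beta\n' > debugme/sub1/list.txt
printf '301\n' > debug.files

# same loop structure, but grep -q stops at the first match and its exit
# status replaces CHECK=`grep ...` plus the string comparison
for FILE1 in `find debugme -name "*.txt"` ;do
for FILE2 in `awk '{print $1}' "$FILE1" | grep -i '^\([0-9]\+\)$'` ;do
if ! grep -q "$FILE2" debug.files ; then
echo "$FILE2 was missing, added!"
fi
done
done
```

Run it in an empty scratch directory so the demo files don't mix with anything else.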
But if you are really serious about performance, then using perl(1) would be quicker; that is, using one binary to do all the processing rather than calling grep(1), test(1), sed(1), etc.
Having said that, I know some very able folk who could do the whole thing in nawk(1)!
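To illustrate the point about process counts without leaving the shell: the whole check can also be one pipeline, so grep(1) runs twice in total instead of once per entry. A sketch with made-up demo data (note the -Fx flags switch to exact whole-line matching against debug.files, whereas the original grep matched substrings):

```shell
# made-up demo data mimicking the layout described in the thread
mkdir -p debugme/sub1 debugme/sub2
printf '101 foo\n102 bar\n' > debugme/sub1/a.txt
printf '103 baz\n' > debugme/sub2/b.txt
printf '101\n103\n' > debug.files

# one pass: collect first fields, keep the numeric ones, then drop every
# entry that already appears as a whole line in debug.files
find debugme -name "*.txt" -exec awk '{print $1}' {} + |
grep -x '[0-9]\+' |
grep -vFxf debug.files |
while read -r entry ; do
echo "$entry was missing, added!"
done
```

Again, try it in an empty scratch directory; the demo data is only there to make the sketch runnable.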
That was just an example; the .txt file(s) can be either in the main directory or in subdirectories, and debugme is a variable. I guess that makes find necessary?
Oh, and I'm looking for the most effective way of doing this in bash, no perl.
ps. I've tested your suggestions, here is the result:
With your modifications:
real 0m0.171s
user 0m0.060s
sys 0m0.116s
Without:
real 0m0.170s
user 0m0.044s
sys 0m0.124s
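Since the goal is to stay in bash: with bash 4+, one option is to read debug.files once into an associative array, so each check becomes a hash lookup with no fork at all. A sketch with made-up demo data (it assumes whole-line matches against debug.files, while the original grep matched substrings):

```shell
# made-up demo data mimicking the layout described in the thread
mkdir -p debugme/sub1
printf '201 foo\n202 bar\n' > debugme/sub1/a.txt
printf '201\n' > debug.files

# load debug.files once into an associative array (bash 4+)
declare -A seen
while IFS= read -r line ; do
seen[$line]=1
done < debug.files

# then every check is a lookup instead of a grep(1) invocation
for FILE1 in `find debugme -name "*.txt"` ;do
while read -r cola colb ; do
[[ $cola =~ ^[0-9]+$ ]] || continue
if [[ -z ${seen[$cola]} ]]; then
echo "$cola was missing, added!"
fi
done < "$FILE1"
done
```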
---------- Post updated at 05:14 PM ---------- Previous update was at 02:36 PM ----------
So nobody has any suggestions on how to make it run quicker?
---------- Post updated 11-05-09 at 04:29 AM ---------- Previous update was 11-04-09 at 05:14 PM ----------
Seriously, I thought that some of the experts here would know how to optimize such a task.
find debugme -name "*.txt" | awk 'BEGIN{
# get all the lines of debug.files into an array for later comparison. Analogous to grep "$FILE2" "debug.files"
while( (getline line < "debug.files" ) > 0 ) {
a[++d]=line
}
close("debug.files")
}
{
filename=$0
while( (getline line < filename ) > 0 ){
m=split(line,t," ") # split the line on space, this is the same as your awk "{print $1}"
if ( t[1]+0 == t[1] ){ # this should be equivalent to grep -i '^\([0-9]\+\)$'
f=0 # reset the marker for each candidate line
for(i=1;i<=d;i++){
## print out variable values for debugging as needed
if( a[i] ~ t[1] ){ # go through the lines of debug.files and compare with t[1]
print "found"
f=1
}
}
if(f==0){
print "not found"
}
}
}
close(filename)
}'
Your attempt seems to be the best when it comes to performance, but it doesn't work, as it shows "not found" for each line! I'd appreciate it a lot if you could get it working.
No, you should be the one getting it working, since firstly, you have the exact environment and I don't. Secondly, it's not my work, it's yours. I have edited the script with comments. Look at it, play with it, read the docs, print out the values of the variables to see what they contain at runtime, do whatever. Then post again if you hit problems.
Okay, thanks a lot. Also, the code I've posted was only an example, and in my real script there's a different regex pattern than ^\([0-9]\+\)$, so that might be the reason why it's not working.
That's how I've got it set:
for FILE2 in `awk '{print $1}' "$FILE1" | grep -i '\(rar\|r[0-9]\+\)$'` ;do
I do not really understand how t[1]+0 == t[1] is equivalent to grep -i '^\([0-9]\+\)$'... but how would I make it work for the real pattern I use (grep -i '\(rar\|r[0-9]\+\)$')?
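In case it helps: t[1]+0 == t[1] works because awk coerces the string to a number with +0, and the comparison only holds when the whole string is numeric, which is why it mimics ^[0-9]+$. For a non-numeric pattern you'd swap that test for a regex match. A sketch with made-up demo data (filenames like movie.rar are my assumption; tolower() stands in for grep -i, and index() stands in for the substring match that grep "$FILE2" debug.files does):

```shell
# made-up demo data: field 1 holds names such as movie.rar / movie.r01
mkdir -p debugme/sub1
printf 'movie.rar x\nmovie.r01 y\nreadme.txt z\n' > debugme/sub1/a.txt
printf 'movie.rar\n' > debug.files

find debugme -name "*.txt" | awk 'BEGIN{
while( (getline line < "debug.files" ) > 0 ){
a[++d]=line
}
close("debug.files")
}
{
filename=$0
while( (getline line < filename ) > 0 ){
split(line,t," ")
if( tolower(t[1]) ~ /(rar|r[0-9]+)$/ ){ # replaces the t[1]+0 == t[1] test
f=0
for(i=1;i<=d;i++){
if( index(a[i],t[1]) ){ # substring check, like grep "$FILE2" debug.files
f=1
}
}
if(f==0){
print t[1] " was missing, added!"
}
}
}
close(filename)
}'
```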