sed / grep / for statement performance - please help

TehOne · November 4, 2009, 4:41pm

I'm searching the most effective way of doing the following task, so if someone can either provide a working solution with sed or one totally different but more effective then what I've got so far then please go ahead!

The debugme directory has 3 subdirectorys and each of them has one .txt file with about 48 entrys each.

time (
for FILE1 in `find debugme -name "*.txt"` ;do
    for FILE2 in `cat "$FILE1" | awk '{print $1}' | grep -i '^\([0-9]\+\)$'` ;do
    #for FILE2 in `sed 's/\([0-9]\+\) \([a-zA-Z0-9]\+\)/\1/i' "$FILE"` ;do
        CHECK=`grep "$FILE2" "debug.files"`
        if [ "$CHECK" = "" ]; then
            ## ADD MISSING ENTRY BLABLA
            echo "$FILE2 was missing, added!"
        fi
    done
done
)

Avg result:
real    0m0.174s
user    0m0.052s
sys     0m0.128s

time (
for FILE1 in `find debugme -name "*.txt"` ;do
    FILE2() {
        S=`grep $1 debug.files`
        if [ "$S" = "" ] ; then
            ## ADD MISSING ENTRY BLABLA
            echo "$FILE2 was missing, added!"
        fi
    }
    while read cola colb ; do
        $(FILE2 $cola)
    done < $FILE1
done
)

Avg result:
real    0m0.269s
user    0m0.064s
sys     0m0.228s

The .txt files are in this format:

[03:53:22] root:~# cat example.txt
52352578 ABF2778ABD^
73534536 LASDM337lA^
83523422 JFAASMM31^

And debug.files in this:

[03:53:25] root:~# cat debug.files
52352578
73534536

TonyFullerMalv · November 4, 2009, 4:54pm

I suspect that:

if [ -z "`grep $1 debug.files`" ] ; then

will be quicker than:

S=`grep $1 debug.files`
if [ "$S" = "" ] ; then

Also:

awk '{print $1}' "$FILE1" |

will be more efficient than:

cat "$FILE1" | awk '{print $1}' |

But if you are really serious about performance then using perl(1) would be quicker, that is using one binary to do all the processing rather than calling: grep(1), test(1), sed(1), etc.

Saying that I know some very able folk who could do the whole thing in nawk(1)!

Hope this helps...

TonyLawrence · November 4, 2009, 5:26pm

And do you need "find" at all?

As it's only three directories and 1 file in each

for file in  */*.txt; do

is surely faster than starting up find.

If that's what you are after, of course.

TehOne · November 5, 2009, 7:29am

tonylawrence:

And do you need "find" at all?

As it's only three directories and 1 file in each
for file in  */*.txt; do 
is surely faster than starting up find.

If that's what you are after, of course.

That was just an example, the .txt file(s) is/are either in the main directory or in subdirectorys and debugme is a variable. I guess that find is necessary?

Oh and I'm looking for the most effective way of doing this in bash, no perl

ps. I've tested your suggestions, here is the result:

With your modifications:
real    0m0.171s
user    0m0.060s
sys     0m0.116s
Without:
real    0m0.170s
user    0m0.044s
sys     0m0.124s

---------- Post updated at 05:14 PM ---------- Previous update was at 02:36 PM ----------

So nobody has any suggestions on how to make it run quicker?

---------- Post updated 11-05-09 at 04:29 AM ---------- Previous update was 11-04-09 at 05:14 PM ----------

Seriously I though that some of the experts here would know how to optimize such a task

ghostdog74 · November 5, 2009, 9:01am

get rid of all those useless cats, greps and sed.

find debugme -name "*.txt" | awk 'BEGIN{
   # get all the lines of debug.files into array for later comparison. Analogous to grep "$FILE2" "debug.files"
   while( (getline line < "debug.files" ) > 0 ) {
       a[++d]=line
   }
   close("debug.files")
}
{
   filename=$0
   f=0 #at the start of processing each file, set f=0 marker
   while( (getline line < filename ) > 0 ){
       m=split(line,t," ")  # split the line on space, this is same as your awk "{print $1}"
       if ( t[1]+0 == t[1]){  # this should be equivalent to grep -i '^\([0-9]\+\)$
           for(i=1;i<=d;i++){
              ## print out variable values for debugging as needed
              if( a ~ t[1] ){  # go through the lines of debug.files and compare with t[1]
                print "found"
                found=1   
              }
           }
       }       
       if(f==0){
        print "not found"
       }
   }
   close(filename)   
}'

NB:not tested.

TehOne · November 5, 2009, 9:37am

ghostdog74:

get rid of all those useless cats, greps and sed.

find debugme -name "*.txt" | awk 'BEGIN{
   while( (getline line < "debug.files" ) > 0 ) {
   a[++d]=line
   }
   close("debug.files")
}
{
   filename=$0
   while( (getline line < filename ) > 0 ){
   m=split(line,t," ")
   if ( t[1]+0 == t[1]){
   for(i=1;i<=d;i++){
   if( a ~ t[1] ){
   print "found"
   found=1   
   }
   }
   }       
   if(f==0){
   print "not found"
   }
   }
   close(filename)   
}'

NB:not tested.

Your attempt seems to be best when it comes to performance but it doesn't work as it shows not found for each line! I'd appreciate it alot if you could get it working.

ghostdog74 · November 5, 2009, 9:59am

no, you should be the one getting it working, since firstly, you have the exact environment, i don't. Secondly, its not my work, its yours. I have edited the script with comments. look at it, play with it, read the docs, print out the values of the variables to see what they contain at runtime, do whatever. then post again if you hit problems.

TehOne · November 5, 2009, 11:29am

Okey thanks alot, also the code I've posted was only an example and in my real script there's a different regex pattern then ^$[0-9]\+$$ so that might be the reason why it's not working.

That's how I've got it set:

for FILE in `awk '{print $1}' "$FILE" | grep -i '\(rar\|r[0-9]\+\)$'` ;do

I do not really understand how t[1]+0 == t[1] is equivalent to grep -i '^$[0-9]\+$$... But how would I make it work for the real patter I use? (grep -i '$rar\|r[0-9]\+$$')