Bash script too slow

tigta09 · January 29, 2010, 9:52am

I have a bash script that will take approximately 130 days to complete. I am trying to grep a list of 1,144 user IDs out of 41 files (1 GB each). The 41 files were originally one 41 GB file, but that was far too slow.
This is my current file:

#!/bin/bash
      for i in `cat WashFD.txt`  # 1,144 user IDs
          do
           for b in `cat xfiles` # 41 "x??" files
            do
          echo "looking for " $i "in " $b
          cat $b | grep -i $i   >> SEID.searches
      done
    done

Currently, I am processing one of the 41 files every 4 minutes. 4 x 41 = 164 min.
164 / 60 (min/hour) = 2.73 hours per user_id. I have 1,144 user_id's multiplied by 2.73 = 3123.12 hours. 3123.12 / 24 (hours in a day) = 130.13 days.
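
The estimate above can be reproduced with a few lines of shell. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the single-pass figure assumes that searching for all IDs at once costs no more per file than searching for one, which may not hold exactly.

```shell
#!/bin/bash
# Figures taken from the post; the variable names are hypothetical.
minutes_per_file=4
files=41
user_ids=1144

# One full pass over all 41 files.
per_pass_minutes=$(( minutes_per_file * files ))

# Nested loop: one full pass per user ID, converted to days.
nested_days=$(awk -v m="$per_pass_minutes" -v u="$user_ids" \
    'BEGIN { printf "%.1f", m * u / 60 / 24 }')

# Single pass: every file is read once, whatever the number of IDs.
single_pass_hours=$(awk -v m="$per_pass_minutes" \
    'BEGIN { printf "%.2f", m / 60 }')

echo "nested loop: about $nested_days days"
echo "single pass: about $single_pass_hours hours"
```

The small difference from 130.13 days comes from rounding 2.73 hours before multiplying.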

As you can see, that is far too long for this task. I don't know Perl, but I've heard it's faster. If anyone has any suggestions, please let me know.

trey85stang · January 29, 2010, 9:57am

Can you give an example of "xfiles"?

pludi · January 29, 2010, 10:01am

What platform/UNIX are you on, what's the format of the lines, and what do the user IDs look like?

tigta09 · January 29, 2010, 10:06am

The x files contain HTTP log entries. The user_id that I am looking for is the third field, DD\\ABCDE in this example.

2009-09-29 13:59:04 DD\\ABCDE 152.225.186.39 Search Engines and Portals GET http://ui.sina.com/assets/js/jump_home.js application/x-javascript 262 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322) TCP_HIT/304 SINA.com US ??? - ?? DIRECT/12.130.152.120 192.168.40.148

The user IDs look like DD\\ABCDE. The platform is Ubuntu Linux 8.10.

dpc.ucore.info · January 29, 2010, 10:19am

You don't seem to be using regexes at all. Use "fgrep" or "grep -F" to work in fixed-string mode. That way the processing time will be negligible compared to reading the data from disk.

OK. I can see that you're reading each file multiple times. This is the cause of the problem, not processing time.

Use grep regexes (with grep -E, so that | means alternation) and first compose the string of usernames like this:

user1|user2|user3|user4|...

Then grep each source file looking for all matches at once. Don't use

cat FILE | grep STRING

This is slower than simply:

grep STRING FILE

Save all matches to a temporary file and check for each username in that file. As this file should be much smaller than the original (I assume), you will save a lot of time when reading it once per user.
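
A minimal sketch of the two-pass idea described above, run on tiny sample data. The file names ids.txt, xaa, xab, candidates.tmp and per_user_counts.txt are hypothetical stand-ins, and the IDs here contain no regex metacharacters; real IDs with backslashes would need escaping before being joined into a regex.

```shell
#!/bin/bash
# Hypothetical demo data standing in for the ID list and the x?? log files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'user1' 'user3' > ids.txt
printf '%s\n' 'a user1 GET /x' 'b user2 GET /y' > xaa
printf '%s\n' 'c user3 GET /z' 'd user4 GET /w' > xab

# Pass 1: join the IDs into "user1|user3" and read each big file only once.
# -E makes | mean alternation; -h suppresses the file name prefix.
pattern=$(paste -sd'|' ids.txt)
grep -hE "$pattern" xaa xab > candidates.tmp

# Pass 2: per-user lookups now read only the small candidate file.
while IFS= read -r id; do
    printf '%s: %s\n' "$id" "$(grep -cF -- "$id" candidates.tmp)"
done < ids.txt > per_user_counts.txt

cat per_user_counts.txt
```

The expensive part, reading the large logs, happens once in pass 1; pass 2 loops over the IDs but only touches the much smaller candidates.tmp.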

pludi · January 29, 2010, 10:22am

OK here's what you can do:

use grep with -F as dpc.ucore.info suggested
use the -f file switch as described in the man page. That will allow you to load and search for multiple search strings at once
don't loop over 41 files, but specify them all at the command line at once. Use -H to display which file it was found in if needed.

With these, you could cut your search down to

grep -F -f WashFD.txt $( cat xfiles ) > SEID.searches
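
To try the command on a small scale first, here is a sketch with made-up data. The IDs, log lines and the names ids.txt and matches.txt are assumptions for illustration only; xaa, xab and xfiles mimic the layout described in the thread.

```shell
#!/bin/bash
# Hypothetical sample data; single quotes keep the backslashes literal.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'DS\01FNB' 'prod\sealccw' > ids.txt
printf '%s\n' \
    '2009-09-29 13:59:04 DS\01FNB 10.0.0.1 GET /a' \
    '2009-09-29 13:59:05 DS\OTHER 10.0.0.2 GET /b' > xaa
printf '%s\n' \
    '2009-09-29 14:00:00 prod\sealccw 10.0.0.3 GET /c' > xab
printf '%s\n' xaa xab > xfiles

# -F: fixed strings, so backslashes in the IDs are matched literally.
# -f: read all search strings from ids.txt and look for them in one pass.
# -H: prefix each hit with the name of the file it came from.
# $(cat xfiles) is left unquoted on purpose so it splits into file names.
grep -F -H -f ids.txt $(cat xfiles) > matches.txt

cat matches.txt
```

Note that the original loop used grep -i; add -i here as well if the IDs should match regardless of case.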

tigta09 · January 29, 2010, 10:54am

After running:

grep -F -H -f  WashFD.txt $( cat xfiles ) > SEID.searches

the SEID.searches file is growing very fast, however the entries do not match what is in the WashFD.txt.

Am I missing something?

methyl · January 29, 2010, 11:22am

What is in WashFD.txt? Is it a standard Unix text file?

head WashFD.txt | sed -n l

tail WashFD.txt | sed -n l

tigta09 · January 29, 2010, 11:30am

head WashFD.txt |sed -n l
DS\\\\01FNB$
DS\\\\01KFB$
prod\\\\sealccw$
DS\\\\04JFB$
DS\\\\05PJB$
DS\\\\080HB$
DS\\\\09VLB$
DS\\\\0G9JB$
DS\\\\0JKFB$
DS\\\\0K0FB$
tail WashFD.txt |sed -n l
DS\\\\ZSVMB$
DS\\\\ZT0CB$
DS\\\\ZT2BB$
DS\\\\ZTJNB$
DS\\\\ZVYHB$
DS\\\\ZW2CB$
DS\\\\ZWKFB$
prod\\\\sealnhm$
DS\\\\ZY6GB$
DS\\\\ZY7CB$

methyl · January 29, 2010, 11:56am

I see the problem (the backslashes) but not the solution.
I should say that I half expected to see a blank line in the match file which would have matched everything.
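
The blank-line effect mentioned above is easy to demonstrate and to rule out. The following is a sketch on made-up data (ids_with_blank.txt, ids_clean.txt and sample.log are hypothetical names); it does not claim that a blank line was the cause in this thread.

```shell
#!/bin/bash
# Hypothetical pattern file with an empty line in the middle.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'alice' '' 'bob' > ids_with_blank.txt
printf '%s\n' 'line for alice' 'line for carol' 'line for dave' > sample.log

# An empty pattern matches every line, so all three lines are counted.
all_hits=$(grep -c -F -f ids_with_blank.txt sample.log)

# Count the empty lines in the pattern file, then strip them.
blank_lines=$(grep -c '^$' ids_with_blank.txt)
sed '/^$/d' ids_with_blank.txt > ids_clean.txt

# Now only the line that really contains one of the IDs is counted.
clean_hits=$(grep -c -F -f ids_clean.txt sample.log)

echo "blank lines: $blank_lines, hits before: $all_hits, after: $clean_hits"
```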

tigta09 · January 29, 2010, 12:18pm

I used vi to remove all the slashes and preceding white space. I am running the script again, and the SEID.searches file is not growing wildly as it did before. At this point it is still empty after about 5 minutes. I hope this means that it just hasn't found any matches as of yet.

rdcwayx · January 29, 2010, 3:27pm

I am trying to understand your purpose rather than just your description.

It seems you just need to collect all the users in the logs, for counting or something similar. Maybe you don't need the user ID file at all.

Here is a sample for you.

awk '{print $3}' x* |sort |uniq -c |sort -n
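
A variant of the same idea, sketched on made-up data: here awk does the counting itself in an associative array, so only one line per distinct user reaches sort. It assumes, as the command above does, that the user ID is the third whitespace-separated field; the sample log lines and the name user_counts.txt are hypothetical.

```shell
#!/bin/bash
# Hypothetical sample logs; the user ID is the third field.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
    '2009-09-29 13:59:04 DS\A 10.0.0.1 GET /a' \
    '2009-09-29 13:59:05 DS\B 10.0.0.2 GET /b' > xaa
printf '%s\n' \
    '2009-09-29 14:00:00 DS\A 10.0.0.1 GET /c' > xab

# Count hits per user in one pass, then list the busiest users first.
awk '{ n[$3]++ } END { for (u in n) print n[u], u }' x* \
    | sort -rn > user_counts.txt

cat user_counts.txt
```

Each output line is a count followed by the user ID, so the result can be compared against the ID list afterwards if needed.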