I have a bash script that will take approx. 130 days to complete. I am trying to grep a list of 1,144 user IDs out of 41 files (1 GB each). The 41 files were originally a single 41 GB file, but searching that was far too slow.
This is my current script:
#!/bin/bash
for i in $(cat WashFD.txt)   # 1,144 user IDs
do
    for b in $(cat xfiles)   # 41 "x??" files
    do
        echo "looking for $i in $b"
        cat "$b" | grep -i "$i" >> SEID.searches
    done
done
Currently, I am processing one of the 41 files every 4 minutes. 4 x 41 = 164 min.
164 / 60 (min/hour) = 2.73 hours per user_id. I have 1,144 user_ids; multiplied by 2.73 hours, that is 3123.12 hours. 3123.12 / 24 (hours in a day) = 130.13 days.
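The same estimate can be reproduced in the shell. This is only a sketch; the function name estimate_days is made up for illustration, and it does not round the intermediate hours figure, so the result differs slightly from the hand calculation above.

```shell
#!/bin/bash
# Estimate the total run time in days for the nested-loop approach.
# Arguments: minutes per file, number of files, number of user IDs.
estimate_days() {
    awk -v m="$1" -v f="$2" -v u="$3" \
        'BEGIN { printf "%.1f\n", m * f * u / 60 / 24 }'
}

estimate_days 4 41 1144   # roughly 130 days
```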
As you can see, that is far too long for this task. I don't know Perl, but I've heard it's faster. If anyone has any suggestions, please let me know.
You don't seem to use regexes at all. Use "fgrep" or "grep -F" to work in fixed-strings mode. That way the processing time will be negligible compared to the time spent reading the data from disk.
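A minimal sketch of the fixed-strings search, assuming the IDs really are plain literal strings. The helper name find_id is hypothetical; the file names in the usage comment come from the original script.

```shell
#!/bin/bash
# Search one file for one literal ID, ignoring case.
# -F treats the ID as a fixed string, so characters such as "." are not
# interpreted as regex operators; -i keeps the original case-insensitivity.
find_id() {
    local id=$1 file=$2
    grep -F -i -- "$id" "$file"
}

# Usage:
#   find_id "some_user" xaa >> SEID.searches
```

Passing the file name to grep directly also avoids the extra cat process.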
---------- Post updated at 05:19 PM ---------- Previous update was at 05:11 PM ----------
OK. I can see that you're reading each file multiple times. This is the cause of the problem, not processing time.
Use grep's extended regexes (grep -E, because an unescaped | only means "or" in extended mode) and first compose the string of usernames like this:
user1|user2|user3|user4|...
Then grep each source file looking for all matches at once. Don't use
cat FILE | grep STRING
This is slower than simply:
grep STRING FILE
Save all matches to a temporary file, and then check that file for each username. As this file should be much smaller than the original (I assume), you will save a lot of time when reading it once per user.
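The two-pass idea might look roughly like the sketch below. Instead of hand-building the user1|user2|... string, it swaps in grep's -f option, which reads one pattern per line from the ID list, combined with -F for literal matching. The function names are hypothetical, and the ID list is assumed to contain no blank lines, since an empty pattern matches every line.

```shell
#!/bin/bash
# Pass 1: read each big file once and keep every line containing any ID.
# -F literal strings, -i ignore case, -h no file-name prefix,
# -f read the patterns (one per line) from the ID list.
collect_matches() {
    local id_list=$1 out=$2
    shift 2
    grep -F -i -h -f "$id_list" -- "$@" > "$out"
}

# Pass 2: the per-ID loop now scans only the small match file.
# Prints "ID count" for each ID in the list.
report_per_id() {
    local id_list=$1 matches=$2 id
    while IFS= read -r id; do
        [ -n "$id" ] || continue   # skip blank lines in the ID list
        printf '%s %s\n' "$id" "$(grep -F -i -c -- "$id" "$matches")"
    done < "$id_list"
}

# Usage (file names from the original script):
#   collect_matches WashFD.txt SEID.searches x??
#   report_per_id WashFD.txt SEID.searches
```

With this layout each 1 GB file is read once in total rather than once per user ID.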
I see the problem (the backslashes) but not the solution.
I should say that I half expected to see a blank line in the match file which would have matched everything.
I used vi to remove all the slashes and preceding white space. I am running the script again, and the SEID.searches file is not growing wildly as it did before. At this point it is still empty after about 5 minutes. I hope this means that it just hasn't found any matches as of yet.
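For anyone who would rather script that cleanup than do it in vi, here is a rough sketch. It assumes the stray backslashes sit at the end of each line of the ID list, which is a guess about the file's layout; the function name clean_id_list and the output name WashFD.clean are made up for illustration. It also drops blank lines, since an empty pattern matches every line.

```shell
#!/bin/bash
# Remove a trailing backslash together with the white space around it,
# then delete any lines that are empty or contain only white space.
clean_id_list() {
    sed -e 's/[[:space:]]*\\[[:space:]]*$//' \
        -e '/^[[:space:]]*$/d' "$1"
}

# Usage:
#   clean_id_list WashFD.txt > WashFD.clean
```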