Please help. Here is my problem: I have 9,000 lines in file a and 500,000 lines in file b. For each line in file a, I need to search file b and remove that line. I am currently using grep -v and loading the output into a new file, but because of the size of file b this takes an extremely long time, and I have 50 files similar to file b. Is there a simpler way to accomplish this? Here is a code snippet of what I have so far.
cat "$1" | while read LINE
do
    echo "$LINE"
    grep -v "$LINE" fileName > OutputFile
    cp OutputFile fileName
done
awk is amazingly well suited for this kind of operation. I've seen a similar method to yours take several days on flat files containing ~20 million records (I can't recall exactly), and something similar to the following took less than 3 minutes.
nawk '
# While processing records from file a (9000 lines)
FILENAME == "file_a.txt" {
    # Record key value that should be excluded from file b
    Keys[$1]++
}
# While processing records from file b (50000)
FILENAME == "file_b.txt" {
    # Look up key value in keys collected from file a
    if (Keys[$1] == 0) {
        # If the key is not found in the key array, save in the delta file
        # print > "deltas.txt"
        print $0
    }
}
' file_a.txt file_b.txt
I created a 9000 record test file (file_a.txt) and a 50000 record test file (file_b.txt) that consisted of one key field in each and the process took 1/5 second.
Repeating file_b.txt 10x in my example simulates your 500,000 record file, which processed in 2.88 seconds.
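Since you mention having about 50 files like file b, you could wrap the same idea in a shell loop. A rough sketch; I'm assuming the files to be filtered match a glob like file_b*.txt and that you want one deltas file per input (adjust the pattern and output names to your real setup). Note the hard-coded file_b.txt test is replaced by a next in the first block so the same script works for any second file:
for f in file_b*.txt
do
    nawk '
    # Collect keys from file a, then skip to the next record
    FILENAME == "file_a.txt" { Keys[$1]++; next }
    # Any record from the other file whose key was not seen in file a
    ! Keys[$1]               { print }
    ' file_a.txt "$f" > "$f.deltas"
done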
Also, the script can be simplified in the file_b.txt segment as follows:
FILENAME=="file_b.txt" && ! Keys[$1] { print > "deltas.txt" }
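Put together with the file_a.txt block above, the whole simplified script would read something like:
nawk '
FILENAME == "file_a.txt" { Keys[$1]++ }
FILENAME == "file_b.txt" && ! Keys[$1] { print > "deltas.txt" }
' file_a.txt file_b.txt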
Try this
egrep -v -f "$1" fileName > saveFile
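One caveat if you go this route: -f treats each line of $1 as a regular expression and matches it anywhere in the line, so lines containing regex metacharacters or partial matches can remove more than you intend. For exact, whole-line, fixed-string matches, something like this should do it:
grep -v -x -F -f "$1" fileName > saveFile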
How is this going to meet the OP's objective of speed? Matching 9,000 patterns against 500,000 lines with egrep is really slow.