Please help. Here is my problem: I have 9,000 lines in file a and 500,000 lines in file b. For each line in file a, I need to search file b and remove that line. I am currently using grep -v and loading the output into a new file, but because of the size of file b this takes an extremely long time, and I have 50 files similar to file b. Is there a simpler way to accomplish this? Here is a code snippet of what I have so far.
cat "$1" | while read LINE
do
    echo "$LINE"
    grep -v "$LINE" fileName > OutputFile
    cp OutputFile fileName
done
awk is amazingly well suited for this kind of operation. I've seen a similar method to yours take several days on flat files containing ~20 million records (I can't recall exactly), and something similar to the following took less than 3 minutes.
nawk '
# While processing records from file a (9000 lines)
FILENAME == "file_a.txt" {
    # Record key value that should be excluded from file b
    Keys[$1]++
}
# While processing records from file b (50000)
FILENAME == "file_b.txt" {
    # Look up key value in keys collected from file a
    if (Keys[$1] == 0) {
        # If the key is not found in the key array, save in the delta file
        # print > "deltas.txt"
        print $0
    }
}
' file_a.txt file_b.txt
I created a 9000 record test file (file_a.txt) and a 50000 record test file (file_b.txt) that consisted of one key field in each and the process took 1/5 second.
Repeating file_b.txt 10x in my example simulates your 500,000 record file, which processed in 2.88 seconds.
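Since you mention having about 50 files like file b, you could wrap the same idea in a shell loop. A rough sketch; I'm assuming the files to be filtered match a glob like file_b*.txt and that you want one deltas file per input (adjust the pattern and output names to your real setup). Note the hard-coded file_b.txt test is replaced by a next in the first block so the same script works for any second file:
for f in file_b*.txt
do
    nawk '
    # Collect keys from file a, then skip to the next record
    FILENAME == "file_a.txt" { Keys[$1]++; next }
    # Any record from the other file whose key was not seen in file a
    ! Keys[$1]               { print }
    ' file_a.txt "$f" > "$f.deltas"
done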
Also, the script can be simplified in the file_b.txt segment as follows:
FILENAME=="file_b.txt" && ! Keys[$1] { print > "deltas.txt" }
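Put together with the file_a.txt block above, the whole simplified script would read something like:
nawk '
FILENAME == "file_a.txt" { Keys[$1]++ }
FILENAME == "file_b.txt" && ! Keys[$1] { print > "deltas.txt" }
' file_a.txt file_b.txt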
Try this
egrep -v -f "$1" fileName > saveFile
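One caveat if you go this route: -f treats each line of $1 as a regular expression and matches it anywhere in the line, so lines containing regex metacharacters or partial matches can remove more than you intend. For exact, whole-line, fixed-string matches, something like this should do it:
grep -v -x -F -f "$1" fileName > saveFile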
How is this going to meet the OP's objective of speed? Matching 9,000 patterns against 500,000 lines with egrep is really slow.