search from a list of words

Hello,
I'm trying to write a bash script that searches another list for each word in one list. For every word that is found, it should create a new text file containing the matching records.

For example, list1.txt contains the following:

Dog
Cat
Fish

list2.txt contains the following:

Dog - Buddy 14
Charlie - Rhino
Bird - Steph 32
Ralph - Dog
Cat - John
Mike - Fish

Since Dog and Cat are found in both files, two files will be created. The first file will be called Dog.txt and will contain

Dog - Buddy 14
Ralph - Dog

The second file will be called Cat.txt and will have

Cat - John

Here's what I have so far. I'm stuck and not quite sure how to proceed:

#!/bin/bash
for $i in list1.txt; do
grep -wi '$i' list2.txt >> $i.txt
done

I'm dealing with VERY large files: list1.txt contains 213 entries while list2.txt contains 12,000 entries. I think I'm on the right track, but my method seems like it would take a VERY long time since it greps the entire file on every iteration of the FOR LOOP (yikes!)

Any help would be greatly appreciated.

Yes, making a pass across your 12,000-record data file for each entry in the list isn't very efficient. The first thing I'll point out is that your for loop won't iterate over the contents of the list, only over the file name itself. You'd need something like this:

#!/bin/bash
while read -r i
do
    grep -wi "$i" list2.txt >> "$i".txt
done < list1.txt

This reads the contents of list1.txt, placing each line into the variable i. Note the double quotes around $i; with single quotes, grep would search for the literal string $i. Still not efficient, but I wanted to point out the problems with your code.

Using awk, you can make one pass across each file. Way more efficient in terms of number of I/O operations, though not as efficient as writing a programme to do the same thing in C.

#!/usr/bin/env ksh

# assume list1 list2 are placed on the command line
awk -v list="$1" '
    BEGIN {
        while( (getline<list) > 0 )   # load all target words from first list
            targets[$1] = 1;
        close( list );
    }

    {
        for( i = 1; i <= NF; i++ )  # examine each token to see if it is a target
        {
            if( targets[$(i)] )   # if this token in the input is in the target list, save the line
            {
                printf( "%s\n", $0 ) >>$(i) ".txt";
                close( $(i) ".txt" );    # prevent problems if process limit for number of open files is small
                break;      # remove if line can have multiple targets
            }
            else
                delete targets[$(i)];    # prevent an entry for every word
        }
    }
' "$2"
exit

You could make this more efficient by tracking the most recently used files and allowing awk to keep some number of them (say 100) open while closing the rest. The programme would then execute far fewer opens and closes on the output files. You'd probably not have any issue keeping all 213 of them open, but if your target list grows, or your system has smallish quotas on open files, you could run into trouble, which is why I suggested closing the file after each write.

Another, and easier, way would be to write a single intermediate file of the form <filename> <text>. Once the initial processing is finished, the intermediate file can be sorted and a single pass made to write each separate file. This has the advantage of opening and closing each output file just once, and thus avoids the efficiency problem in my example above. A sketch of that approach follows.
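Here's a rough sketch of that second approach, again assuming list1 and list2 are placed on the command line. Untested, but it shows the shape of it: the first awk writes every matching line to stdout prefixed with its destination file name, sort groups the lines by destination, and the second awk splits them out, opening and closing each output file exactly once.

#!/usr/bin/env ksh

awk -v list="$1" '
    BEGIN {
        while( (getline<list) > 0 )   # load all target words from first list
            targets[$1] = 1;
        close( list );
    }

    {
        for( i = 1; i <= NF; i++ )
            if( $(i) in targets )     # "in" tests membership without creating an entry
            {
                printf( "%s.txt %s\n", $(i), $0 );   # prefix line with its destination
                break;
            }
    }
' "$2" | sort | awk '
    {
        fname = $1;
        sub( /^[^ ]+ /, "" );         # strip the prefix, keep the original line
        if( fname != last )           # new destination, so close the previous file
        {
            if( last != "" )
                close( last );
            last = fname;
        }
        print >> fname;
    }
'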

The need for the delete stems from the way awk handles array references: testing targets[foo] when that entry does not exist creates it. Without the delete, the hash would eventually contain an entry for every word in the list2.txt file rather than just the ones from the first list. These extra entries all have the value 0, so the programme still works, but the memory usage is unnecessarily large. The delete statement prevents awk from keeping entries in the target hash that have a zero value, but it adds to the execution time.
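For what it's worth, testing membership with awk's in operator avoids the problem entirely: it doesn't create the entry, so the delete (and its execution-time cost) can be dropped. That's why the sketch above uses it. The loop body in my first script would become:

        for( i = 1; i <= NF; i++ )   # "in" tests without creating an entry
        {
            if( $(i) in targets )
            {
                printf( "%s\n", $0 ) >> ($(i) ".txt");
                close( $(i) ".txt" );
                break;
            }
        }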

Thanks for the reply. I understand that my method is inefficient, but I was wondering why the following won't work. Do I have a syntax error somewhere? When I run the following code, I get the error "syntax error near unexpected token 'done'":

#!/bin/bash
while read word; do
grep -w "$word" list2.txt
done < list1.txt >> "$word".txt
cat "$word".txt

When I run the command

grep -w SAMPLE_TEXT list2.txt

it gives me the desired output.

You're on the right track. The redirection to $word.txt needs to happen inside the loop. Yes, you can redirect the output of the loop to a file, but that output file is opened once by the shell at the start of the loop. When the loop starts, $word is still empty, which is what's triggering your error (there's no real file name after >>). This is the small change that will get you going:

#!/bin/bash
while read -r word; do
    grep -w "$word" list2.txt >> "$word".txt
done < list1.txt

Further, your cat command will only see the last word from list1 unless you put it into a loop too:

while read -r word
do
    echo "===== $word.txt ======="
    cat "$word".txt
done < list1.txt