Hello,
I'm trying to write a bash script that will search for words from one list that may be found in another list. Once the record is found, it will create a new text file for each word.
For example, list1.txt contains the following:
Dog
Cat
Fish
list2.txt contains
Dog - Buddy 14
Charlie - Rhino
Bird - Steph 32
Ralph - Dog
Cat - John
Mike - Fish
Since Dog and Cat are found in both files, two files will be created. The first file will be called Dog.txt and will contain
Dog - Buddy 14
Ralph - Dog
The second file will be called Cat.txt and will have
Cat - John
Here's what I have so far. I'm stuck and not quite sure how to proceed:
#!/bin/bash
for $i in list1.txt; do
grep -wi '$i' list2.txt >> $i.txt
done
I'm dealing with VERY large files: list1.txt contains 213 entries while list2.txt contains 12,000 entries. I think I'm on the right track, but my method seems like it would take a VERY long time, since it runs grep over the whole of list2.txt once for every word in the loop (yikes!)
Yes, making a pass across your 12,000 record data file for each entry in the list isn't very efficient. The first thing I'll point out, though, is that your for loop will not iterate over the contents of the list, only over the file name itself (and "for $i in ..." is a syntax error; the variable in a for statement is written without the $). You'd need something like this:
#!/bin/bash
while read i
do
grep -wi "$i" list2.txt >> "$i".txt
done <list1.txt
This reads the contents of list1.txt placing each line into the variable i. Still not efficient, but I wanted to point out the problem with your code.
Using awk, you can make one pass across each file. Way more efficient in terms of the number of I/O operations, though still not as efficient as writing a programme to do the same thing in C.
#!/usr/bin/env ksh
# assume list1 list2 are placed on the command line
awk -v list="$1" '
BEGIN {
while( (getline<list) > 0 ) # load all target words from first list
targets[$1] = 1;
close( list );
}
{
for( i = 1; i <= NF; i++ ) # examine each token to see if it is a target
{
if( targets[$(i)] ) # if this token in the input is in the target list, save the line
{
printf( "%s\n", $0 ) >> ($(i) ".txt");
close( $(i) ".txt" ); # prevent problems if process limit for number of open files is small
break; # remove if line can have multiple targets
}
else
delete targets[$(i)]; # prevent an entry for every word
}
}
' "$2"
exit
You could make this more efficient by tracking the most recently used files, allowing awk to keep some number of them (say 100) open and closing the rest. The programme would then execute far fewer opens/closes on the output files. You'd probably not have any issue keeping all 213 of them open, but if your target list grows, or your system has smallish quotas on open files, you could have problems, which is why I suggested closing the file after each write.

Another, and easier, way would be to write a single intermediate file of the form <filename> <text>. Once the initial processing is finished, the intermediate file can be sorted and a single pass made over it to write each separate file. This has the advantage of opening/closing each output file just once and thus avoids the efficiency problem in my example above.
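That intermediate-file approach can be sketched like so. This is just an illustration, assuming list1.txt and list2.txt hold the sample data from earlier in the thread; a tab separates the tag from the text:

```shell
# Sketch of the sort-based approach: tag each matching line with its
# target word, sort by tag, then write each output file exactly once.
# Sample data as posted earlier in the thread.
printf '%s\n' Dog Cat Fish > list1.txt
printf '%s\n' 'Dog - Buddy 14' 'Charlie - Rhino' 'Bird - Steph 32' \
    'Ralph - Dog' 'Cat - John' 'Mike - Fish' > list2.txt

awk -v list=list1.txt '
BEGIN {
    while( (getline<list) > 0 )         # load the target words
        targets[$1] = 1;
    close( list );
}
{
    for( i = 1; i <= NF; i++ )
        if( $(i) in targets )           # "in" tests without creating entries
        {
            printf( "%s\t%s\n", $(i), $0 );  # tag line with its target word
            break;
        }
}
' list2.txt | sort | awk -F'\t' '
{
    out = $1 ".txt";
    if( out != prev )                   # input is sorted, so each output
    {                                   # file is opened exactly once
        if( prev != "" )
            close( prev );
        prev = out;
    }
    print $2 > out;                     # ">" truncates only on first open
}
'
```

One caveat: a plain sort may reorder lines that share the same tag; if the original order within each output file matters, sort on the tag field only with a stable sort (for example sort -t and -k1,1 with the stable flag, where your sort supports it).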
The need for the delete stems from the fact that awk creates an entry in the hash when the test is made (when targets[foo] does not yet exist, the mere reference creates it). Without the delete, the hash would eventually contain an entry for every word in the list2.txt file rather than just the ones from the first list. These extra entries all have the value 0, so the programme still works, but the memory usage is unnecessarily large. The delete statement prevents awk from keeping entries in the target hash that have a zero value, though it does add to the execution time.
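The behaviour is easy to see for yourself. This little snippet (just a demonstration, not part of the solution) counts the array entries after each style of test; the reference creates an entry, while the in operator does not:

```shell
# The mere reference targets["foo"] creates the entry; the "in" operator
# tests membership without creating one.  Prints two counts: "1 0".
result=$(awk 'BEGIN {
    x = targets["foo"];                 # reference: creates targets["foo"]
    n1 = 0; for( k in targets ) n1++;   # n1 is 1

    delete targets["foo"];
    y = ( "bar" in targets );           # "in" test: creates nothing
    n2 = 0; for( k in targets ) n2++;   # n2 is 0

    printf( "%d %d", n1, n2 );
}')
echo "$result"
```

So if all the awks you care about support the in operator (POSIX awk does), testing with (word in targets) avoids creating the junk entries in the first place and the delete is not needed.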
thanks for the reply. I understand that my method is inefficient, but I was wondering why the following won't work. Do I have a syntax error somewhere? When I run the following code, I get the error "syntax error near unexpected token 'done'"
#!/bin/bash
while read word; do
grep -w "$word" list2.txt
done < list1.txt >> "$word".txt
cat "$word".txt
You're on the right track. The redirection to $word.txt needs to happen inside the loop. Yes, you can redirect the output of a loop to a file, but that output file is opened once, by the shell, at the start of the loop. When the loop starts, $word is still empty, so the shell has nothing sensible after the >> to use as a file name, which is why you're seeing a syntax error. This is the small change that will get you going:
#!/bin/bash
while read word; do
grep -w "$word" list2.txt >> "$word".txt
done < list1.txt
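If you want to convince yourself of the "opened once" behaviour, here's a tiny demonstration (the file name is just for illustration):

```shell
# The redirection on "done" is performed once, before the loop body runs,
# so $word still has its old (empty) value when the file name is built.
word=""
while read -r w
do
    word=$w                     # changes $word, but the file is already open
done > "demo_${word}.txt" <<'EOF'
Dog
Cat
EOF
ls demo_*.txt                   # prints "demo_.txt" - the empty value won
```

Even though $word ends up holding "Cat", the output file was named while $word was still empty.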
Further, your cat command will only have the last word from list1 to work on unless you put it into a loop too:
while read word
do
echo "===== $word.txt ======="
cat "$word".txt
done <list1.txt
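Putting the two loops together, here's the whole thing as one runnable sketch using the sample data from the first post. The IFS= and -r on read are small additions of mine that guard against stray whitespace and backslashes in the word list, and the rm -f keeps re-runs from appending duplicates:

```shell
#!/bin/bash
# Build the sample data from the first post.
printf '%s\n' Dog Cat Fish > list1.txt
printf '%s\n' 'Dog - Buddy 14' 'Charlie - Rhino' 'Bird - Steph 32' \
    'Ralph - Dog' 'Cat - John' 'Mike - Fish' > list2.txt

rm -f Dog.txt Cat.txt Fish.txt      # ">>" appends, so start clean on re-runs

# One output file per word in list1.txt.
while IFS= read -r word
do
    grep -w "$word" list2.txt >> "$word".txt
done < list1.txt

# Report what ended up in each file.
while IFS= read -r word
do
    echo "===== $word.txt ======="
    cat "$word".txt
done < list1.txt
```

Note that the shell creates "$word".txt even when grep finds nothing, so words with no matches leave empty files behind; test with [ -s "$word".txt ] inside the loop if you only want files that got at least one line.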