Pattern exclusion between two files

verse123 · March 3, 2014, 2:08am

I have a large file (file1) that has 96770154 lines and a smaller file (file2) that has 3 lines. I want to remove all of the strings from file1 that occur in file2.

file1 looks like this:

DOGDOGNODOGTESTCAT
CATHELLOBYEEBYEFAT
CATCATDOGDOGCATYESGOOD

file2 looks like this:

YES
NO
GOOD

The output should look like this:

DOGDOGNODOGTESTCAT
CATCATDOGDOGCATYESGOOD

This is what I have so far, but the output is 6 lines instead of 2, for a reason I don't understand:

while read A B; do grep -v $A file1; done < file2 > out

SriniShoo · March 3, 2014, 2:27am

Try this:

awk '{if(NR == FNR) {a[$0]} else {for (x in a) {if($0 ~ x) {print $0; next}}}}' file2 file1

Scrutinizer · March 3, 2014, 2:34am

The while loop calls grep three times (grep -v YES, grep -v NO, and grep -v GOOD), and the output of all three calls is concatenated into out. Each call prints 2 lines of your sample, which is why you end up with 6.

Alternatively, you could do it without a while loop:

grep -vf file2 file1

Note:

DOGDOGNODOGTESTCAT

will be left out, since it contains the string "NO".
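For comparison, here is a small sketch that runs the loop and the single grep call on the sample data from the first post. The scratch directory name demo and the -F flag (treat the patterns as fixed strings rather than regular expressions) are assumptions added for illustration. Judging by the expected output in the first post, the lines that contain a pattern may be what is actually wanted; if so, dropping -v would produce them.

```shell
# Recreate the sample files in a scratch directory (the name "demo" is made up).
mkdir -p demo && cd demo || exit 1
printf '%s\n' DOGDOGNODOGTESTCAT CATHELLOBYEEBYEFAT CATCATDOGDOGCATYESGOOD > file1
printf '%s\n' YES NO GOOD > file2

# The loop runs grep once per pattern; all three results end up in one file.
while read -r A; do grep -v -- "$A" file1; done < file2 > loop.out

# One grep call with every pattern: keep lines that contain none of them.
grep -vFf file2 file1 > excluded.out

# Without -v: keep lines that contain at least one of the patterns.
grep -Ff file2 file1 > matched.out

grep -c '' loop.out    # 6
cat excluded.out       # CATHELLOBYEEBYEFAT
cat matched.out        # DOGDOGNODOGTESTCAT and CATCATDOGDOGCATYESGOOD
```

On a file with tens of millions of lines, a single grep pass should also be noticeably cheaper than one pass per pattern.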

verse123 · March 3, 2014, 2:48am

Is there any way to write each of the output lines to its own separate file?

Scrutinizer · March 3, 2014, 3:12am

You mean every line of the output is a separate file?

grep -vf file2 file1 | split -l 1

SriniShoo · March 3, 2014, 3:20am

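All lines that match YES will go to output_YES, those that match NO will go to output_NO, and so on.
However, if a line contains both YES and NO, the code below sends it to only one of those files (whichever pattern the for loop reaches first, since awk does not guarantee the iteration order).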

awk '{if(NR == FNR) {a[$0]} else {for (x in a) {if($0 ~ x) {print $0 > ("output_" x); next}}}}' file2 file1

If you want a line that contains both YES and NO to be sent to both output files, use the following:

awk '{if(NR == FNR) {a[$0]} else {for (x in a) {if($0 ~ x) {print $0 > ("output_" x)}}}}' file2 file1
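A hedged variant of the second command, run on the sample data from the first post. It is a sketch, not the original code: it assumes the patterns are literal strings and so uses index() for a substring test instead of the regex match, and the array name pats and the scratch directory demo_split are made up. It also writes the output name as ("output_" p), since an unparenthesized concatenation after > is not parsed the same way by every awk.

```shell
# Recreate the sample files in a scratch directory (the name is made up).
mkdir -p demo_split && cd demo_split || exit 1
printf '%s\n' DOGDOGNODOGTESTCAT CATHELLOBYEEBYEFAT CATCATDOGDOGCATYESGOOD > file1
printf '%s\n' YES NO GOOD > file2

awk '
    NR == FNR { pats[$0]; next }        # first file: remember each pattern
    {
        for (p in pats)                 # second file: test every pattern
            if (index($0, p))           # literal substring test, no regex
                print > ("output_" p)   # a line may land in several files
    }
' file2 file1

cat output_NO      # DOGDOGNODOGTESTCAT
cat output_YES     # CATCATDOGDOGCATYESGOOD
cat output_GOOD    # CATCATDOGDOGCATYESGOOD
```

With only three patterns the number of open output files is not a concern; with a very long pattern list some awk implementations may run into their open-file limit.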

verse123 · March 3, 2014, 8:58pm

This isn't quite what I mean. In general, if I have a while loop like this:

while read A B; do
    something
    something else file1
done < file2 > out

How can I write a separate output file each time the while loop completes one cycle? The length of each cycle's output can vary, so I can't use

split -l 1

Chubler_XL · March 3, 2014, 9:15pm

Is something like this what you're after?

while read A B
do 
    ( something
     something else file1 ) > out_$A
done < file2

verse123 · March 3, 2014, 9:19pm

So I am trying to do something like this, but what goes into, say, 1.out should not be the same output that goes into 2.out. Each out file should hold the output for a single line read by the while loop (but that output can vary in length).

for ((i=1;i<=3;i++));
do
        while read A B
        do
                awk '{if(NR == FNR) {a[$0]} else {for (x in a) {if($0 ~ x) {print $0; next}}}}' file2 file1 > $i.out
        done
done

Chubler_XL · March 3, 2014, 9:23pm

You want something more like this then:

while read A B
do
    ((i++))
    awk '{if(NR == FNR) {a[$0]} else {for (x in a) {if($0 ~ x) {print $0; next}}}}' file2 file1 > "$i.out"
done < file2
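If each numbered file should hold only the lines that match the pattern read in that iteration, one possible sketch follows. It is a variation, not the code posted above: it reads the patterns from file2, hands the current one to awk with -v (the variable name pat is made up), treats it as a literal string via index(), and sets i explicitly. The scratch directory demo_loop is also hypothetical.

```shell
# Recreate the sample files in a scratch directory (the name is made up).
mkdir -p demo_loop && cd demo_loop || exit 1
printf '%s\n' DOGDOGNODOGTESTCAT CATHELLOBYEEBYEFAT CATCATDOGDOGCATYESGOOD > file1
printf '%s\n' YES NO GOOD > file2

i=0
while read -r A
do
    i=$((i + 1))
    # Only the pattern from this iteration is used, so each file is built separately.
    awk -v pat="$A" 'index($0, pat)' file1 > "$i.out"
done < file2

cat 1.out    # lines containing YES:  CATCATDOGDOGCATYESGOOD
cat 2.out    # lines containing NO:   DOGDOGNODOGTESTCAT
cat 3.out    # lines containing GOOD: CATCATDOGDOGCATYESGOOD
```

Note that this scans file1 once per pattern, which may be slow on a very large file with many patterns.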

Scrutinizer · March 4, 2014, 12:05am

The general solution in #8 should be usable, but it would be better to use command grouping { } instead of a new subshell ( ) for every iteration.

for/while something
do
  { 
    code segment 
  } > file$((i+=1))
done
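A runnable instance of that skeleton, as a sketch on the sample data from the first post. The two commands inside the braces are placeholders chosen for illustration, and the output prefix is changed to part (a made-up name) because file1 and file2 are already the input names in this thread and would be overwritten otherwise. The scratch directory demo_group is hypothetical as well.

```shell
# Recreate the sample files in a scratch directory (the name is made up).
mkdir -p demo_group && cd demo_group || exit 1
printf '%s\n' DOGDOGNODOGTESTCAT CATHELLOBYEEBYEFAT CATCATDOGDOGCATYESGOOD > file1
printf '%s\n' YES NO GOOD > file2

i=0
while read -r A
do
    {
        # Everything printed inside the braces goes to this iteration's file.
        echo "pattern: $A"
        grep -F -- "$A" file1
    } > "part$((i+=1))"
done < file2

cat part1 part2 part3
```

Because the braces run in the current shell, the increment of i in the redirection carries over to the next iteration, and no extra process is started per cycle.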