awk function to remove lines that contain contents of another file

aberg · September 19, 2017, 7:30am

Hi,

I'd be grateful for your help with the following. I have a file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:

ID     Col1    Col2  Col3....
a        4         2       8
b        5         6       1
c        8         4       1
d        3         5       9
e        8         5       2

I'd like to remove all the lines where, say, "b" and "d" appear in the first (ID) column. The output that I want is:

ID     Col1    Col2  Col3....
a        4         2       8
c        8         4       1
e        8         5       2

In reality, there are about 100,000 lines that I want to remove.
I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.

I am using grep at the moment, and while it works, it is proving painfully slow.

grep -v -f referencefile.txt file.txt

Is there a way of using awk (or anything else for that matter) to speed up the process?

Many thanks.

AB

jim_mcnamara · September 19, 2017, 8:52am

This can need a lot of memory, depending on how much you have in reference.txt.
Here is a simple awk version. It could be rewritten more compactly, but that would be harder to read for non-awkers.
We have posters who do that, which is okay as long as you can follow what they show you.

# code assumes that reference.txt lists the field #1 (ID) values from inputfile that should be removed, one per line

awk ' FILENAME=="reference.txt" {! arr[$0]++; next}  # create an array of values 
         FILENAME=="inputfile" { if(!($1 in arr)) {print $0}; next} ' reference.txt inputfile > outputfile

aberg · September 19, 2017, 9:26am

Thanks Jim - that works. Much appreciated.

A.B.

RudiC · September 19, 2017, 5:10pm

It would be interesting to know what performance gain you see. Can you time both approaches and post the results?
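
One way to run such a comparison is sketched below. It is only a rough harness under stated assumptions: the data is synthetic (2000 rows, far smaller than the real file, so scale the loop bounds up for meaningful numbers), and the names timed, out_grep.txt and out_awk.txt are made up for the demo. The awk command uses the parenthesized !($1 in arr) test on the first field.

```shell
# Scratch directory and data sizes are placeholders for this demo.
cd "$(mktemp -d)" || exit 1

# Use the time facility when one is available; otherwise just run the command.
timed() {
    if command -v time >/dev/null 2>&1; then time "$@"; else "$@"; fi
}

# Synthetic file.txt: a header plus 2000 rows with zero-padded IDs.
awk 'BEGIN {
    print "ID Col1 Col2 Col3"
    for (i = 1; i <= 2000; i++) printf "id%05d %d %d %d\n", i, i % 7, i % 5, i % 3
}' > file.txt

# Reference file: every fifth ID, one per line.
awk 'BEGIN { for (i = 5; i <= 2000; i += 5) printf "id%05d\n", i }' > referencefile.txt

# Approach 1: grep as in the first post.
timed grep -v -f referencefile.txt file.txt > out_grep.txt

# Approach 2: awk lookup on the first field.
timed awk 'FILENAME == "referencefile.txt" { arr[$1]; next }
           !($1 in arr)' referencefile.txt file.txt > out_awk.txt

# Both approaches should keep the same lines on this data.
cmp out_grep.txt out_awk.txt && echo "outputs match"
```

The timings are written to stderr, so they do not end up in the output files. The zero-padded IDs are there so that no ID is a substring of another. grep matches the patterns anywhere in the line, so on real data the two commands need not give identical results.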

MadeInGermany · September 20, 2017, 1:52am

I do not understand the ! and ++ in {! arr[$0]++; next}
Replace it with {arr[$1]; next}. Merely referencing arr[$1] creates the key, and not storing a value in the array saves some memory! Using $1 instead of $0 strips surrounding whitespace, which makes sense if there is invisible trailing space (and embedded spaces wouldn't work anyway when later comparing with $1). The next jumps to the next input line, so there is no need to check FILENAME again. {print $0} is the default action if there is just a condition.

awk ' FILENAME=="reference.txt" {arr[$1]; next}  # create an array without values 
        !($1 in arr)' reference.txt inputfile > outputfile
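
For anyone who wants to try this on the sample from the first post, here is a minimal sketch. It recreates file.txt and referencefile.txt as described in the question and uses those names instead of reference.txt and inputfile. The scratch directory from mktemp is just an assumption so that nothing existing gets overwritten.

```shell
# Work in a scratch directory (demo assumption).
cd "$(mktemp -d)" || exit 1

# Sample data from the question.
cat > file.txt <<'EOF'
ID     Col1    Col2  Col3
a        4         2       8
b        5         6       1
c        8         4       1
d        3         5       9
e        8         5       2
EOF

# IDs to remove, one per line.
printf '%s\n' b d > referencefile.txt

# First file: remember each ID as an array key (no value stored).
# Second file: print only the lines whose first field is not a stored key.
awk 'FILENAME == "referencefile.txt" { arr[$1]; next }
     !($1 in arr)' referencefile.txt file.txt > outputfile

cat outputfile
```

This should leave the header plus the a, c and e lines in outputfile. The parentheses in !($1 in arr) matter: without them the ! applies to $1 alone, before the in test.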