I'd be grateful for your help with the following. I have a file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:
ID Col1 Col2 Col3....
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
I'd like to remove all the lines where, say, "b" and "d" appear in the first (ID) column. The output that I want is:
ID Col1 Col2 Col3....
a 4 2 8
c 8 4 1
e 8 5 2
In reality, there are about 100,000 lines that I want to remove.
I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.
I am using grep at the moment, and while it works, it is proving painfully slow.
grep -v -f referencefile.txt file.txt
Is there a way of using awk (or anything else for that matter) to speed up the process?
This can use a lot of memory, depending on how much you have in reference.txt.
It is simple awk. It could be rewritten more tersely, but that would be harder to read for non-awkers.
We have posters who do that, which is fine as long as you can follow what they show you.
# code assumes that the reference.txt file has field #1 from inputfile
awk ' FILENAME=="reference.txt" {! arr[$0]++; next} # create an array of values
FILENAME=="inputfile" { if (!($1 in arr)) {print $0}; next} ' reference.txt inputfile > outputfile
I do not understand the ! and ++ in {! arr[$0]++; next}
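The ! and ++ there only have the side effect of creating the element arr[$0]; the result of the expression is thrown away. Replace it with {arr[$1]; next}. Merely referencing arr[$1] creates the key, and not storing a value in the array saves some memory. Using $1 instead of $0 strips surrounding whitespace, which makes sense if there is invisible trailing space (IDs with embedded spaces would not work anyway, since they are later compared with $1). The next jumps to the next input line, so there is no need to check FILENAME again. And {print $0} is the default action when there is just a condition.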
awk ' FILENAME=="reference.txt" {arr[$1]; next} # create an array without values
!($1 in arr)' reference.txt inputfile > outputfile
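For anyone who wants to try this on the sample data from the first post, here is a minimal sketch. It assumes a scratch directory made with mktemp, and it swaps the FILENAME test for the common NR==FNR idiom (true only while the first file is being read), so the file names need not be hard-coded. The array name skip is just an illustrative choice. One caveat: if the reference file is empty, NR==FNR is also true for the data file, so the FILENAME version is safer in that case.

```shell
# Sample data from the question, written to a scratch directory (hypothetical paths).
dir=$(mktemp -d)
printf '%s\n' 'ID Col1 Col2 Col3' 'a 4 2 8' 'b 5 6 1' 'c 8 4 1' 'd 3 5 9' 'e 8 5 2' > "$dir/file.txt"
printf '%s\n' b d > "$dir/referencefile.txt"

# Reference file first, data file second.
awk 'NR==FNR {skip[$1]; next}   # first file: remember each ID as an array key
     !($1 in skip)              # second file: print lines whose ID was not listed
' "$dir/referencefile.txt" "$dir/file.txt" > "$dir/outputfile"

cat "$dir/outputfile"
# ID Col1 Col2 Col3
# a 4 2 8
# c 8 4 1
# e 8 5 2
```

The header line survives only because "ID" is not in the reference file. The lookup is an exact match on the first field, so an ID such as "b" does not remove a line whose ID is "ab".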