Removing file lines that each match one of many different patterns

I have a very large file (10,000,000 lines) that contains a sample id and a property of that sample on each line. I have another file of around 1,000,000 lines with sample ids that I want to remove from the original file (i.e. create a new file without those lines).
I know how to do this in Perl, but it is too time-consuming to run. I am aware that sed and awk should be able to complete this task much faster. I have tried to write commands that I thought would work, even after consulting previous posts, but none seem to quite cover it. I also find it hard to debug, because the server I'm working on is set to French, so I don't understand the error messages my commands produce.

Please could anyone suggest a quick way of achieving this?

Here are examples of the files I'm dealing with.

Here is the tab-delimited file of sample ids and properties:

HELIUM:1:2:3      ABCDEF
HELIUM:1:2:4      ADEFBC
HELIUM:1:2:5      BDFACE
HELIUM:1:2:6      BEBACG
HELIUM:1:2:7      ABCDEF
HELIUM:1:2:8      ADEFBC
HELIUM:1:2:9      BDFACE
HELIUM:1:3:0      BEBACG

Here is a list of ids I wish to remove (the common prefix is missing):

:1:2:3
:1:2:5
:1:2:6
:1:2:9

Many thanks in advance for any help you can provide.

Have you tried using grep? Use it like this:

grep -v -f "ids.txt" "sample-id-property.txt" > remainder.txt

Please check how much time it consumes.

grep -v -f ids_to_remove_file original_file
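
If the ids are plain fixed strings, as in the example, adding -F tells grep not to treat each pattern as a regular expression, which is often noticeably faster with a large pattern file; a sketch using the same filenames as above:

grep -F -v -f "ids.txt" "sample-id-property.txt" > remainder.txt

One caveat: with or without -F these patterns are unanchored substring matches, so an id like :1:2:3 would also knock out a line containing :1:2:30.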

If your second example, 1:2:3, is representative of the actual contents of the small file, i.e. it has no prefix and no suffixed data either:

awk -F':' ' FILENAME == "smallfile" { arr[$1 $2 $3]++ }                             # build a lookup table of ids to remove
            FILENAME == "bigfile"   { tmp = $2 $3 $4; if (tmp in arr) next; print $0 }  # skip lines whose id is in the table
          ' smallfile bigfile > newfile

Also

export LC_ALL=C

may help with your error-message language problem: it forces the C locale, so utilities print their messages in English.

Great, thank you!
The grep works, but I was afraid it would run too slowly. I just tried a sample that searched 10,000 lines for 1,000 ids, and it finished in about 2 seconds. I'm rather happy with that. I just hope the full-size files don't add too much load.

@jim mcnamara: The awk works wonderfully, but how do I get the data into a new file rather than printing it to the screen?

Thanks a lot for the help with the language problem. I'll definitely use that.

Add the redirection at the end of the command: the > newfile part sends the output into a new file instead of printing it.
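
In general, the shell's redirection operators control where a command's standard output goes; some_command below is just a placeholder:

some_command >  newfile     # create newfile (or overwrite it) with the output
some_command >> newfile     # append the output to newfile instead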

For comparison, the awk does the same task in 0.4 seconds.
Many thanks!

grep -f searches through the entire small file for each line of input from the big file.
Some grep implementations map the -f file into memory, which speeds up traversing the -f dataset.
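
If you want to compare the two approaches on your own data, you can time each command; a minimal sketch using the filenames from earlier in the thread (output is discarded, since only the elapsed time matters here):

time grep -v -f ids.txt sample-id-property.txt > /dev/null
time awk -F':' ' FILENAME == "ids.txt" { arr[$1 $2 $3]++ }
                 FILENAME == "sample-id-property.txt" { tmp = $2 $3 $4; if (tmp in arr) next; print }
               ' ids.txt sample-id-property.txt > /dev/null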

Hi, sorry to come back to this a day later, but I've found that the awk doesn't catch all the lines that I wanted removed.

I abbreviated the ids I gave you, but I adjusted for that in the script you gave me.

The files look like this.

Id and property:
HELIUM:7:100:1000:1007#0/1 abbaabbbbb

Id only
:7:100:1000:1586#0/1

The code I used was:

awk -F':'  ' FILENAME=="ids_to_purge.txt" {arr[$2 $3 $4 $5]++}
    FILENAME=="id_property.txt" {tmp=$2 $3 $4 $5; if(tmp in arr) {next}; print $0 }
    ' ids_to_purge.txt id_property.txt >  output.txt

Since we've used ":" as the field delimiter, won't the tab and the property end up stuck to the end of $5 in tmp? Thus the match won't be found.

I'd really appreciate your thoughts on how to fix this.
J

You can have multiple delimiters. I am going to use \t to represent a tab, but when you hit the Tab key it will not make a \t appear on your screen.

awk -F '[\t :]'  ' { awk program goes here }'  inputfile > outputfile

This makes tab, space, and colon characters all act as field delimiters. It also shows why
presenting a real data sample from the get-go would have made things work right the first time.
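
Putting the multi-delimiter field separator together with the earlier script, the whole command would look something like this (a sketch based on the filenames in this thread; most modern awks interpret the \t in -F as a tab):

awk -F'[\t :]' ' FILENAME == "ids_to_purge.txt" { arr[$2 $3 $4 $5]++ }                  # ids to remove
                 FILENAME == "id_property.txt"  { tmp = $2 $3 $4 $5; if (tmp in arr) next; print }
               ' ids_to_purge.txt id_property.txt > output.txt

Note that concatenating fields without a separator can in principle make two different ids collide (e.g. 1:23 and 12:3 both become 123); if that is a concern, join the fields with some separator character in both places.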

N.B.:
On Solaris try nawk instead of awk.