Removing file lines that each match one of many different patterns

I have a very large file (10,000,000 lines) that contains a sample id and a property of that sample on each line. I have another file of around 1,000,000 lines with sample ids that I want to remove from the original file (i.e. create a new file without those lines).
I know how to do this in Perl, but it is too time-consuming to run. I am aware that sed and awk should be able to complete this task much faster. I have tried to write commands that I thought would work, even after consulting previous posts, but none seem to quite cover it. I also find it hard to debug, because the server I'm working on is set to French, so I don't understand the error messages my commands produce.

Please could anyone suggest a quick way of achieving this?

Here are examples of the files I'm dealing with.

Here is the tab-delimited file of sample ids and properties:

HELIUM:1:2:3      ABCDEF
HELIUM:1:2:4      ADEFBC
HELIUM:1:2:5      BDFACE
HELIUM:1:2:6      BEBACG
HELIUM:1:2:7      ABCDEF
HELIUM:1:2:8      ADEFBC
HELIUM:1:2:9      BDFACE
HELIUM:1:3:0      BEBACG

Here is a list of ids I wish to remove (the common prefix is missing):

:1:2:3
:1:2:5
:1:2:6
:1:2:9

Many thanks in advance for any help you can provide.

Have you tried using grep? Use it like this:

grep -v -f "ids.txt" "sample-id-property.txt" > remainder.txt

Please check how much time it consumes.

grep -v -f ids_to_remove_file original_file
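
If the ids are plain fixed strings, as in the example, adding -F tells grep not to treat each pattern as a regular expression, which is often noticeably faster with a large pattern file; a sketch using the same filenames as above:

grep -F -v -f "ids.txt" "sample-id-property.txt" > remainder.txt

One caveat: with or without -F these patterns are unanchored substring matches, so an id like :1:2:3 would also knock out a line containing :1:2:30.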

If your second example, 1:2:3, is representative of the actual contents of the small file, i.e. it has no prefix and no suffixed data either:

awk -F':' ' FILENAME == "smallfile" { arr[$1 $2 $3]++ }                             # build a lookup table of ids to remove
            FILENAME == "bigfile"   { tmp = $2 $3 $4; if (tmp in arr) next; print $0 }  # skip lines whose id is in the table
          ' smallfile bigfile > newfile

Also

export LC_ALL=C

may help with your error-message language problem: it forces the C locale, so utilities print their messages in English.

Great, thank you!
The grep works, but I was afraid it would run too slowly. I just tried a sample that searched 10,000 lines for 1,000 ids, and it finished in about 2 seconds. I'm rather happy with that. I just hope the full-size files don't add too much load.

@jim mcnamara: The awk works wonderfully, but how do I get the data into a new file rather than printing it to the screen?

Thanks a lot for the help with the language problem. I'll definitely use that.

Add the redirection at the end of the command: the > newfile part sends the output into a new file instead of printing it.
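
In general, the shell's redirection operators control where a command's standard output goes; some_command below is just a placeholder:

some_command >  newfile     # create newfile (or overwrite it) with the output
some_command >> newfile     # append the output to newfile instead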

For comparison, the awk does the same task in 0.4 seconds.
Many thanks!

grep -f searches through the entire small file for each line of input from the big file.
Some grep implementations map the -f file into memory, which speeds up traversing the -f dataset.
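
If you want to compare the two approaches on your own data, you can time each command; a minimal sketch using the filenames from earlier in the thread (output is discarded, since only the elapsed time matters here):

time grep -v -f ids.txt sample-id-property.txt > /dev/null
time awk -F':' ' FILENAME == "ids.txt" { arr[$1 $2 $3]++ }
                 FILENAME == "sample-id-property.txt" { tmp = $2 $3 $4; if (tmp in arr) next; print }
               ' ids.txt sample-id-property.txt > /dev/null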

Hi, sorry to come back to this a day later, but I've found that the awk doesn't catch all the lines that I wanted removed.

I abbreviated the ids I gave you, but I adjusted for that in the script you gave me.

The files look like this.

Id and property:
HELIUM:7:100:1000:1007#0/1 abbaabbbbb

Id only
:7:100:1000:1586#0/1

The code I used was:

awk -F':'  ' FILENAME=="ids_to_purge.txt" {arr[$2 $3 $4 $5]++}
    FILENAME=="id_property.txt" {tmp=$2 $3 $4 $5; if(tmp in arr) {next}; print $0 }
    ' ids_to_purge.txt id_property.txt >  output.txt

Since we've used ":" as the field delimiter, won't the tab and the property end up stuck to the end of $5 in tmp? Thus the match won't be found.

I'd really appreciate your thoughts on how to fix this.
J

You can have multiple delimiters. I am going to use \t to represent a tab, but when you hit the Tab key it will not make a \t appear on your screen.

awk -F '[\t :]'  ' { awk program goes here }'  inputfile > outputfile

This makes tab, space, and colon characters all act as field delimiters. It also shows why
presenting a real data sample from the get-go would have made things work right the first time.
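
Putting the multi-delimiter field separator together with the earlier script, the whole command would look something like this (a sketch based on the filenames in this thread; most modern awks interpret the \t in -F as a tab):

awk -F'[\t :]' ' FILENAME == "ids_to_purge.txt" { arr[$2 $3 $4 $5]++ }                  # ids to remove
                 FILENAME == "id_property.txt"  { tmp = $2 $3 $4 $5; if (tmp in arr) next; print }
               ' ids_to_purge.txt id_property.txt > output.txt

Note that concatenating fields without a separator can in principle make two different ids collide (e.g. 1:23 and 12:3 both become 123); if that is a concern, join the fields with some separator character in both places.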

N.B.:
On Solaris try nawk instead of awk.