I have a very large file (10,000,000 lines) that contains a sample id and a property of that sample. I have another file of around 1,000,000 lines with sample ids that I want to remove from the original file (i.e. create a new file without those lines).
I know how to do this in Perl, but it is too time-consuming to run. I am aware that sed and awk should be able to complete this task much faster. I have tried to write commands that I thought would work, and consulted previous posts, but none seem to quite cover it. I also find it hard to debug, as the server I'm working on is set to French, so I don't understand the error messages my commands produce.
Could anyone please suggest a quick way of achieving this?
Great, thank you!
The grep works; I was afraid it would also run too slowly, but I just did a test that searched 10,000 lines for 1,000 ids and it finished in about 2 seconds. I'm rather happy with that. I just hope the full-size files don't add too much load.
@jim mcnamara: The awk works wonderfully, but how do I get the data into a new file rather than printing it?
Thanks a lot for the help with the language problem. I'll definitely use that.
grep -f searches through the entire small file for each line of input from the big file.
Some grep implementations map the -f file into memory, which speeds up traversing the -f dataset.
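For the task in the original post, a minimal sketch of the inverted match would look like this (the file names ids.txt, bigfile, and newfile are placeholders, and this assumes the unwanted ids sit one per line in ids.txt):

# keep only the lines of bigfile that do not match any id in ids.txt:
# -v inverts the match, -F treats each id as a fixed string rather than a regex,
# -w matches ids only as whole words (so id "123" doesn't also remove "1234"),
# and -f reads the patterns from a file
grep -v -F -w -f ids.txt bigfile > newfile

The -F flag also matters for speed: with a million patterns, fixed-string matching is usually far faster than treating each id as a regular expression.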
You can have multiple delimiters. I am going to write \t to represent a tab; note that pressing the tab key will not make a literal \t appear on your screen.
awk -F '[\t :]' '{ awk program goes here }' inputfile > outputfile
This makes the tab, space, and colon characters into field delimiters. It also shows why presenting a real data sample from the get-go would have made things work right the first time.
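For completeness, here is one way the whole job could be done in awk alone, which also answers the question about getting the data into a new file. This is only a sketch with placeholder file names, and it assumes the sample id is the first field of both files:

# first pass (NR == FNR is true only while reading ids.txt): remember each unwanted id
# second pass: print a line of bigfile only when its first field was not remembered;
# the shell redirection sends the surviving lines to newfile
awk 'NR == FNR { drop[$1] = 1; next } !($1 in drop)' ids.txt bigfile > newfile

Because the ids are held in an array (a hash table), each line of the big file costs one lookup, so this reads both files exactly once instead of rescanning the id list per line.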