Performance issue with fgrep -vf file1 file2>file3

Hi all,

My requirement: I have two files, file1 and file2, and the content of file1 is present in file2. I want a final file that has the contents of file2 but not the contents of file1, so I am running the command below.

fgrep -vf file1 file2 > file3

Number of records in file1: 65282
Number of records in file2: 88187

fgrep -v is taking 25 minutes to complete the operation. Is there any other way to do the same operation in less time?
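For scale: checking 65,282 fixed strings against 88,187 lines is on the order of 5.8 billion substring tests if done naively, so the slowness is not surprising. One cheap thing worth trying first, assuming a multibyte (UTF-8) locale is currently in effect, is forcing the C locale, which often speeds up grep's fixed-string matching considerably:

LC_ALL=C fgrep -vf file1 file2 > file3

This is only a guess about the environment; if the locale is already C/POSIX it will not help.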

You could create a RAM disk and place the files on it, then do the fgrep. Just a thought.

Thanks, blackrageous!! But I don't have the privilege to set up a RAM disk at the office. Is there another way, using awk or sed?

Sort both files and use comm, or, if you are sure ALL file1 lines are found in file2, try sort -u file1 file2.
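For the record, a comm-based sketch, assuming a shell with process substitution (e.g. bash) and that file1 lines occur in file2 verbatim, could look like:

comm -13 <(sort file1) <(sort file2) > file3

comm -13 suppresses column 1 (lines only in file1) and column 3 (lines in both), leaving only the lines unique to file2. As the sample data below shows, though, the file1 lines here are not verbatim lines of file2, so this only works once the extra fields are stripped away.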

Hi RudiC,

Please have a look at the contents of the files below.

file1

QC20011                         063890404       02002
QC20011                         059400669       02002
QC20011                         063309945       02002
QC20011                         064005208       02002
QC20011                         064426764       02002
QC20011                         070251327       02002
QC20011                         065565551       02006

file2

3       QC20011                         063890404       02002   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02007   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02012   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02015   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02017   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02019   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814

Do you think comm or sort -u will work?

I'm not sure that this will run faster than what you tried before, but nevertheless give it a shot:

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2 

Great, thanks so much RudiC! It worked, and it is much faster than before; it completed within 2 minutes.

Can you please also explain the code below? What does it mean?

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2
awk     'NR==FNR        {T[$1$2$3]; next}       # read file1 into array T, indexed by $1, $2, $3 concatenated
         ($2$3$4 in T)  {next}                  # if concatenated $2, $3, $4 is found in T, the record existed in file1; skip to the next line
         1                                      # print the original line from file2, as it is not in file1
        ' file1 file2
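One side note, not raised in the thread: concatenating fields as $1$2$3 builds the key without a separator, so in principle two different records could collide (e.g. fields AB and C versus A and BC concatenate identically). If that edge case matters for your data, awk's comma subscripts insert the SUBSEP character between the fields and avoid it:

awk 'NR==FNR {T[$1,$2,$3]; next} (($2,$3,$4) in T) {next} 1' file1 file2

With the fixed-width fields in the sample data above, collisions are unlikely, so this is just a defensive variant.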

Hi RudiC,

Thanks for your explanation, but I have a query about it.

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2

Regarding the above code: I just found that if file1 is empty, it does not work. Ideally, even if the size of file1 is zero, it should return all the content of file2.

Yes. If file1 has zero length, NR==FNR stays true for every line of file2, so file2 is read as if it were file1 and there is nothing left to work upon.
Try

awk 'FILENAME==ARGV[1] {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2
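A quick sanity check for the empty-file1 case, a hypothetical test rather than something from the thread, using /dev/null as a guaranteed-empty stand-in for file1:

awk 'FILENAME==ARGV[1] {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' /dev/null file2 > file3
cmp file2 file3    # no output and exit status 0: file3 is identical to file2

This works because FILENAME only equals ARGV[1] while records are actually being read from the first file, so an empty first file contributes nothing and every file2 line falls through to the final 1 (print).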

Perfect, thanks RudiC!!