Performance issue with fgrep -vf file1 file2>file3

Hi all,

My requirement: I have two files, file1 and file2, and the content of file1 is present in file2. I want a final file that has the contents of file2 but not the contents of file1, so I am running the command below.

fgrep -vf file1 file2 > file3

Number of records in file1: 65282
Number of records in file2: 88187

fgrep -v is taking 25 minutes to complete the operation. Is there any other way to do the same operation in less time?
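For scale: checking 65,282 fixed strings against 88,187 lines is on the order of 5.8 billion substring tests if done naively, so the slowness is not surprising. One cheap thing worth trying first, assuming a multibyte (UTF-8) locale is currently in effect, is forcing the C locale, which often speeds up grep's fixed-string matching considerably:

LC_ALL=C fgrep -vf file1 file2 > file3

This is only a guess about the environment; if the locale is already C/POSIX it will not help.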

You could create a RAM disk and place the files on it, then do the fgrep. Just a thought.

Thanks, blackrageous!! But I don't have the privilege to set up a RAM disk at the office. Is there another way, using awk or sed?

Sort both files and use comm, or, if you are sure ALL file1 lines are found in file2, try sort -u file1 file2.
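For the record, a comm-based sketch, assuming a shell with process substitution (e.g. bash) and that file1 lines occur in file2 verbatim, could look like:

comm -13 <(sort file1) <(sort file2) > file3

comm -13 suppresses column 1 (lines only in file1) and column 3 (lines in both), leaving only the lines unique to file2. As the sample data below shows, though, the file1 lines here are not verbatim lines of file2, so this only works once the extra fields are stripped away.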

Hi RudiC,

Please have a look at the contents of the files below.

file1

QC20011                         063890404       02002
QC20011                         059400669       02002
QC20011                         063309945       02002
QC20011                         064005208       02002
QC20011                         064426764       02002
QC20011                         070251327       02002
QC20011                         065565551       02006

file2

3       QC20011                         063890404       02002   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02007   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02012   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02015   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02017   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814
3       QC20011                         063890404       02019   JONES & CO MYA MIRROR CLOCK                              00000.00        00000.00        00000.00                       20130729        20130814

Do you think comm or sort -u will work?

I'm not sure that this will run faster than what you tried before, but nevertheless give it a shot:

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2 

Great, thanks so much RudiC! It worked, and it is much faster than before; it completed within 2 minutes.

Can you please also explain the code below? What does it mean?

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2
awk     'NR==FNR        {T[$1$2$3]; next}       # read file1 into array T, indexed by $1, $2, $3 concatenated
         ($2$3$4 in T)  {next}                  # if concatenated $2, $3, $4 is found in T, the record existed in file1; skip to the next line
         1                                      # print the original line from file2, as it is not in file1
        ' file1 file2
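One side note, not raised in the thread: concatenating fields as $1$2$3 builds the key without a separator, so in principle two different records could collide (e.g. fields AB and C versus A and BC concatenate identically). If that edge case matters for your data, awk's comma subscripts insert the SUBSEP character between the fields and avoid it:

awk 'NR==FNR {T[$1,$2,$3]; next} (($2,$3,$4) in T) {next} 1' file1 file2

With the fixed-width fields in the sample data above, collisions are unlikely, so this is just a defensive variant.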

Hi RudiC,

Thanks for your explanation, but I have a query about it.

awk 'NR==FNR {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2

Regarding the above code: I just found that if file1 is empty, it does not work. Ideally, even if the size of file1 is zero, it should return all the content of file2.

Yes. If file1 has zero length, NR==FNR stays true for every line of file2, so file2 is read as if it were file1 and there is nothing left to work upon.
Try

awk 'FILENAME==ARGV[1] {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' file1 file2
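A quick sanity check for the empty-file1 case, a hypothetical test rather than something from the thread, using /dev/null as a guaranteed-empty stand-in for file1:

awk 'FILENAME==ARGV[1] {T[$1$2$3]; next} ($2$3$4 in T) {next} 1' /dev/null file2 > file3
cmp file2 file3    # no output and exit status 0: file3 is identical to file2

This works because FILENAME only equals ARGV[1] while records are actually being read from the first file, so an empty first file contributes nothing and every file2 line falls through to the final 1 (print).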

Perfect, thanks RudiC!!