Please suggest an alternative to grep

Hi Experts,
Please find my requirement below:
I have a file (named file1) containing numbers like:

372846078543002
372846078543003
372846078543004
372846078543005
372846078543006

I have another file (named file2) whose lines contain these numbers (from file1). E.g.:

lppza087; [2012-06-05 03:00:01,090] <PERFB  > JMSId :ID:414d51204c50505a41303837202020204f657ff1299e7bb7 SvcName :realtime.get.relationship Port :Port1 LobId :AMCSGCDERTUSUSD Card :372846078543002 SrcCd :16 versionNum :3.0 OO [MessageListenerThreadPool : 11  ] OO dao.CustDAO                       OO                 getRelnDetails() OO Entry : getRelnDetails
lppza087; [2012-06-05 03:00:01,100] <PERFB  > JMSId :ID:414d51204c50505a41303837202020204f657ff1299e7bb7 SvcName :realtime.get.relationship Port :Port1 LobId :AMCSGCDERTUSUSD Card :372846078543003 SrcCd :16 versionNum :3.0 OO [MessageListenerThreadPool : 11  ] OO dao.CustDAO                       OO                 getRelnDetails() OO Exit  : getRelnDetails

I need to grep from file2 all the lines containing the numbers present in file1.
One way would be to run a for loop over file1 and grep in file2, but my data volume is very high and it's taking 5-6 hours.
Can you please suggest the fastest way to achieve this (maybe using awk/sed)?

Hi

grep -f file1 file2

Guru.

Thanks Guru for your prompt response :)
But my 2nd file is 15 GB and the 1st file is 5 GB, so I just wanted to know whether this process can be made faster.
I was also wondering whether the lines obtained from file2 can be arranged in the same order as the search lines in file1.

grep, sed, and awk used this way would all do much the same thing: take the first file entry by entry and scan the second file for occurrences each time, i.e. chug through the 15 GB of file2 over and over again.

One way it could be done faster would be a script/program that reads the second file (where the wanted information looks like it is in the same place on every line), builds a hash/list of the numbers and their corresponding line numbers, and then only has to go through the first file once.
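
Untested, but the same single-pass idea can be sketched in awk (here loading file1's numbers into a lookup table instead, and assuming the number always follows "Card :" in file2):

awk 'NR == FNR { want[$1]; next }                  # first file: remember every number
     match($0, /Card :[0-9]+/) {                   # second file: pull out the Card value
         if (substr($0, RSTART + 6, RLENGTH - 6) in want) print
     }' file1 file2

Each file is read exactly once, though holding all of file1's numbers in memory may itself be a problem at 5 GB.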

Hi.

The first question is: does this absolutely need to be faster? How many times are you going to run it? If it's a single shot, then perhaps just letting it run to completion is the best solution.

Secondly, the first file looks like it is a sequence. If so, then perhaps a single regular expression could be used rather than holding 5 GB of patterns in memory. If not a regular expression, then possibly code that determines whether the number on the line falls within the base + the sequence -- an arithmetic operation, which might be faster than string comparisons (for example, some mainframes & supercomputers had multiple units for arithmetic).
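
An untested sketch of that arithmetic idea; the bounds here are just the first and last numbers from your sample, so adjust them to the real range, and it assumes the number always follows "Card :":

awk -v lo=372846078543002 -v hi=372846078543006 '
    match($0, /Card :[0-9]+/) {
        n = substr($0, RSTART + 6, RLENGTH - 6) + 0   # extract the card number as a number
        if (n >= lo + 0 && n <= hi + 0) print         # one range test per line instead of millions of string compares
    }' file2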

Thirdly, if you have sufficient IO throughput as well as multiple cores, then one could write a program that internally divides the main file into pieces by keeping track of start-stop line positions, and then uses processes or threads to process one segment each. A less elegant solution along the same lines would be to split the files into n sections, each in a file, and then run n instances of grep.
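
A rough, untested version of the split-and-run-n-greps variant, using GNU split and four pieces (the piece count is arbitrary):

split -n l/4 file2 file2.part.            # GNU split: 4 pieces without breaking lines
for p in file2.part.*
do
    grep -F -f file1 "$p" > "$p.out" &    # -F: treat the patterns as fixed strings
done
wait
cat file2.part.*.out > matches.out

Each grep instance still has to hold all of file1's patterns, so memory may end up being the limit rather than CPU.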

Fourthly, the task could be split up among a network of machines that share the disk; and there is also the easiest (but not cheapest) solution: get a faster box.

Best wishes ... cheers, drl

Assuming both files are sorted, maybe you can use "join".
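
Untested sketch of that, assuming every relevant line in file2 carries a "Card :<number>" field that can be pulled out as the join key:

sort file1 > file1.sorted
sed 's/^.*Card :\([0-9][0-9]*\).*$/\1 &/' file2 | sort -k1,1 > file2.keyed
join file1.sorted file2.keyed > matches.out

(join will squeeze the original runs of blanks in its output, and sorting 15 GB is not free either.)
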
If all the 300 million numbers in file1 start with 372846 (if not, then maybe multiple passes), then you can treat them as integers (minus the prefix). This way you can store them in a bitmap and look up the numbers from file2 in it (checking the prefix separately first). The first chapter of Jon Bentley's book "Programming Pearls" talks about exactly this problem.
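
For what it's worth, an untested gawk sketch of that prefix-stripping/bitmap idea (gawk's and/or/lshift functions and the fixed 372846 prefix are the assumptions here):

gawk 'NR == FNR {                                  # file1: set one bit per number
          n = substr($1, 7) + 0                    # strip the common 372846 prefix
          bits[int(n / 32)] = or(bits[int(n / 32)], lshift(1, n % 32))
          next
      }
      match($0, /Card :[0-9]+/) {                  # file2: test the bit for each Card value
          c = substr($0, RSTART + 6, RLENGTH - 6)
          if (substr(c, 1, 6) != "372846") next    # check the prefix separately
          n = substr(c, 7) + 0
          if (and(bits[int(n / 32)], lshift(1, n % 32))) print
      }' file1 file2

An awk array still carries per-element overhead, so a small C program with a real bit array, as in Bentley's chapter, would be tighter still.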

"fgrep -f file1 file2 " worked for me

Thanks,
Niladri