Speeding up shell script with grep

Hi guys, hoping someone can help.

I have two files, both containing UK phone numbers.

master is a file which has been collated over a few years and currently contains around 4 million numbers.

new is a file which also contains around 4 million numbers. I need to split new into two separate files: one containing numbers that already exist in master, and one containing numbers that don't. I can do this, but it takes around 80 hours to complete! Can anyone offer any suggestions on how to speed this up?

while read -r phone_number; do
        echo "checking master for phone number $phone_number"
        if grep "$phone_number" master.csv; then
                echo "$phone_number already exists in master file"
                echo "$phone_number" >> unusable_numbers
        else
                echo "$phone_number looks good we can use this saving to usable_numbers"
                echo "$phone_number" >> usable_numbers
        fi
done < new

Any help would be greatly appreciated.

Use grep -f: instead of writing a shell loop, grep can do it in one step (and presumably quite a bit faster):

grep -f new_phone master > in_new_phone_and_master

You might want to try this with some small files to get a feeling for what it produces. You can also use all the other grep options in conjunction with this, especially -F (use fixed strings for matching), which speeds up grep's operation considerably. Also notice the -v option, which inverts the outcome. See the man page of grep for details.
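For example, here is a minimal sketch applying those options to the file names from the question, under the assumption that master holds one bare number per line (if it is really a CSV, you would first extract the number column, e.g. with cut -d, -f1 master.csv):

grep -F -f master new > unusable_numbers      # numbers from new that also occur in master
grep -F -v -f master new > usable_numbers     # numbers from new that do not occur in master

Adding -x on top of that restricts grep to whole-line matches, so a short number can never match as a substring of a longer one.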

I hope this helps.

bakunin


In addition to what bakunin suggested, you might also consider the following to more closely match the output produced by your current script...

Making the wild assumptions that:

  1. master.csv is a character-separated values file with a comma as the field separator, and
  2. the field containing the phone number in master.csv is the 1st field

how long does the following script take:

awk -v fn=1 -F',' '
FNR == NR {
	p[$fn]
	next
}
{	print "checking master for phone number", $1
	if($1 in p) {
		print $1, "already exists in master file "
		print > "unusable_numbers"
	} else {
		print $1, "looks good we can use this saving to usable_numbers"
		print > "usable_numbers"
	}
}' master.csv new

to do the same job?

If the phone number in master.csv is not in the 1st field, change the value assigned to the fn variable from 1 to the number of the field containing the phone number.

If the field separators in master.csv are not commas, change the character in the -F option-argument to the desired character.
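As a concrete (and purely hypothetical) example: if the phone number sat in the 3rd field of a semicolon-separated master.csv, a stripped-down version of the same approach, without the progress messages, would be invoked like this:

awk -v fn=3 -F';' '
FNR == NR {	# first file (master.csv): remember every phone number seen
	p[$fn]
	next
}
{	# second file (new): route each number to the matching output file
	if($1 in p)
		print > "unusable_numbers"
	else
		print > "usable_numbers"
}' master.csv new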

EDIT: Sorry, forget my solution; there was a flaw in the logic of that script that meant it wasn't suitable for purpose, and it certainly wasn't any faster. Apologies!

Using grep -f works, but with two files as large as indicated it will take its (serious) time, and may eventually run out of memory. Try

sort master.csv new | uniq -d

then use the resultant file in a similar way (uniq -u) to extract the unique values from either original file.
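Spelled out for the file names from the question, a sketch (assuming both files hold one bare number per line and neither contains duplicates within itself, so that uniq -d really does mean "present in both files"):

sort master new | uniq -d > unusable_numbers            # numbers present in both files
sort new unusable_numbers | uniq -u > usable_numbers    # numbers present only in new

The second step works because every line of unusable_numbers is, by construction, also a line of new, so the lines that remain unique are exactly the ones that occur in new but not in master.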
Comparison of both approaches on two files of ~20k entries each:

time grep -f file2 file1
real    0m0.352s
user    0m0.280s
sys     0m0.052s
time sort file[12] | uniq -d
real    0m0.037s
user    0m0.032s
sys     0m0.004s

EDIT: Times spent for two files with roughly 4E6 entries each, and about 1E6 lines of overlap (on a two-processor Linux host):

time sort file[12] | uniq -d > f1
real    0m14.975s
user    0m27.048s
sys     0m0.792s
time sort file1 f1 | uniq -u > f2
real    0m9.331s
user    0m16.488s
sys     0m0.572s