Hello,
I have a file that contains more than 300K records (integer numbers). I need to take these records and then search for them in 40 files, each of which has more than 1.8 million records.
I wrote a script, but it is very slow and takes a lot of time. I have tried splitting my 300K records into 6 files of 50K records each and then running parallel copies of the script to make the search faster, but searching across the 40 files is still slow.
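For reference, this is roughly how I did the split and launched the parallel copies (a sketch; search.sh is just a placeholder name for the script shown below):

# split t1 into 50K-line chunks with a t1_chunk_ prefix,
# then run one copy of the script per chunk in the background
split -l 50000 t1 t1_chunk_
for chunk in t1_chunk_*
do
    ./search.sh "$chunk" &   # search.sh is a placeholder for my script below
done
wait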
Here is what my script looks like:
t1 --> file containing more than 300K records, i.e. integer numbers
head t1
3028797272
3028797391
3028797459
3028797826
3028797879
t2* --> more than 40 files, each with more than 1.8M records.
head t2_DAILY_SDP59.DUMP_subscriber.v3.csv
3048924971,3048924971,0,0,,0,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.000000,1,2014-09-14,2012-09-29,,2014-09-14,0,0,2012-10-09,1,1,,0,0,0,0,0,0,0,0,0,0,0,2012-09-14
3069660757,3069660757,0,0,,1,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.900000,1,2015-04-19,2015-04-19,,2015-04-19,0,0,2015-04-19,0,0,,0,0,0,0,0,0,0,0,0,0,0,2012-10-24
3038103705,3038103705,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,26,26,,9.870000,1,2015-04-25,2015-04-25,,2015-04-25,0,0,2015-04-25,0,0,,0,0,0,0,0,0,0,0,0,0,0,2008-02-25
3038902927,3038902927,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,13,13,,6.460000,1,2015-01-14,2015-01-14,,2015-01-14,0,0,2015-01-14,0,0,,0,0,0,0,0,0,0,0,0,0,0,2011-09-04
#!/bin/bash
# For each number in t1, scan the t2* dumps; log field 1 and field 18
# of every matching CSV row, and move on to the next number as soon as
# the current one has been found.
cat /dump/20130426_DAILY_SDP/parallelProcessing/t1* | while read -r a1
do
    for b1 in /dump/20130426_DAILY_SDP/parallelProcessing/t2*
    do
        grep "$a1" "$b1" | nawk -F "," '{print $1 " " $18}' >> /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log
        if grep "$a1" /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log > /dev/null; then
            break
        fi
    done
done
I have also tried the PPSS script, but it is not working properly.
Is there a way to use grep more efficiently, so that the search runs faster? The system I am running my script on is Solaris.
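Would something like this one-pass nawk lookup be the right direction? (A sketch I put together; it assumes the t1 numbers match field 1 of the CSVs exactly.)

#!/bin/sh
# Load all the t1 numbers into a nawk array once, then stream each dump
# file a single time, printing field 1 and field 18 of matching rows.
for f in /dump/20130426_DAILY_SDP/parallelProcessing/t2*
do
    nawk -F "," 'NR==FNR { want[$1]; next } $1 in want { print $1 " " $18 }' \
        /dump/20130426_DAILY_SDP/parallelProcessing/t1 "$f"
done >> /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log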
Any help would be much appreciated.
Thanks!!
Regards,
Umar