How to make a quick search through a script?

Hello,

I have a file with more than 300K records (each record is an integer number). I need to take these records and then search for them in 40 files, where each file has more than 1.8 million records.

I wrote a script, but it's very slow and takes a lot of time. I tried splitting my 300k records into 6 files of 50k records each and then ran parallel scripts to make the search faster, but searching across the 40 files is still slow.

Here is what my script looks like:

t1 --> File containing more than 300k records, i.e. integer numbers

head t1
3028797272
3028797391
3028797459
3028797826
3028797879

t2* --> more than 40 files, each with more than 1.8M records.

head t2_DAILY_SDP59.DUMP_subscriber.v3.csv
3048924971,3048924971,0,0,,0,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.000000,1,2014-09-14,2012-09-29,,2014-09-14,0,0,2012-10-09,1,1,,0,0,0,0,0,0,0,0,0,0,0,2012-09-14
3069660757,3069660757,0,0,,1,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.900000,1,2015-04-19,2015-04-19,,2015-04-19,0,0,2015-04-19,0,0,,0,0,0,0,0,0,0,0,0,0,0,2012-10-24
3038103705,3038103705,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,26,26,,9.870000,1,2015-04-25,2015-04-25,,2015-04-25,0,0,2015-04-25,0,0,,0,0,0,0,0,0,0,0,0,0,0,2008-02-25
3038902927,3038902927,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,13,13,,6.460000,1,2015-01-14,2015-01-14,,2015-01-14,0,0,2015-01-14,0,0,,0,0,0,0,0,0,0,0,0,0,0,2011-09-04

#!/bin/bash
# For each number in the t1* files, grep every t2* file for it and append
# columns 1 and 18 of any matching rows to out_t1.log; once the number
# shows up in out_t1.log, stop searching the remaining t2* files for it.
for a1 in `cat /dump/20130426_DAILY_SDP/parallelProcessing/t1*`
do
    for b1 in `ls /dump/20130426_DAILY_SDP/parallelProcessing/t2*`
    do
        cat $b1 | grep "$a1" | nawk -F "," '{print $1 " " $18}' >> /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log
        grep "$a1" /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log
        if [ "$?" -eq "0" ]; then
            break 1
        fi
    done
done

I have tried the PPSS script, but it's not working properly.

Is there a way to use grep efficiently, so that the search runs faster? The system I am running my script on is Solaris.
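
(In case it clarifies what I mean by using grep efficiently: I was wondering about something along the lines of a single fixed-string pass with a pattern file, as in the rough sketch below. This is untested, and like the loop above it would match a number anywhere on a line, not just in column 1.)

cd /dump/20130426_DAILY_SDP/parallelProcessing
# rough, untested sketch: use t1 as an fgrep pattern file and make one pass over all t2* files
cat t2* | fgrep -f t1 | nawk -F "," '{print $1 " " $18}' > out_t1.log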

Any help would be much appreciated.

Thanks!!

Regards,
Umar

That's a pretty inefficient script!
People would be able to help you better if you posted a sample of the file t1 and of one of the t2* files.

Here you go:

t1 --> File containing more than 300k records, i.e. integer numbers

head t1
3028797272
3028797391
3028797459
3028797826
3028797879

t2* --> more than 40 files, each with more than 1.8M records.

head t2_DAILY_SDP59.DUMP_subscriber.v3.csv
3048924971,3048924971,0,0,,0,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.000000,1,2014-09-14,2012-09-29,,2014-09-14,0,0,2012-10-09,1,1,,0,0,0,0,0,0,0,0,0,0,0,2012-09-14
3069660757,3069660757,0,0,,1,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.900000,1,2015-04-19,2015-04-19,,2015-04-19,0,0,2015-04-19,0,0,,0,0,0,0,0,0,0,0,0,0,0,2012-10-24
3038103705,3038103705,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,26,26,,9.870000,1,2015-04-25,2015-04-25,,2015-04-25,0,0,2015-04-25,0,0,,0,0,0,0,0,0,0,0,0,0,0,2008-02-25
3038902927,3038902927,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,13,13,,6.460000,1,2015-01-14,2015-01-14,,2015-01-14,0,0,2015-01-14,0,0,,0,0,0,0,0,0,0,0,0,0,0,2011-09-04

You should really go back and edit the post to include code tags. :)

Assuming that I got your requirement right, try:

cd /dump/20130426_DAILY_SDP/parallelProcessing/
awk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > out_t1.log
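
In case the one-liner is hard to read, the same logic spread out with comments is roughly:

awk '
FNR==NR { a[$1]; next }       # first operand (t1, default whitespace FS): store each number as an array key
$1 in a { print $1, $18 }     # remaining operands (t2*): if column 1 is a stored number, print columns 1 and 18
' t1 FS=, t2* > out_t1.log

The FS=, between the file names switches the field separator to a comma just before the t2* files are read, so t1 is split on whitespace and the CSV files on commas.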

My aim is to take the numbers from file t1 (which contains more than 300k records) and search for them in the 40 t2 files, each of which contains 1.8M records.

If a number from t1 is found in a t2 file, I take columns 1 and 18 of the matching row and save them in a separate file. How can I do this efficiently and quickly?
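
For example, if 3048924971 appeared in t1, the first sample row of t2_DAILY_SDP59.DUMP_subscriber.v3.csv above would produce this output line (column 1 and column 18):

3048924971 13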

elixir_sinari, thanks for your response. Your script gives the following error upon execution:

awk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > testunix.log
awk: syntax error near line 1
awk: bailing out near line 1
pwd
/dump/20130426_DAILY_SDP/parallelProcessing

Use nawk on Solaris.
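
That is, presumably the same command with nawk in place of awk:

nawk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > out_t1.log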

Thanks a lot, elixir_sinari. Your script worked just great.

It's quite a bit faster than I expected.

Thanks!!