How to make a quick search through a script?

Hello,

I have a file with more than 300K records (each record is an integer number). I need to take these records and then search for them in 40 files, where each file has more than 1.8 million records.

I wrote a script, but it's very slow and takes a lot of time. I tried splitting my 300k records into 6 files of 50k records each and then ran parallel scripts to make the search faster, but searching across the 40 files is still slow.

Here is what my script looks like:

t1 --> File containing more than 300k records, i.e. integer numbers

head t1
3028797272
3028797391
3028797459
3028797826
3028797879

t2* --> more than 40 files, each with more than 1.8M records.

head t2_DAILY_SDP59.DUMP_subscriber.v3.csv
3048924971,3048924971,0,0,,0,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.000000,1,2014-09-14,2012-09-29,,2014-09-14,0,0,2012-10-09,1,1,,0,0,0,0,0,0,0,0,0,0,0,2012-09-14
3069660757,3069660757,0,0,,1,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.900000,1,2015-04-19,2015-04-19,,2015-04-19,0,0,2015-04-19,0,0,,0,0,0,0,0,0,0,0,0,0,0,2012-10-24
3038103705,3038103705,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,26,26,,9.870000,1,2015-04-25,2015-04-25,,2015-04-25,0,0,2015-04-25,0,0,,0,0,0,0,0,0,0,0,0,0,0,2008-02-25
3038902927,3038902927,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,13,13,,6.460000,1,2015-01-14,2015-01-14,,2015-01-14,0,0,2015-01-14,0,0,,0,0,0,0,0,0,0,0,0,0,0,2011-09-04

#!/bin/bash
# For each number in the t1* files, grep every t2* file for it and append
# columns 1 and 18 of any matching rows to out_t1.log; once the number
# shows up in out_t1.log, stop searching the remaining t2* files for it.
for a1 in `cat /dump/20130426_DAILY_SDP/parallelProcessing/t1*`
do
    for b1 in `ls /dump/20130426_DAILY_SDP/parallelProcessing/t2*`
    do
        cat $b1 | grep "$a1" | nawk -F "," '{print $1 " " $18}' >> /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log
        grep "$a1" /dump/20130426_DAILY_SDP/parallelProcessing/out_t1.log
        if [ "$?" -eq "0" ]; then
            break 1
        fi
    done
done

I have tried the PPSS script, but it's not working properly.

Is there a way to use grep efficiently, so that the search runs faster? The system I am running my script on is Solaris.
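
(In case it clarifies what I mean by using grep efficiently: I was wondering about something along the lines of a single fixed-string pass with a pattern file, as in the rough sketch below. This is untested, and like the loop above it would match a number anywhere on a line, not just in column 1.)

cd /dump/20130426_DAILY_SDP/parallelProcessing
# rough, untested sketch: use t1 as an fgrep pattern file and make one pass over all t2* files
cat t2* | fgrep -f t1 | nawk -F "," '{print $1 " " $18}' > out_t1.log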

Any help would be much appreciated.

Thanks!!

Regards,
Umar

That's a pretty inefficient script!
People would be able to help you better if you posted a sample of the file t1 and of one of the t2* files.

Here you go:

t1 --> File containing more than 300k records, i.e. integer numbers

head t1
3028797272
3028797391
3028797459
3028797826
3028797879

t2* --> more than 40 files, each with more than 1.8M records.

head t2_DAILY_SDP59.DUMP_subscriber.v3.csv
3048924971,3048924971,0,0,,0,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.000000,1,2014-09-14,2012-09-29,,2014-09-14,0,0,2012-10-09,1,1,,0,0,0,0,0,0,0,0,0,0,0,2012-09-14
3069660757,3069660757,0,0,,1,1,2,0,0,0,0,0,0,1,1,1,13,13,,0.900000,1,2015-04-19,2015-04-19,,2015-04-19,0,0,2015-04-19,0,0,,0,0,0,0,0,0,0,0,0,0,0,2012-10-24
3038103705,3038103705,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,26,26,,9.870000,1,2015-04-25,2015-04-25,,2015-04-25,0,0,2015-04-25,0,0,,0,0,0,0,0,0,0,0,0,0,0,2008-02-25
3038902927,3038902927,0,0,,1,1,2,1,0,0,0,0,0,1,1,1,13,13,,6.460000,1,2015-01-14,2015-01-14,,2015-01-14,0,0,2015-01-14,0,0,,0,0,0,0,0,0,0,0,0,0,0,2011-09-04

You should really go back and edit the post to include code tags. :)

Assuming that I got your requirement right, try:

cd /dump/20130426_DAILY_SDP/parallelProcessing/
awk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > out_t1.log
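
In case the one-liner is hard to read, the same logic spread out with comments is roughly:

awk '
FNR==NR { a[$1]; next }       # first operand (t1, default whitespace FS): store each number as an array key
$1 in a { print $1, $18 }     # remaining operands (t2*): if column 1 is a stored number, print columns 1 and 18
' t1 FS=, t2* > out_t1.log

The FS=, between the file names switches the field separator to a comma just before the t2* files are read, so t1 is split on whitespace and the CSV files on commas.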

My aim is to take the numbers from file t1 (which contains more than 300k records) and search for them in the 40 t2 files, each of which contains 1.8M records.

If a number from t1 is found in a t2 file, I take columns 1 and 18 of the matching row and save them in a separate file. How can I do this efficiently and quickly?
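
For example, if 3048924971 appeared in t1, the first sample row of t2_DAILY_SDP59.DUMP_subscriber.v3.csv above would produce this output line (column 1 and column 18):

3048924971 13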

elixir_sinari, thanks for your response. Your script gives the following error upon execution:

awk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > testunix.log
awk: syntax error near line 1
awk: bailing out near line 1
pwd
/dump/20130426_DAILY_SDP/parallelProcessing

Use nawk on Solaris.
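
That is, presumably the same command with nawk in place of awk:

nawk 'FNR==NR{a[$1];next}$1 in a{print $1,$18}' t1 FS=, t2* > out_t1.log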

Thanks a lot, elixir_sinari. Your script worked just great.

It's quite a bit faster than I expected.

Thanks!!