Processing too slow with loop

I have two files.

file 1 contains:

ALINE
ALINE BANG
B ON A
B.B.V.A.
BANG AMER CORG
BANG ON MORENA
BANG ON MORENAIC
BANG ON MORENAICA
BANG ON MORENAICA CORP
BANG ON MORENAICA N.A

file 2 contains the following, with fields separated by the ^ delimiter:

NATIO MARKET^345432534
+ COLUMBUS DISCOVERY in MORENAGO VESPUSSI^999921342
Gadappa'S F315^6716158190
+ SPEEDWAY 0533242 2332492 SPEEDWAY 0534234 352 KETNG CHQ24324435^9392493223
VILA ALINE VILLA ARR 24311605 9900961622^93294932
CHECK # 2193^99939249
online/phone xfr in fr acc 06500518267 date: 04-22-16 time: 11:14:32^45345334
mastermon bang on morena pa cucin new york ny xxxxxxxxxxxx0177^1232131
network printed workign  jean pual dum ave long beac ny xxxxxxxxxxxx0177^1232131
master Bangalore petrol bunk metro 070-mt. v washingto dc xxxxxxxxxxxx0177^1232131

I want each string from file1 (which has a limited number of rows) to be matched in file2 (which has millions of rows) and give me o/p with the count.

I tried the below code, but it takes a lot of time and is not giving proper values in the o/p:

file="/opt/sdp/.nikhil/PWD/beta.txt"
while read -r line; do
    count=`grep -wi $line /opt/sdp/.nikhil/PWD/alpha.txt | wc -l`
    echo $line "|" $count >> opfile.txt
done < "$file"

The o/p I'm getting is incorrect: file2 only contains ALINE, yet the count comes out as 1 even for ALINE BANG, as shown below. The same thing happens with the BANG ON MORENA variants.

ALINE | 1
ALINE BANG | 1
B ON A | 0
B.B.V.A. | 0
BANG AMER CORG | 1
BANG ON MORENA | 1
BANG ON MORENAIC | 1
BANG ON MORENAICA | 1
BANG ON MORENAICA CORP | 1
BANG ON MORENAICA N.A | 1

The following fixes a few issues:

file=beta.txt
while read -r line
do
   # -w: whole-word match, -i: ignore case, -c: print count of matching lines
   count=$(grep -wic "$line" alpha.txt)
   echo "$line | $count"
done < "$file" > opfile.txt
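
The important fix is quoting "$line". Unquoted, a multi-word pattern is word-split, so grep takes only the first word as the pattern and treats the remaining words as file names; it complains on stderr but still searches alpha.txt for that first word. That is presumably why ALINE BANG still got a count of 1:

line="ALINE BANG"
grep -wi $line alpha.txt      # actually runs: grep -wi ALINE BANG alpha.txt
# grep: BANG: No such file or directory
# alpha.txt:VILA ALINE VILLA ARR 24311605 9900961622^93294932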

It still does PARTIAL matching of ALL fields.
That means if "ALINE BANG" matches, "ALINE" matches also.
Restricting the search to a fixed field, to full-field matching, to case-sensitive matching, etc. would all help to make it faster.

How about

grep -oif file1 file2 | sort | uniq -c
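
Here -o prints each match on its own line, -i ignores case, and -f file1 reads the patterns from file1; sort | uniq -c then tallies the identical match strings. On the sample data the output would look something like this (counts illustrative; note that with -o and -i, different case variants of a match are counted separately):

grep -oif beta.txt alpha.txt | sort | uniq -c
#      1 ALINE
#      1 bang on morena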

Rudi,

Thanks for that; it works fine for smaller files, but with huge files, 5-6 GB in size, performance dips gradually.
Is there any alternate approach?

MadeinGermany -- Thanks :-)

Does performance actually get worse? Or does it just take 100,000x longer to process a 100,000x larger file? About how many matches are you expecting?

There are memory-heavy ways to do it faster, but they're not really applicable to massive files. You could try divide-and-conquer: Run as many simultaneously as your CPU and disks can easily handle, sort their output individually, then merge them in one final step.
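
A sketch of that divide-and-conquer idea, assuming GNU split (for -n l/4, which cuts on line boundaries) and the file names used earlier in the thread; the chunk count of 4 is arbitrary:

split -n l/4 alpha.txt chunk.                      # cut the big file into 4 line-aligned chunks
for c in chunk.??; do
    grep -oif beta.txt "$c" | sort > "$c.sorted" & # one grep per chunk, run in parallel
done
wait
sort -m chunk.??.sorted | uniq -c > opfile.txt     # merge the pre-sorted outputs, then count
rm -f chunk.?? chunk.??.sorted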

If the patterns are always fixed strings, using fgrep or grep -F may result in a HUGE performance boost.

If possible, run fgrep without -i; that'll get you another performance boost. Also put LANG=C before the fgrep command, which should speed things up a little too.
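
Put together with the earlier pipeline, that would look something like this (a sketch; -F makes grep take the patterns in file1 as literal strings):

LANG=C grep -oFf beta.txt alpha.txt | sort | uniq -c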

Sidenote

There was a scripting task request in the German Linux forum (www.linuxforen.de) here: Linuxforen.de Thread regarding fgrep

The task was similar. The big file had 5,000,000 lines (300 MB). The smaller file had 100,000 lines (3 MB). The results:

  • Winner, fgrep: 7 seconds
  • extremely optimized Lua script: 8.6 seconds
  • awk script: ~97 hours (obviously the great awk hackers here would get a whole lot more out of awk)
  • regular grep: stopped after 45 minutes of runtime and 12 GB of RAM usage

I think the situation is not so far from the one here. I suppose the smaller file here is a lot smaller, so the task will not be as CPU-intensive as that one, but this task has a lot more to read (5-6 GB, as nikhil said).

Stomp,

Thanks a lot for that, but this does not ignore case or do strict word checking, even with the "i" and "w" options used.
Maybe it's something to do with the "F" option; I suppose it overrides them.

Corona,

File 2 is around 6 GB and file 1 is around 2.4K.

Too bad. fgrep does not work with "-w"

Yep, got to know that through Google. Any other solution that would help boost performance?

It makes sense that "word regexp" and "fixed string" are mutually exclusive. I don't know what combining them would even mean.

Is the data file structured in any way? That could be useful.

You answered me, but didn't answer my question:

Please don't ask technical questions in private messages.

By 'structure of the file', I mean: is the data you're searching a specific column in a flat text file, or some such?
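
If it is, an exact lookup on just that field avoids scanning every line for every pattern. A rough awk sketch, assuming (hypothetically) that the searched string is always the whole first ^-delimited field and that an exact, case-insensitive match is acceptable:

awk 'NR==FNR { want[toupper($0)]; next }      # pass 1: load the patterns from file1
     {
        s = $0; sub(/\^.*$/, "", s)           # keep only the part before the first ^
        s = toupper(s)
        if (s in want) cnt[s]++               # exact hash lookup instead of a regex scan
     }
     END { for (s in cnt) print s " | " cnt[s] }' beta.txt alpha.txt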