ALINE
ALINE BANG
B ON A
B.B.V.A.
BANG AMER CORG
BANG ON MORENA
BANG ON MORENAIC
BANG ON MORENAICA
BANG ON MORENAICA CORP
BANG ON MORENAICA N.A
File 2 contains the following and is separated by a ^ delimiter:
NATIO MARKET^345432534
+ COLUMBUS DISCOVERY in MORENAGO VESPUSSI^999921342
Gadappa'S F315^6716158190
+ SPEEDWAY 0533242 2332492 SPEEDWAY 0534234 352 KETNG CHQ24324435^9392493223
VILA ALINE VILLA ARR 24311605 9900961622^93294932
CHECK # 2193^99939249
online/phone xfr in fr acc 06500518267 date: 04-22-16 time: 11:14:32^45345334
mastermon bang on morena pa cucin new york ny xxxxxxxxxxxx0177^1232131
network printed workign jean pual dum ave long beac ny xxxxxxxxxxxx0177^1232131
master Bangalore petrol bunk metro 070-mt. v washingto dc xxxxxxxxxxxx0177^1232131
I want each string from file 1, which has a limited number of rows, to be matched against file 2, which has millions of rows, and I want the output to show each string with its count.
I tried the code below, but it takes a lot of time and does not give the proper values in the output:
file="/opt/sdp/.nikhil/PWD/beta.txt"
while read -r line; do
count=`grep -wi $line /opt/sdp/.nikhil/PWD/alpha.txt|wc -l`
echo $line "|" $count >> opfile.txt
done < "$file"
The output I'm getting is incorrect: file 2 only contains ALINE, yet the count is incremented to 1 even for ALINE BANG, as shown below, which is wrong. The same thing happens with BANG ON MORENA as well.
ALINE | 1
ALINE BANG | 1
B ON A | 0
B.B.V.A. | 0
BANG AMER CORG | 1
BANG ON MORENA | 1
BANG ON MORENAIC | 1
BANG ON MORENAICA | 1
BANG ON MORENAICA CORP | 1
BANG ON MORENAICA N.A | 1
file=beta.txt
while read -r line
do
  count=$(grep -wic "$line" alpha.txt)
  echo "$line | $count"
done < "$file" > opfile.txt
It still does PARTIAL matching of ALL fields.
That means if "ALINE BANG" matches, "ALINE" matches also.
If you restrict the search to a fixed field, to full-field matching, to case-sensitive matching, and so on, all of this can help to make it faster.
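For example, full-field matching can be done in a single pass with awk instead of one grep per pattern. This is only a sketch, reusing the beta.txt/alpha.txt names from the loop above, and it assumes a "match" means a pattern equals the entire first ^-delimited field of file 2 (compared case-insensitively here):

awk -F'^' '
NR == FNR { count[tolower($0)] = 0; next }             # first file: load the patterns
{ key = tolower($1); if (key in count) count[key]++ }  # big file: test the whole first field
END { for (p in count) print p " | " count[p] }
' beta.txt alpha.txt > opfile.txt

Note that this prints the patterns in arbitrary order and no longer counts substring hits, so ALINE would not match a field like VILA ALINE VILLA ARR ... any more.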
Thanks for that. It works fine for smaller files, but with huge files of around 5-6 GB the performance dips gradually.
Is there an alternate approach?
Does performance actually get worse? Or does it just take 100,000x longer to process a 100,000x larger file? About how many matches are you expecting?
There are memory-heavy ways to do it faster, but they're not really applicable to massive files. You could try divide-and-conquer: run as many grep jobs simultaneously as your CPU and disks can easily handle, sort their output individually, then merge them in one final step.
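A rough sketch of that idea, with placeholder names and an arbitrary chunk size (the number of chunks determines how many jobs run at once):

# split the big file, count every pattern in each chunk in parallel,
# then add up the per-chunk counts
split -l 1000000 alpha.txt chunk_
for c in chunk_*
do
  ( while read -r line
    do
      printf '%s|%s\n' "$line" "$(grep -wic -- "$line" "$c")"
    done < beta.txt > "$c.counts" ) &
done
wait
awk -F'|' '{ sum[$1] += $2 } END { for (p in sum) print p " | " sum[p] }' chunk_*.counts > opfile.txt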
If the patterns are always fixed strings, using fgrep or grep -F may result in a HUGE performance boost.
If possible, run fgrep without -i; that will get you another performance boost. Also put LANG=C before the fgrep command, which should speed things up a little too.
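A minimal sketch of those tweaks applied to the earlier loop (same beta.txt/alpha.txt names assumed; dropping -i makes the match case-sensitive, which may or may not be acceptable for this data):

while read -r line
do
  # -F: fixed-string match, no regex engine; LANG=C: avoid multibyte handling
  count=$(LANG=C grep -Fwc -- "$line" alpha.txt)
  printf '%s | %s\n' "$line" "$count"
done < beta.txt > opfile.txt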
The task was similar: the big file had 5,000,000 lines (300 MB) and the smaller file had 100,000 lines (3 MB). The results:
Winner, fgrep: 7 seconds
extremely optimized Lua script: 8.6 seconds
awk script: ~97 hours (obviously the great awk hackers here would get a whole lot more out of awk)
regular grep: stopped after 45 minutes of runtime and 12 GB of RAM usage
I think that situation is not far from the one here. I suppose the smaller file here is a lot smaller, so the task will not be as CPU-intensive as that one, but this task has a lot more to read (5-6 GB, as nikhil said).
Thanks a lot for that, but this does not ignore case or do strict word matching even when the "i" and "w" options are used.
Maybe it is something to do with the "F" option; I suppose it overrides them.
Corona,
File 2 is around 6 GB and file 1 is around 2.4K.