Count number of occurrences using awk

Hi Guys,

I have two files like the ones below:

file1
xx
yy


file2
b
yy
b2
xx
c1
yy
xx
yy

Now I want a way to count how often each string from file1 occurs in file2, so the output would look something like this:

xx-2
yy-3

I know this is possible using awk or grep -c, but I am not able to get the desired output.

Any suggestion would be really appreciated.

Something which works on your sample data:

awk 'FNR==NR{c[$1]=0;next}$1 in c{++c[$1]}END{for(i in c) print i"-"c[i]}' file1 file2
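For anyone reading along, here is roughly how that one-liner works, spelled out as a commented sketch on the sample data from the question. The array name `count`, the temporary directory, and the trailing `sort` are my own additions for illustration. The order of `for (k in ...)` is unspecified in awk, so the sort only makes the output predictable. Note that the END block has to print the element `count[k]`, not the bare array name.

```shell
# Sample data from the question, written to a scratch directory (assumed layout).
tmp=$(mktemp -d)
printf '%s\n' xx yy > "$tmp/file1"
printf '%s\n' b yy b2 xx c1 yy xx yy > "$tmp/file2"

result=$(awk '
    # While reading the first file (FNR == NR), register each key with a count of 0.
    FNR == NR { count[$1] = 0; next }

    # While reading the second file, bump the counter only for registered keys.
    $1 in count { count[$1]++ }

    # Print every key with its count, indexing the array with the loop variable.
    END { for (k in count) print k "-" count[k] }
' "$tmp/file1" "$tmp/file2" | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Keys from file1 that never show up in file2 are printed with a count of 0 in this version.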

I forgot to mention that file2 is quite big, in the GB range. Would it be a good idea to split it into smaller pieces first and process all of them in parallel to get results faster?

The size of file2 doesn't matter much. How big is file1?

file2 will be 120 million records
file1 will be 11 million records

Why don't you try the command/script and then worry about the size?

As you can see from the file sizes, if I run the script without splitting them it might take hours to complete, I think. I want the result within 15-30 minutes at most.

Please try it and then let us know about problems, if any.
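If a single pass does turn out to be too slow, the splitting idea from the question could be sketched roughly as below. This is only an illustration on the toy data: the chunk size of 3 lines, the `chunk.` prefix, the `.cnt` suffix and the scratch directory are all made-up choices, and every chunk gets its own background job with no limit on how many run at once. Keep in mind that each awk process loads all of file1 into its own array, so memory use grows with the number of parallel jobs.

```shell
tmp=$(mktemp -d)
printf '%s\n' xx yy > "$tmp/file1"
printf '%s\n' b yy b2 xx c1 yy xx yy > "$tmp/file2"

# Split file2 into pieces of 3 lines each (toy size, pick millions for real data).
split -l 3 "$tmp/file2" "$tmp/chunk."

# Count every piece in its own background job; each job writes "key count" lines.
for chunk in "$tmp"/chunk.*; do
    awk 'FNR == NR { want[$1]; next }
         $1 in want { n[$1]++ }
         END { for (k in n) print k, n[k] }' "$tmp/file1" "$chunk" > "$chunk.cnt" &
done
wait    # block until all background jobs have finished

# Add up the partial counts from all pieces.
result=$(awk '{ total[$1] += $2 }
              END { for (k in total) print k "-" total[k] }' "$tmp"/chunk.*.cnt | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the single-pass command, this variant only reports keys that occur at least once, because the per-chunk jobs print nothing for keys they never saw.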

This will work for the small files you presented above:

$ sort file2 | uniq -c | grep -f file1
      2 xx
      3 yy

On the other hand, grepping umpteen million lines with 11 million fixed strings will be seriously demanding, if doable at all.
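One way to keep the sort | uniq -c idea without handing grep millions of patterns might be to sort both sides and let join do the matching. This is a sketch on the toy data, not something tested at the sizes discussed here. The intermediate file names are invented, and LC_ALL=C is set on the assumption that sort and join must agree on the collation order.

```shell
tmp=$(mktemp -d)
printf '%s\n' xx yy > "$tmp/file1"
printf '%s\n' b yy b2 xx c1 yy xx yy > "$tmp/file2"

# sort and join have to use the same ordering, so pin the locale.
export LC_ALL=C

# Sorted, de-duplicated list of the keys we care about.
sort -u "$tmp/file1" > "$tmp/keys.sorted"

# Count every distinct line of file2, then swap columns to "key count".
sort "$tmp/file2" | uniq -c | awk '{ print $2, $1 }' > "$tmp/counts.sorted"

# Keep only the keys present in both files and format as key-count.
result=$(join "$tmp/keys.sorted" "$tmp/counts.sorted" | awk '{ print $1 "-" $2 }')

printf '%s\n' "$result"
rm -rf "$tmp"
```

Sorting a file of that size is still real work, but sort spills to temporary files instead of holding everything in memory, which is why this may scale differently from the awk array approach.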

Hey,

I think it didn't work for large files. Here are the specifications:

file1.txt
bash-3.00# cat file1.txt |wc -l
 17102666

more file1.txt
123advertise3
123advertise4
123advertise5
123advertiseb
123advertisec
123advertised
123advertisedebtconsolidation
123advertisee
123advertisef
123advertiseg
123advertiseh
123advertisehomaxproducts

file2.txt
bash-3.00# cat file2.txt | wc -l
 113842500


more file2.txt
123123apartment
123123attorney
123123auction
123123auto
123advertisedebtconsolidation
123advertiseb
123123automate
123123automatic
123123bank
123advertisedebtconsolidation
123advertiseb
123123banking
123123bankruptcy
123advertisedebtconsolidation
123123bargain
123123best
123123blog
123advertisedebtconsolidation
123123building

I ran the command below, as described above:

bash-3.00# nawk 'FNR==NR{c[$1];next}$1 in c{++c[$1]}END{for(i in c) print i,c}' file1.txt file2.txt

but I got only 36000 lines, in the format shown below. However, I wanted output like word: <number of occurrences>

peaktablethomecsuchico
browsepropertyhomebase
clickflowershomedsn
worldwideflowerstravelagency
acepigb
acepigc
browsecompanytravelagent
liveearnhomedownpaymentassistance
acepigd
bargainsystemhomebvcure
acepige
acepigf
uniquecasinohomecycling
alternativeanyhomecanningrecipes
acepigj
annualsurveyhomedma

Can somebody please help me with the larger files?

---------- Post updated at 05:06 AM ---------- Previous update was at 12:27 AM ----------

Please help me. I am stuck.
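One thing that stands out in the command as run: the END block prints `c`, the array name, rather than `c[i]`, the element for that word, which would explain words appearing without counts. I can't tell from the post why only 36000 lines came out. A hedged sketch of the word: count format on a few lines from the posted samples follows. The shortened sample, the array name `seen` and the `> 0` filter are my assumptions, and plain `awk` is used here where the post uses `nawk`.

```shell
tmp=$(mktemp -d)
# A few keys and records taken from the samples posted above.
printf '%s\n' 123advertise3 123advertiseb 123advertisedebtconsolidation > "$tmp/file1.txt"
printf '%s\n' 123123apartment 123advertisedebtconsolidation 123advertiseb \
    123123bank 123advertisedebtconsolidation 123advertiseb \
    123advertisedebtconsolidation 123123best 123advertisedebtconsolidation > "$tmp/file2.txt"

result=$(awk '
    FNR == NR { seen[$1] = 0; next }      # keys from file1.txt start at 0
    $1 in seen { seen[$1]++ }             # count matching records of file2.txt
    END {
        for (w in seen)
            if (seen[w] > 0)              # drop this test to list zero counts too
                print w ": " seen[w]      # index the array with the loop variable
    }
' "$tmp/file1.txt" "$tmp/file2.txt" | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With 17 million keys the array in this approach needs a fair amount of memory, so it is worth watching the process while it runs.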