Count number of occurrences using awk

prashant2507198 · April 14, 2013, 9:41am

Hi Guys,

I have 2 files like below

file1
xx
yy


file2
b
yy
b2
xx
c1
yy
xx
yy

Now I want a way to count how often each string from file1 occurs in file2, so the output would look something like this:

xx-2
yy-3

I know this is possible using awk or grep -c, but I am not able to get the desired output.

Any suggestions would be really appreciated.

elixir_sinari · April 14, 2013, 9:46am

Something which works on your sample data:

awk 'FNR==NR{c[$1];next}$1 in c{++c[$1]}END{for(i in c) print i"-"c[i]}' file1 file2
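
For readability, the same idea can be laid out over several lines with comments. This is only a sketch, assuming one word per line in both files. The `+ 0` is an addition that is not in the one-liner above: it makes words from file1 that never occur in file2 print as 0 rather than with an empty count. The scratch directory and the `result` variable are illustrative names, and the final `sort` is there only because awk does not guarantee any order for `for (i in c)`.

```shell
# Recreate the sample data from the first post in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' xx yy > "$dir/file1"
printf '%s\n' b yy b2 xx c1 yy xx yy > "$dir/file2"

result=$(awk '
    # First file (FNR equals NR): register each word as an array key.
    FNR == NR { c[$1]; next }
    # Second file: count only the words that were registered.
    $1 in c   { ++c[$1] }
    # Print word-count pairs; adding 0 turns an unset count into 0.
    END       { for (i in c) print i "-" (c[i] + 0) }
' "$dir/file1" "$dir/file2" | sort)

printf '%s\n' "$result"
rm -rf "$dir"
```

On the sample data this prints `xx-2` and `yy-3`. Only the words from file1 are kept in memory, which is why the size of file1 matters more than the size of file2.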

prashant2507198 · April 14, 2013, 9:51am

I forgot to mention that file2 is quite big (several GB). Would it be a good idea to split it into smaller pieces first and run on all of them in parallel to get results faster?

elixir_sinari · April 14, 2013, 9:52am

The size of file2 doesn't matter much. How big is file1?

prashant2507198 · April 14, 2013, 9:59am

file 2 will be 120 million records
file 1 will be 11 million records

elixir_sinari · April 14, 2013, 10:02am

Why don't you try the command/script and then worry about the size?

prashant2507198 · April 14, 2013, 10:03am

As you can see from the size of the files, if I run the script without splitting them it might take hours to complete, I think. I want the result within 15-30 minutes at most.

elixir_sinari · April 14, 2013, 10:05am

Please try it and then let us know about problems, if any.

RudiC · April 14, 2013, 3:51pm

This will work for the small files you presented above:

$ sort file2 | uniq -c | grep -f file1
      2 xx
      3 yy

On the other hand, grepping umpteen million lines with 11 million fixed strings will be seriously demanding, if doable at all.
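
One way to avoid the huge pattern list is to sort both inputs and let `join` pair them up, since `join` reads two sorted files in a single pass. This is a sketch under assumptions, not a tested solution for the real data: it assumes one word per line with no embedded whitespace, and the file names `keys` and `counts` are made up for illustration. Unlike `grep -f`, it matches whole words only, so `xx` would not also pick up a line such as `xxy`.

```shell
# Recreate the sample data from the first post in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' xx yy > "$dir/file1"
printf '%s\n' b yy b2 xx c1 yy xx yy > "$dir/file2"

export LC_ALL=C   # sort and join must agree on the collation order

# Unique words of interest, sorted.
sort -u "$dir/file1" > "$dir/keys"

# Per-word counts for file2, rewritten as "word count" so the word is field 1.
sort "$dir/file2" | uniq -c | awk '{ print $2, $1 }' > "$dir/counts"

# Keep only the words present in both lists and print them as word-count.
result=$(join "$dir/keys" "$dir/counts" | awk '{ print $1 "-" $2 }')

printf '%s\n' "$result"
rm -rf "$dir"
```

Words from file1 that never occur in file2 are dropped by `join` here. Whether this finishes within the time wanted for files of this size would have to be measured.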

prashant2507198 · April 15, 2013, 6:06am

Hey,

I think it didn't work for large files. Here are the specifications:

file1.txt
bash-3.00# cat file1.txt |wc -l
 17102666

more file1.txt
123advertise3
123advertise4
123advertise5
123advertiseb
123advertisec
123advertised
123advertisedebtconsolidation
123advertisee
123advertisef
123advertiseg
123advertiseh
123advertisehomaxproducts

file2.txt
bash-3.00# cat file2.txt | wc -l
 113842500


more file2.txt
123123apartment
123123attorney
123123auction
123123auto
123advertisedebtconsolidation
123advertiseb
123123automate
123123automatic
123123bank
123advertisedebtconsolidation
123advertiseb
123123banking
123123bankruptcy
123advertisedebtconsolidation
123123bargain
123123best
123123blog
123advertisedebtconsolidation
123123building

I ran the command below, as described above:

bash-3.00# nawk 'FNR==NR{c[$1];next}$1 in c{++c[$1]}END{for(i in c) print i,c[i]}' file1.txt file2.txt

but I got only 36000 lines, in the format below. However, I wanted output like word: <number of occurrences>

peaktablethomecsuchico
browsepropertyhomebase
clickflowershomedsn
worldwideflowerstravelagency
acepigb
acepigc
browsecompanytravelagent
liveearnhomedownpaymentassistance
acepigd
bargainsystemhomebvcure
acepige
acepigf
uniquecasinohomecycling
alternativeanyhomecanningrecipes
acepigj
annualsurveyhomedma

Can somebody please help me with the larger files?

---------- Post updated at 05:06 AM ---------- Previous update was at 12:27 AM ----------

Please help me. I am stuck.