Finding unique entries without sorting

npatwardhan · February 2, 2010, 2:07pm

Hi Guys,

I have two files that I am using:

File1 is as follows:

 
wwe
khfgv
jfo
jhgfd
hoaha
hao
lkahe

This is like a master file which has entries in the order which I want.

File 2 looks like this:

wwe
khfgv
jfo
wwe
jhgfd
wwe
wwe
hoaha
hao
lkahe
hoaha
wwe
hao

So I want to parse the second file and count the occurrences of each entry in it. Then I want a third file which has the following (the number next to each entry is the number of times that entry occurs in the second file):

wwe  5
khfgv 1
jfo     1
jhgfd  1
hoaha 2
hao    2
lkahe  1

I tried sort | uniq -c, but that reorders the file. Is there an easier way to count unique entries without sorting the file?

Thanks in advance.

Franklin52 · February 2, 2010, 2:16pm

awk 'NR==FNR{a[$0]++;next}a[$0]{print $0, a[$0]}' file2 file1

EAGL · February 2, 2010, 2:33pm

Hi Franklin,

Can you please briefly explain how this part of your code works?

awk 'NR==FNR{a[$0]++;next}.....

thanks in advance

cmf1985 · February 2, 2010, 3:16pm

Or in a perhaps more familiar way, you could try something like this:

> file3.txt

while IFS= read -r line
do
        # -x counts whole-line matches only, -F treats the entry as a literal string
        occurrences=$(grep -c -x -F -e "$line" file2.txt)
        echo "$line $occurrences" >> file3.txt
done < file1.txt

Obviously not as concise as the awk version but maybe a little easier to understand if you're a beginner.
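
One thing to watch with a grep loop like this: a bare grep -c counts every line that contains the pattern, so an entry that is a substring of another entry gets overcounted. A small sketch of the difference, using made-up entries (ab, abc) and a scratch file that are not part of the original files:

```shell
# Scratch file with made-up entries; "ab" is a substring of "abc".
tmpdir=$(mktemp -d)
printf '%s\n' ab abc ab > "$tmpdir/sample"

# Plain grep -c counts every line containing "ab", including "abc".
substring_count=$(grep -c 'ab' "$tmpdir/sample")

# -x requires the whole line to match, -F treats the pattern as a literal string.
exact_count=$(grep -c -x -F -e 'ab' "$tmpdir/sample")

echo "substring: $substring_count, exact: $exact_count"
rm -rf "$tmpdir"
```

With the sample data in this thread the two counts happen to agree, but whole-line matching is the safer default for this kind of count.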

Franklin52 · February 2, 2010, 3:23pm

awk 'NR==FNR{a[$0]++;next}

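NR is the line number across all input files and FNR is the line number within the current file, so NR==FNR is true only while the first file argument (file2) is being read. For each of those lines we increment a[$0], and next skips the rest of the script. This is how it works (the number shown is the count after the increment):

line 1: a[wwe]++ == 1
line 2: a[khfgv]++ == 1
line 3: a[jfo]++ == 1
line 4: a[wwe]++ == 2
line 5: a[jhgfd]++ == 1
line 6: a[wwe]++ == 3
line 7: a[wwe]++ == 4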
.
.
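
For anyone who wants to try the whole thing, here is the same one-liner spread out with comments and run against the sample data from the first post. This is only a sketch: the scratch directory, the variable names (tmpdir, result) and the array name count are my own choices, not something from the posts above.

```shell
# Recreate the two files from the first post in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' wwe khfgv jfo jhgfd hoaha hao lkahe > "$tmpdir/file1"
printf '%s\n' wwe khfgv jfo wwe jhgfd wwe wwe hoaha hao lkahe hoaha wwe hao > "$tmpdir/file2"

result=$(awk '
    # NR counts lines across all input files, FNR restarts at 1 for each file,
    # so NR == FNR holds only while the first file argument (file2) is read.
    NR == FNR { count[$0]++; next }   # tally the line, then skip the second rule

    # Reached only for file1: print entries that were seen in file2,
    # in the order they appear in file1.
    count[$0] { print $0, count[$0] }
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The order of the arguments matters: file2 goes first so it is counted, and file1 goes second so its order drives the output. Note that if file2 were empty, NR==FNR would also be true for file1, so the trick assumes the first file has at least one line.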