counting the number of occurrences

johjoh · April 28, 2009, 12:45am

Say I've got a text file with more than 10 million sequences:

ssss
ssss
tttttt
uuuuuu
uuuuuu
uuuuuu
...

I'd like to convert the file so that the output reports the number of occurrences right next to each sequence:

2 ssss
2 ssss
1 tttttt
3 uuuuuu
3 uuuuuu
3 uuuuuu
....

Is there an easy way to do this? There are 10 million lines, so I can't really use loops.

thanks!

amitranjansahu · April 28, 2009, 1:10am

use uniq

uniq -c filename

johjoh · April 28, 2009, 1:20am

thanks for your reply.

however, that will result in:

2 ssss
1 tttttt
3 uuuuuu

what i'm doing so far is:
uniq -c, and then something like awk '{for(j=1;j<=$1;j++){print $1, $2} }' file.

however, this takes forever to finish on a large file. is there a better way to do this?

thanks

summer_cherry · April 28, 2009, 6:14am

Not sure whether the following helps:

sort a.txt | uniq -c | awk '{
        # $1 is the count from uniq -c; repeat the counted line that many times
        for (i = 1; i <= $1; i++)
                print $0
        }'
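
Another option that might be worth trying is to read the file twice with awk instead of expanding the uniq -c output: the first pass counts every distinct line, the second prints the count next to each line in its original order. This is only a sketch, assuming the set of distinct sequences fits in memory; seqs.txt and counted.txt are placeholder names, not anything from the posts above.

```shell
# Small sample input; seqs.txt is a placeholder name for the real file.
printf '%s\n' ssss ssss tttttt uuuuuu uuuuuu uuuuuu > seqs.txt

# The file is given twice on purpose.
# Pass 1 (NR == FNR is true only while reading the first copy): count each line.
# Pass 2: print the stored count followed by the line itself.
awk 'NR == FNR { count[$0]++; next } { print count[$0], $0 }' seqs.txt seqs.txt > counted.txt

cat counted.txt
```

No sort is needed here, so the input order is kept and duplicates don't have to be adjacent. The cost is one array entry per distinct sequence.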