johjoh
1
Say I've got a text file with more than 10 million sequences, one per line:
ssss
ssss
tttttt
uuuuuu
uuuuuu
uuuuuu
...
I'd like to convert the file so that each line of the output shows the number of occurrences next to its sequence, keeping every original line:
2 ssss
2 ssss
1 tttttt
3 uuuuuu
3 uuuuuu
3 uuuuuu
....
Is there an easy way to do this? There are 10 million lines, so a slow per-line loop isn't really practical.
Thanks!
johjoh
3
Thanks for your reply.
However, that collapses the duplicates and prints only one line per distinct sequence:
2 ssss
1 tttttt
3 uuuuuu
What I'm doing so far is running uniq -c on the file, and then something like this on its output:
awk '{for(j=1;j<=$1;j++){print $1, $2}}' file
However, this takes forever to finish on a large file. Is there a better way to do this?
Thanks
Not sure whether the following helps:
sort a.txt | uniq -c | awk '{
    for (i = 1; i <= $1; i++)
        print $0
}'
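
Another option that might be worth trying is to skip sort and uniq and let awk read the file twice: the first pass counts every line, the second pass prints each line with its count. This is only a sketch, assuming the set of distinct sequences fits in memory (awk keeps one array entry per distinct line). The function name annotate_counts is made up for illustration and is not from the posts above.

```shell
# Hypothetical helper: print "<count> <line>" for every line of the file in $1,
# in the original order. The file is passed to awk twice on purpose.
annotate_counts() {
    awk '
        # First pass (NR == FNR only while reading the first copy): count lines.
        NR == FNR { count[$0]++; next }
        # Second pass: print the stored count followed by the line itself.
        { print count[$0], $0 }
    ' "$1" "$1"
}
```

Usage would be something like annotate_counts a.txt > out.txt. Since nothing is sorted, the input does not have to be grouped, and the output keeps the input order.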