Recalculating frequencies

My file looks like this

The first 2 sequences are identical (though their IDs and frequencies differ), and the same is true of the last 2. I need to compare all sequences in the file; identical ones should be 'compressed' into one entry, with the frequency recalculated as the sum of the originals. I should end up with the following file

Any help will be greatly appreciated.

Try this:

awk -v RS=">" 'length($0)>0{a[$4]+=$3;b[$4]=$1}END{for (i in a) printf ">"b[i]" Freq "a[i]"\n"i"\n"}' file
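For readability, here is the same idea spelled out over several lines. It is only a sketch: the sample data from the original post is not shown here, so the record layout (a `>ID Freq N` header followed by the sequence on the next line) is an assumption inferred from the fields the one-liner uses, and the function name `collapse` and the sample IDs are made up for illustration.

```shell
# Sketch: merge records with identical sequences and sum their frequencies.
# Assumed layout (hypothetical): ">ID Freq N" on one line, the sequence on the next.
collapse() {
  awk -v RS='>' '
    length($0) > 0 {      # skip the empty record before the first ">"
      freq[$4] += $3      # $4 = sequence, $3 = frequency
      id[$4] = $1         # keep the last ID seen for this sequence
    }
    END {
      # for (s in freq) visits the sequences in an unspecified order
      for (s in freq)
        printf ">%s Freq %d\n%s\n", id[s], freq[s], s
    }' "$@"
}

# Made-up sample: the two ACGT records should collapse into one with Freq 13.
printf '>seq1 Freq 5\nACGT\n>seq2 Freq 8\nACGT\n>seq3 Freq 2\nGGCC\n' | collapse
```

Using a format string with `%s` rather than concatenating the data into the `printf` format keeps a stray `%` in the input from being interpreted as a conversion.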

The last two sequences were not combined into one.
This is what I get

Note that the highlighted sequences are identical (character by character, not only in length), yet they were not compressed into a single entry with a frequency of 13.

That is weird, because I just tried it on your test data and it did combine those lines. Keep in mind that this command outputs the records in arbitrary order. Also double-check that you copied the code correctly.

I tried one more time and it did not combine the last 2. The order is random, but I can still see those 2 sequences. Instead of ending up with 5 different sequences, my file contains 6. I have modified the test data and it is definitely not working: I added 1 more sequence (freq 10), identical to the first 2, at the very end of the file, and it was not combined with the other 2.

Try this to check if 5 or 6 sequences are printed:

awk '!/^>/{a[$0]++}END{for (i in a) print i}' file

The output file contains 6 sequences (the 2nd and 3rd are identical).

Maybe one of those lines contains a space at the end, or some other nonprintable character? You should probably examine the file with a hex editor (or with vim).
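One way to follow this suggestion without a hex editor is sketched below. It assumes the sequence lines are the ones that do not start with `>`; the function names `show_hidden` and `count_unique` are made up for illustration, and nothing here claims to be the actual cause of the problem in this thread.

```shell
# Sketch: make invisible characters visible, then compare sequences after
# stripping them. Both functions read the named files, or stdin if none given.

# Dump the sequence lines with od -c: a trailing space appears as a blank
# column before \n, and a Windows line ending appears as \r before \n.
show_hidden() {
  grep -v '^>' "$@" | od -c
}

# Count distinct sequences after removing trailing spaces, tabs and
# carriage returns from each sequence line.
count_unique() {
  awk '!/^>/ {
         sub(/[ \t\r]+$/, "")
         if ($0 != "") seen[$0]++
       }
       END { n = 0; for (s in seen) n++; print n }' "$@"
}

# Made-up sample: three visually identical sequences, one with a trailing
# space and one with a carriage return.
printf '>a Freq 1\nACGT\n>b Freq 2\nACGT \n>c Freq 3\nACGT\r\n' | show_hidden
printf '>a Freq 1\nACGT\n>b Freq 2\nACGT \n>c Freq 3\nACGT\r\n' | count_unique
```

If `count_unique` reports fewer sequences than the earlier check did, trailing whitespace or carriage returns are what keep otherwise identical sequences apart.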

There is something weird that I just cannot figure out.

---------- Post updated at 11:48 PM ---------- Previous update was at 08:26 PM ----------

It works great on Cygwin but not on Red Hat Linux.

---------- Post updated 06-30-10 at 01:19 PM ---------- Previous update was 06-29-10 at 11:48 PM ----------

I have fixed the problem!