Recalculating frequencies

Xterra · June 29, 2010, 4:38pm

My file looks like this

The first 2 sequences are identical (though their IDs and frequencies differ), and the same goes for the last 2. I need to compare all sequences within the file; identical ones should be 'compressed' into one entry, with the frequency recalculated as the sum of the originals. I would then end up with the following file:

Any help will be greatly appreciated.

bartus11 · June 29, 2010, 4:54pm

Try that:

awk -v RS=">" 'length($0)>0{a[$4]+=$3;b[$4]=$1}END{for (i in a) printf ">%s Freq %s\n%s\n", b[i], a[i], i}' file
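Here is the same idea spelled out on a small made-up input, as a sketch only. The record layout (a `>ID Freq N` header followed by the sequence on the next line) is inferred from the field numbers used in the one-liner; the file name `sample.fa` and the IDs, counts and sequences in it are hypothetical.

```shell
# Hypothetical test data: header ">ID Freq N", sequence on the following line.
cat > sample.fa <<'EOF'
>seq1 Freq 4
ACGTACGT
>seq2 Freq 9
ACGTACGT
>seq3 Freq 2
TTGGCCAA
EOF

# RS=">" turns every ">..." block into one record. With the default field
# separator, newlines also split fields, so within a record:
#   $1 = ID, $2 = "Freq", $3 = count, $4 = sequence
awk -v RS=">" '
  length($0) > 0 {        # skip the empty record before the first ">"
    freq[$4] += $3        # sum the counts of identical sequences
    id[$4] = $1           # keep the last ID seen for that sequence
  }
  END {
    for (s in freq) printf ">%s Freq %d\n%s\n", id[s], freq[s], s
  }
' sample.fa > merged.fa

cat merged.fa
```

With this input the two `ACGTACGT` records should collapse into a single entry with a frequency of 13, under the ID of whichever of them came last in the file.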

Xterra · June 29, 2010, 5:14pm

The last two sequences were not combined into one.
This is what I get

Note that the highlighted sequences are identical (character by character, not only in length), yet they were not compressed into 1 entry with a frequency of 13.

bartus11 · June 29, 2010, 5:22pm

That is weird, because I just tried it on your test data and it did combine those lines. Keep in mind that this command outputs the records in random order. Also double-check that you copied the code properly.
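If the order matters, one possible variant (a sketch, not something tested on the real file) remembers each sequence the first time it appears and prints in that order. The `order` array, the counter `n`, and the file names `order.fa` and `ordered.fa` are names made up for this illustration, and the input is hypothetical.

```shell
# Hypothetical input: the 1st and 3rd records share the same sequence.
printf '>seq1 Freq 4\nACGTACGT\n>seq2 Freq 2\nTTGGCCAA\n>seq3 Freq 9\nACGTACGT\n' > order.fa

awk -v RS=">" '
  length($0) > 0 {
    if (!($4 in freq)) order[++n] = $4   # record first appearance of a sequence
    freq[$4] += $3                       # sum the counts
    id[$4] = $1                          # keep the last ID seen
  }
  END {
    # Walk the sequences in first-seen order instead of "for (s in freq)".
    for (k = 1; k <= n; k++)
      printf ">%s Freq %d\n%s\n", id[order[k]], freq[order[k]], order[k]
  }
' order.fa > ordered.fa

cat ordered.fa
```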

Xterra · June 29, 2010, 5:29pm

I tried one more time and it did not combine the last 2. The order is random, but I can still see those 2 sequences. Instead of ending up with 5 different sequences, my file contains 6. I have modified the test data and it is definitely not working: I added 1 more sequence (freq 10), identical to the first 2, at the very end of the file, and it was not combined with the other 2.

bartus11 · June 29, 2010, 5:33pm

Try this to check if 5 or 6 sequences are printed:

awk '!/^>/{a[$0]++}END{for (i in a) print i}' file

Xterra · June 29, 2010, 5:40pm

The output file contains 6 sequences (the 2nd and 3rd are identical).

bartus11 · June 29, 2010, 5:50pm

Maybe one of those lines contains a space at the end? Or some other nonprintable character? You should probably examine this file with a hex editor (or with vim).
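A quick way to look for such characters from the shell is to dump the sequence lines byte by byte, then strip trailing whitespace before using a line as the array key. This is only a sketch: the file `crlf.fa` below is made-up data with a carriage return planted at the end of one line, and whether the real file contains anything like that is a guess.

```shell
# Hypothetical data: the second sequence line ends in a carriage return (\r).
printf '>seq1 Freq 4\nACGTACGT\n>seq2 Freq 9\nACGTACGT\r\n' > crlf.fa

# od -c prints every byte, so a trailing \r, tab or space becomes visible.
grep -v '^>' crlf.fa | od -c

# Remove trailing spaces, tabs and carriage returns from each sequence line
# before counting it, so visually identical lines share the same key.
awk '!/^>/ { sub(/[ \t\r]+$/, ""); seen[$0]++ }
     END   { for (s in seen) print s }' crlf.fa > unique.txt

cat unique.txt
```

Without the `sub()` call the two lines would be counted as different sequences, because one key would carry the extra `\r` byte.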

Xterra · June 30, 2010, 1:19pm

There is something weird here that I just cannot figure out.

---------- Post updated at 11:48 PM ---------- Previous update was at 08:26 PM ----------

It works great on Cygwin but not on Red Hat Linux.

---------- Post updated 06-30-10 at 01:19 PM ---------- Previous update was 06-29-10 at 11:48 PM ----------

I have fixed the problem!