Each row/line will have 33 letters, and each line will contain only multiple occurrences of letters from the pool ATGC (also lowercase atgc). Some may also have '-'. I would like to extract those lines (rows) that have non-homogeneous letters, or, if one or more letters differ from the rest, grab that entire column.
Each row/line will have 32 letters, and each line will contain only multiple occurrences of 2 or more letters from the pool ATGC (also lowercase atgc). Some may also have '-'. I would like to count the occurrences of each letter in a line and output the position number(s) of every counted letter.
How do I modify the above code so that it counts the occurrences of the letter that differs from the first letter in each line and outputs the position number(s) of that different letter only?
Desired output
CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC G 6
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAA T 5
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT G 2 15 16
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT C 15
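For readers following along, the rule implied by the desired output can be sketched in Python. This is only an illustration under stated assumptions, not the code referred to in the question: it assumes the comparison is case-insensitive (so `t` counts as `T`), that `-` is skipped, and that positions are 1-based. The function names `differing_positions` and `format_line` are made up for this sketch.

```python
def differing_positions(line):
    """Map each letter that differs from the first letter to its 1-based positions.

    Assumptions (inferred from the desired output, not stated outright):
    the comparison ignores case, and '-' is never reported.
    """
    seq = line.strip()
    if not seq:
        return {}
    first = seq[0].upper()
    found = {}
    for pos, ch in enumerate(seq, start=1):
        base = ch.upper()
        if base == "-" or base == first:
            continue  # skip gaps and letters matching the first one
        found.setdefault(base, []).append(pos)
    return found


def format_line(line):
    """Return the line followed by each differing letter and its positions."""
    seq = line.strip()
    parts = [seq]
    for letter, positions in differing_positions(seq).items():
        parts.append(letter)
        parts.extend(str(p) for p in positions)
    return " ".join(parts)


# Short made-up sample lines, not taken from real data.
for sample in ["CCCCCGCCcC", "TGTT-TGGtT"]:
    print(format_line(sample))
```

If a line contains more than one kind of differing letter, this sketch prints each letter followed by its own positions, in order of first appearance. A line that starts with `-` would need separate handling, since the sketch simply treats the first character as the reference.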
---------- Post updated at 08:20 PM ---------- Previous update was at 04:51 PM ----------
That was based on the smallest occurrence. But here I need the count of the letter that is different from the first one. The different letter will not necessarily have the smallest occurrence in each line.
But it should be obvious that the two problems are similar, and there is a good chance that the solutions might be similar.
I understand that you're working on genes and that computer science is not your area of expertise. But this forum is intended to help people learn how to effectively use Linux and UNIX systems, not to be an unpaid pool of programmers to write software for your genetic research projects. With well over 150 posts to these forums, our help in solving more than 75 issues you have raised, and watching what goes on in the UNIX & Linux Forums, we would expect that by now you have started to learn something from all of our help.
What have you tried to do to solve this problem on your own?