Another frustrating scripting problem from a biologist trying to manipulate a file with several million lines. For each line I need to compare the count of an uppercase base (A, C, G or T) with the count of the matching lowercase base (a, c, g or t). If there are more uppercase than lowercase, a + should be added in a new column; otherwise a - is added. Many of the lines are duplicated, triplicated, and so on. This is so that only one base is compared per line, in the order A, C, G, T. To make it even more complicated, the last line of each set of repeated lines should instead compare . with , where a + is added if there are more . than , (and a - otherwise).
Below are some examples of my data. The eight numeric columns are the counts of uppercase A, C, G, T followed by the counts of lowercase a, c, g, t.
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
And this is what I'd like to get:
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +
I've tried awk with if conditions, but I guess that approach is too simple. Any suggestions or help would be very much appreciated!
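For reference, the rule described above could be sketched roughly as follows. It is written in Python rather than awk, and it is only a sketch under several assumptions that are not stated in the data itself: repeated lines are always adjacent, each set contains exactly one line per base that has a non-zero count (upper or lower case) plus one final line for the . versus , comparison, a tie gives a -, and every . and , in the first column is counted as-is. The function name `annotate` is made up for illustration.

```python
from itertools import groupby

def annotate(lines):
    """Yield each line with a trailing '+' or '-' column.

    Assumptions (not confirmed by the original post):
      * identical lines are adjacent and form one set;
      * a set has one line per base with a non-zero count, in ACGT order,
        followed by one last line for the '.' versus ',' comparison;
      * ties produce '-'.
    """
    stripped = (line.rstrip("\n") for line in lines)
    for text, group in groupby(stripped):
        n = sum(1 for _ in group)          # how many times the line repeats
        fields = text.split()
        if len(fields) < 9:
            # Not a data line (e.g. blank): pass it through unchanged.
            for _ in range(n):
                yield text
            continue
        pileup = fields[0]
        counts = [int(x) for x in fields[1:9]]
        upper, lower = counts[:4], counts[4:]  # A C G T, then a c g t
        # Indices of the bases that actually occur, kept in ACGT order.
        present = [i for i in range(4) if upper[i] + lower[i] > 0]
        for k in range(n):
            if k == n - 1:
                # Last line of the set: compare '.' with ','.
                plus = pileup.count(".") > pileup.count(",")
            else:
                # k-th line of the set: compare the k-th base that occurs.
                b = present[k]
                plus = upper[b] > lower[b]
            yield f"{text} {'+' if plus else '-'}"

# Possible streaming usage, so the whole file is never held in memory
# (the file name is hypothetical):
#     with open("input.txt") as fh:
#         for out in annotate(fh):
#             print(out)
```

On the ten example lines above this gives + + + for the first set, - - - - for the second and + - + for the third, which matches the desired output. If a set has more lines than the assumptions allow, the sketch fails with an IndexError rather than guessing.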