Uppercase/lowercase comparison of one character per line with awk?

Another frustrating scripting problem from a biologist trying to manipulate a file with several million lines. For each line I need to compare the count of an uppercase A, C, G or T with the count of the corresponding lowercase a, c, g or t. If there are more uppercase letters, a + should be added in a new column; otherwise a - is added. Many of the lines are duplicated, triplicated, and so on. This is to allow the comparison of only one character at a time, in the order A, C, G, T. And to make it even more complicated, the comparison on the last line of each set of repeated lines should be between . and , where a + should be added if there are more . than ,

Below are the examples of some of my data. The columns with numbers are the count of uppercase ACGT and lowercase acgt respectively.

.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0

And this is what I'd like to get:

.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0  +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

I've tried awk with if conditions but I guess it is too simple. Any suggestions or help will be very much appreciated!

If your problem description is correct, shouldn't the final three lines in your sample data end with --+ instead of +-+ ? The first two minus signs because lowercase outnumbers uppercase, and the final plus because it is the last of a series of dupes, which triggers the comma-dot comparison rule, and since there are more dots a plus should end it.

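Instead of:

.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

Should it not be: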

.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0  +

Or, perhaps I misunderstood.

Regards,
Alister

No. There is only one A, i.e. 1 in column 2 and 0 in column 5.

The order of comparison should be A followed by C followed by G and finally by T. If A is found in the first line, it should not be compared again in the next line and so on...

Thank you.

ivpz, perhaps this will do:

$ cat dna.awk 
{
    for (i=2; i<=5; i++) {
        if ($i || $(i+4)) {
            print $0, ($i>$(i+4) ? "+" : "-")
            getline
        }
    }
    print $0, (split($0, a, /\./) > split($0, a, /,/) ? "+" : "-")
}


$ awk -f dna.awk data
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +

Why column 5 ? If A, C, G, T are in columns 2, 3, 4, 5, then shouldn't "a" be in column 6 ?

Here's the line no. 8 (from the top) of your original post:

.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0

Now, aren't the fields supposed to mean the following ?

Column 1                            Column 2   Column 3   Column 4   Column 5   Column 6   Column 7   Column 8   Column 9
The data                            "A" count  "C" count  "G" count  "T" count  "a" count  "c" count  "g" count  "t" count
=================================   =========  =========  =========  =========  =========  =========  =========  =========
.....,,..,,...,,......,...cA.c,cC.  1          1          0          0          0          3          0          0

So, count of all uppercase characters (A, C, G, T) in columns 2, 3, 4, 5 respectively = 1 + 1 + 0 + 0 = 2

Count of all lowercase characters (a, c, g, t) in columns 6, 7, 8, 9 respectively = 0 + 3 + 0 + 0 = 3

Hence, shouldn't line 8 be followed by "-" because count of uppercase characters < count of lowercase characters ?

tyler_durden

Hey, durden_tyler:

From what I gathered, the columns mean what you think they mean, but comparison is only made between corresponding upper-lower case letters (A-a, C-c, G-g, T-t) wherein at least one member of the pair occurs in the line. Also, there are as many duplicates of each line as there are comparisons to be made.

Line 8 will have appended to it the result of comparing A-a (columns 2 and 6), a "+". Line 9 is C-c (columns 3 and 7), and gets "-". Line 10 is for the comma-dot comparison (in this case, a "+"). If there are no instances of either member of a pair, no comparison is made and no duplicate line appears for it.
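As a rough sketch of that reading of the rules (my own illustration, not a tested solution for the full file), the following lists the letter-pair comparisons that the duplicates of one line would receive, in order. The function name list_comparisons and the helper array name are made up for this example, and the final comma-dot line is left out.

```shell
# list_comparisons (hypothetical helper): for each input line, print the
# letter pairs that occur at least once and the sign their duplicate gets.
list_comparisons() {
    awk '{
        n = split("A C G T", name, " ")
        for (k = 1; k <= n; k++) {
            up  = $(k + 1)    # uppercase count: columns 2..5
            low = $(k + 5)    # lowercase count: columns 6..9
            if (up || low)    # skip pairs where both counts are zero
                print name[k] "-" tolower(name[k]), (up > low ? "+" : "-")
        }
    }'
}

printf '%s\n' '.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0' | list_comparisons
```

For the line discussed above this prints "A-a +" and then "C-c -", which matches the first two of its three duplicates.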

alister

Sorry, my mistake. Yes, it should be column 6.

Each Uppercase A should be compared with the lowercase a only; in essence:

compare col2 and col6; if col2>col6, add + else - to a new col. If both col2 and col6 are 0 then compare col3 and col7 ...

Alister, thanks for your help. Can I ask you, what is the function of a, in the last line of the script?

Also, when I ran this script, most of the comparisons seem to be correct, but I got a few which are obviously incorrect, and one extra line was added at the end:

,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 -

should be
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 -
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +

Another example:

...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 -
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +

should be:
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 -
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +

Although some of the lines contain characters other than the 4 ACGT, they should be ignored.

Hey, ivpz:

a is an array which holds the results of the split operation. It isn't used for anything except that split() requires a place to put the fields it creates. I'm only interested in the return value which indicates how many elements are in the array, which tells me if there are more commas or more periods.
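To illustrate with a made-up string (not taken from your file): split() returns the number of pieces, which is one more than the number of separators, so comparing the two return values amounts to comparing the dot count with the comma count. The function name count_pieces is only for this sketch.

```shell
# count_pieces (hypothetical helper): show what split() returns for dots
# and for commas, and the sign that results from comparing the two.
count_pieces() {
    awk '{
        dots   = split($0, a, /\./)   # 3 dots   -> 4 pieces
        commas = split($0, a, /,/)    # 2 commas -> 3 pieces
        print dots, commas, (dots > commas ? "+" : "-")
    }'
}

# Sample string with 3 dots and 2 commas.
printf '%s\n' '..,.,' | count_pieces
```

This prints "4 3 +". The contents of a are overwritten by the second call, which does not matter because only the return values are used.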

Regarding the errors in the script's output, it may well be related to those lines which should be ignored. You didn't mention any "rogues" in your original post. The code's logic assumes that the file is well-formed, i.e. every line will be used and that they number exactly as required for the number of comparisons to be made.

Could you please provide an example or two of the "rogue" lines which should be ignored? Do they occur between sets of dupes or embedded within them? A minimal data sample with these special cases would help me help you. I assume when you mean that they include letters outside AaCcGgTt that doesn't include the '^F' I'm seeing in your sample data. Is that a form-feed control character in the data or is it a literal caret ("^") followed by a literal upper case eff ("F")?

alister

Hi Alister,

I don't think any of those characters cause the error. Many of the lines have them too and the output results are OK. When I took out the lines with incorrect output and reran them with the script, the results were fine. Looking at the preceding lines, I came to realise that some lines have no duplicates, i.e. they are unique and contain no . or ,

Correct me if I'm wrong, but will the last line of your script always look for a duplicate, since it is an array statement? Any suggestions on how to modify this line?

Below is the example I'm talking about:

ccCCcc$c$cCC$CC$ccccCc$CCCccccCcccccCCCcCCcCccCccCCCCCCCcCcCCcCCCcccCCCCCC 0 37 0 0 0 32 0 0
ggGGgGGgggGGGGggggGgggggGGGgGGgGggGggGGGGGGGgGgGGgGGGgggGGGGGGg 0 0 35 0 0 0 28 0
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0

If I replaced line 2 of the above example with an empty line, I still got a + but the following lines were then correct:

ccCCcc$c$cCC$CC$ccccCc$CCCccccCcccccCCCcCCcCccCccCCCCCCCcCcCCcCCCcccCCCCCC 0 37 0 0 0 32 0 0 +
ggGGgGGgggGGGGggggGgggggGGGgGGgGggGggGGGGGGGgGgGGgGGGgggGGGGGGg 0 0 35 0 0 0 28 0 +
 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 -
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
,,.$.,..,....,.G,g,T.gG..Gg.,,,........,g,.....,g...,^F. 0 0 3 1 0 0 5 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 -
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
...,.,,,..,c,...$..,,...,.cgA,,.GG,...........,,,,.G,,.Ng,,.G.,. 1 0 4 0 0 2 2 0 +
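
$ cat dna.awk
# A new set of duplicates starts: restart at column 2 (the "A" count).
old0!=$0 { old0=$0; i=2 }

# Letter pairs may remain for this set: column i is the uppercase count,
# column i+4 the matching lowercase count.
i<=5 {
    # Skip pairs for which both counts are zero.
    while (!($i || $(i+4)) && i<=5)
        i++
    if (i<=5) {
        print $0, ($i>$(i+4) ? "+" : "-")
        i++
        next
    }
}

# All letter pairs are done: compare dots with commas, but only on lines
# that contain at least one of them.
i==6 && /\.|,/ {
    print $0, (split($0, a, /\./) > split($0, a, /,/) ? "+" : "-")
}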

$ cat data
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0
ccCCcc$c$cCC$CC$ccccCc$CCCccccCcccccCCCcCCcCccCccCCCCCCCcCcCCcCCCcccCCCCCC 0 37 0 0 0 32 0 0
ggGGgGGgggGGGGggggGgggggGGGgGGgGggGggGGGGGGGgGgGGgGGGgggGGGGGGg 0 0 35 0 0 0 28 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0

$ awk -f dna.awk data
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
.......GGGG,.G,,G...G.,.T...G.,..,.,,^F, 0 0 8 1 0 0 0 0 +
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
,gc,,cga,g,c,,,,,,, 0 0 0 0 1 3 3 0 -
ccCCcc$c$cCC$CC$ccccCc$CCCccccCcccccCCCcCCcCccCccCCCCCCCcCcCCcCCCcccCCCCCC 0 37 0 0 0 32 0 0 +
ggGGgGGgggGGGGggggGgggggGGGgGGgGggGggGGGGGGGgGgGGgGGGgggGGGGGGg 0 0 35 0 0 0 28 0 +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 -
.....,,..,,...,,......,...cA.c,cC. 1 1 0 0 0 3 0 0 +

Hi Alister,

Thanks for the modified script. The errors are not corrected but some of the lines got deleted. When I ran it on my data of about 4.5 million lines, nearly 3000 were deleted. However, I managed to correct the bug by removing the backslash in the last 2 lines and everything looks fine now:

i==6 && /\.|,/ {
    print $0, (split($0, a, /\./) > split($0, a, /,/) ? "+" : "-")

to:

i==6 && /.|,/ {
    print $0, (split($0, a, /./) > split($0, a, /,/) ? "+" : "-")

Once again, thank you for your help. Took me nearly a week to get this done.
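
A word of caution on that change, offered as a sketch on made-up input rather than a tested fix for the full file: in a regular expression an unescaped . matches any character. So /.|,/ matches every non-empty line, and split($0, a, /./) breaks the line at every character, which returns more pieces than the comma split for any non-empty line. The function name compare_patterns is only for this illustration.

```shell
# compare_patterns (hypothetical helper): print the sign obtained with the
# escaped dot, then the sign obtained with the unescaped dot.
compare_patterns() {
    awk '{
        escaped   = (split($0, a, /\./) > split($0, a, /,/)) ? "+" : "-"
        unescaped = (split($0, a, /./)  > split($0, a, /,/)) ? "+" : "-"
        print escaped, unescaped
    }'
}

# Sample string with 1 dot and 3 commas.
printf '%s\n' ',,.,' | compare_patterns
```

This prints "- +": with the escaped dot the commas win, while the unescaped version reports "+" even though commas outnumber dots. If the aim was only to stop losing lines that contain neither character, dropping the /\.|,/ condition from the pattern while keeping /\./ inside split() may be closer to the intent, but that is an assumption about the data.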