Writing an algorithm to recode data points

I have a file that has been partially recoded so that data points that were formerly letter combinations are now -1, 0, or 1. I need to finish recoding the GG and CC data points. The file looks like this:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
838469. -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847.4 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

The problem with the GG and CC is that in either case they can be a -1 or a 1, depending on what has already been recoded. If a GG is in a column that already contains 1's then GG must = -1. If the GG is in a column that already contains -1's, then the GG must be a 1. This is also true for the CC columns. I have a total of >64,000 columns so I can not go through and list which column is which. It has been suggested that I need to write an algorithm to do this but I am not very familiar with programming. Can anyone help me?
Thanks!

so... in your example, for just GG...
column 1 GG would be 1
column 4 GG would be -1
column 6 GG would be 1
but...
a) column 7 GG would be... what ? there is no 1 or -1 ?
b) column 5 CC does not have a 1 or -1, either?
c) are you garuanteed to have at least one 1 or -1 in each column in the entire file?
d) are the GG's to be recoded first, and then the CC's recoded based on the recoded GG's?

I think I need some more information.

I've added your file below, padded out with spaces, just to make the columns more readable for discussion:

ID         1   2   3   4   5   6   7   8
83845676   0   0   0   0  CC  -1  CC  CC
838469.   -1  -1   1  GG  CC   0  CC   1
83847041  -1   .   0   0   .   0   0   0
83847.4    0  -1   1   1  CC   0   0   0
83847085   0  CC   0   0   0   0   0   0
83847118   .  -1   1  GG   .  GG  CC   0
83847162  GG  -1   1   0   0   0   0   0
83847165  -1  -1   .  GG  CC   0  GG   0


c) are you garuanteed to have at least one 1 or -1 in each column in the entire file?

Also e) Do you guarantee that no column contains both -1 and 1?

Yes so far you are correct on the GG assignments. So to answer your questions I think it may help more if I answer them out of order:

c) are you garuanteed to have at least one 1 or -1 in each column in the entire file? NO. Some columns may have a combination of -1's, 0's, and 1's and some columns may have only two of these. I do not believe there are any columns that will be all 1's, all -1's, or all 0's

d) are the GG's to be recoded first, and then the CC's recoded based on the recoded GG's? Perhaps it would be easiest to recode the CC's first and then recode the GG's based on how the CC's are coded. The only times CC's will = 1 are in the columns that already have -1 in them. All other times CC will = -1. When CC and GG are in a column together (as in column 7), CC = -1 and GG = 1.

a) column 7 GG would be... what ? there is no 1 or -1 ?
b) column 5 CC does not have a 1 or -1, either?
These are so confusing because I provided a small, bad example! I apologize!! My real file has 1079 rows and 64,000 columns.... so, if there is a column with only 00's and GG's, GG's will =1. If there is a column with only 00's (or missing values) and CC's, CC's will = -1

I hope this helps! Please let me know if you have any other questions!
Thanks!!

Hello Sdanner and Steadyonabix,
I have tried to come up with an easier way to describe what I need to do with my file. If you recall I have a file that looks like this:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
838469. -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847.4 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

I have to get everything recoded for a specific program that only recognizes -1, 0, and 1 but the CC and GG cause an issue because they can take on multiple values. Perhaps calculating the minimum and maximum of each column may be easier (?)....

When min = -1 and max = 0, then both CC and GG = 1;
When min = 0 and max = 1, then both CC and GG = 1;
When both the min and max = 0, then CC = -1 and GG = 1;

I hope this clears things up a bit! Any help you could provide would be GREATLY appreciated!

Thanks,
Doob

I think I get what you mean: - min = the lowest value in the column max = the highest value in the column but to prove it please post the expected output for the table using the new algorithm.
Please use the correctly formated table posted by sdanner and put it between code tags.Thanks

Here is the original file formatted by Sdanner:
Code:
ID 1 2 3 4 5 6 7 883845676 0 0 0 0 CC -1 CC CC83846900 -1 -1 1 GG CC 0 CC 183847041 -1 . 0 0 . 0 0 083847004 0 -1 1 1 CC 0 0 083847085 0 CC 0 0 0 0 0 083847118 . -1 1 GG . GG CC 083847162 GG -1 1 0 0 0 0 083847165 -1 -1 . GG CC 0 GG 0

And here is the final output I would expect based on the minimum/maximum criteria:
Code:
ID 1 2 3 4 5 6 7 883845676 0 0 0 0 -1 -1 -1 -183846900 -1 -1 1 -1 -1 0 -1 183847041 -1 . 0 0 . 0 0 083847004 0 -1 1 1 -1 0 0 083847085 0 1 0 0 0 0 0 083847118 . -1 1 -1 . 1 -1 083847162 1 -1 1 0 0 0 0 083847165 -1 -1 . -1 -1 0 1 0
Thanks!
Doob

---------- Post updated at 10:54 AM ---------- Previous update was at 10:43 AM ----------

I am not quite sure why my tables did not upload correctly so I'll try to write it out for you...
Original input:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

Output needed:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0

I am sorry for the poor formatting, but I could not get this uploaded correctly!

You have to

  1. Read the Forum Rules
  2. Learn howto use [code] tags.
  3. Edit your post and add [code] tags to preserve data formatting.

Here is the original file:

ID         1   2   3   4   5   6   7   8
83845676   0   0   0   0  CC  -1  CC  CC
83846900  -1  -1   1  GG  CC   0  CC   1
83847041  -1   .   0   0   .   0   0   0
83847004   0  -1   1   1  CC   0   0   0
83847085   0  CC   0   0   0   0   0   0
83847118   .  -1   1  GG   .  GG  CC   0
83847162  GG  -1   1   0   0   0   0   0
83847165  -1  -1   .  GG  CC   0  GG   0

Here is the output I need:

ID         1   2   3   4   5   6   7   8
83845676   0   0   0   0  -1  -1  -1  -1
83846900  -1  -1   1  -1  -1   0  -1   1
83847041  -1   .   0   0   .   0   0   0
83847004   0  -1   1   1  -1   0   0   0
83847085   0   1   0   0   0   0   0   0
83847118   .  -1   1  -1   .   1  -1   0
83847162   1  -1   1   0   0   0   0   0
83847165  -1  -1   .  -1  -1   0   1   0

Thanks!

deleted.

I don't have time to test this thoroughly so try and let me know of any bugs: -

CODE

nawk '
    ( FNR == 1 ){ 
        f++ 
        header = $0
        next 
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
        for( fi = 2; fi <= NF; fi++ ){
            gsub(/00/, ".", $fi)
            gsub(/A[CGT]|C[GT]|GT/, "0", $fi)
            gsub(/AA/, "-1", $fi)
            gsub(/TT/, "1", $fi)
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
            if( $fi == "CC" || $fi == "GG" ){
                if( cls[fn, 0] ) { min = 0 ; max = 0 }
                if( cls[fn, -1] )
                    min = -1
                if( cls[fn, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[NF, $i]++
        }
    }
' infile infile

INPUT
I have gone back to the original input file as the one you list has "00" and "1" changed to "." in the first column.

cat infile
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

OUTPUT

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0

PS To enter code between code tags highlight the code and then click on the # symbol on the toolbar just above the text box.

Good luck

---------- Post updated at 03:51 PM ---------- Previous update was at 07:17 AM ----------

I have had time to look at this in a little more detail and can see it needed a fix.
I still can't get the output you require but am unsure if this is because your example output is flawed or not so I need you to take a look at the output and see if it is wrong or not.
I wrote the code to do the processing you want but have tried to add in danmero's code without really understanding if it does what you want or not.

Here is the code with the fix: -

nawk '
    ( FNR == 1 ){ 
        f++ 
        header = $0
        next 
    }
    ( f == 2 ){ printf("%s\n", header) ; f++ }
    ## Now we process each record for CC GG etc and apply our rules to them
    ( f == 3 ) {
                tmp = $1
                gsub(/00/, ".")
                gsub(/A[CGT]|C[GT]|GT/, "0")
                gsub(/AA/, "-1")
                gsub(/TT/, "1")
                $1 = tmp
        for( fi = 2; fi <= NF; fi++ ){
            ## When min = -1 and max = 0, then both CC and GG = 1;
            ## When min = 0 and max = 1, then both CC and GG = 1;
            ## When both the min and max = 0, then CC = -1 and GG = 1;
            ## When min = -1 and max = 1 NO RULE DEFINED
            if( $fi == "CC" || $fi == "GG" ){
                if( cls[fn, 0] ) { min = 0 ; max = 0 }
                if( cls[fn, -1] )
                    min = -1
                if( cls[fn, 1] )
                    max = 1

                if( ( min ==  0 ) && ( max == 0 ) ){
                    if( $fi == "CC" )
                        $fi = -1
                    else
                        $fi = 1
                }
                if( ( min == -1 ) && ( max == 0 ) )
                    $fi = 1
                if( ( min ==  0 ) && ( max == 1 ) )
                    $fi = 1
            }
        }
        print $0
    }

    ( f == 1 ){     ## First pass of file
        for( i = 2; i <= NF; i++ ){
            cls[NF, $i]++
        }
    }
' infile infile

Here is the input file: -

ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

Here is the output: -

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0

The code I filched off danmero was based on your earlier spec: -

Hello again,
Again, I apologize for the confsion. I made a mistake in the first post, the letters should be recoded to -1, 0, 1. 
This is the tricky part. I need to recode the letters on a per column, alphabetical order basis. 
There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1
 
Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1. 
The problem come when recoding CC and GG. As you can see, in some rows CC will come first in the alphabet and will be recoded as -1 
(When the combo is CC, CG, GG) . However, in some columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC). 
The same problem occurs with GG. IS there any solution to this issue? I hope I explained it better this time!!

I don't understand this, you start by talking of columns and end talking of rows so I am just assuming danmero understood you and posted code that did what you want.

Let me know if this output is correct or not.

Cheers