recoding data points using SED??

doobedoo · October 9, 2009, 11:00am

Hello all,
I have a data file that needs some serious work...I have no idea how to implement the changes that are needed!

The file is a genotypic file with >64,000 columns representing genetic markers, a header line, and >1100 rows. It looks like this:

ID       1     2     3     4    ........  64,000
AX65   AA   CT   TT   CC  ........    AT
DF00   AG   CC   AT   CG  ........    AA
HJ34   00    TT   TT   GG  ........    AA
KL98   AA    CC   AA   CG .........    00
SE00   GG    CT   00   GG .........    TT

The whole idea is to get each marker (column) recoded as either -10, 0 or 10 with the missing values (00) recoded as the average of each column. This will need to be accomplished in several steps.

*First, I need to recode the missing values that are currently coded as "00" to something else such as a "." HOWEVER I do not want anything in the ID column (first column) to be recoded.
*Second, I need to recode each column as -10, 0, or 10 depending on the alphabetical order. For example, in columns that contain AA, AG, and GG these will be recoded as -10, 0, and 10, respectively. Likewise, columns that contain CC, CG, and GG will be -10, 0, and 10 respectively.
**** There are several combinations of genotypes:

                AA, AC, CC
                AA, AG, GG
                AA, AT, TT
                CC, CG, GG
                CC, CT, TT
                GG, GT, TT

*Finally, I need to calculate the average of each marker (each column) and replace the missing values "." with this average value which will be different for every column

I am so sorry to have such a long grocery list of changes to implement, but like I said I have no idea how to do any of this...any help you can provide with any of these steps would be greatly appreciated!!
Thank you in advance,
Doob

steadyonabix · October 9, 2009, 2:11pm

I have read this post a number of times and have to say I am confused.

It would be better if you put a representative file together containing all input variations you need to account for.

Then put a file together showing the desired output for that input file.

Clearly define any conventions you use and try to avoid explaining how it will be achieved, just explain what you want.

Good luck

doobedoo · October 9, 2009, 3:56pm

I apologize for the confusion! I understand where I need to go with this but I have no clue how to tell the computer to do it so it is hard for me to explain it to others as well...let me try again...

My input file currently looks like this:

ID        1   2   3   4   5   6   7   8

83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT

I want to rename the missing values so they are just a period and save an output file like this:

ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA . AT GT . AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 . AA TT GG . GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA . GG CC AG GG CT
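
A minimal sketch of this first step, assuming whitespace-separated fields and a single header line. The function name mark_missing is made up for illustration. It compares whole fields from column 2 onward, so an ID such as 83846900 that happens to contain "00" is left alone:

```shell
# Hypothetical helper: mark_missing FILE
# Replaces a genotype of exactly "00" with "." in every column except the first.
mark_missing() {
  awk '
    FNR > 1 {                        # skip the header line
      for (i = 2; i <= NF; i++)      # start at 2 so the ID column is never touched
        if ($i == "00") $i = "."
    }
    { print }
  ' "$1"
}

# Usage (file names are placeholders):
#   mark_missing genotypes.txt > genotypes.missing.txt
```

Rows that get modified are written back with single spaces between fields, which matches the output shown above.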

Then I need to create an output file that has all of the letters recoded as -1, 0, or 1. This should be done in alphabetical order and on a per column basis so that:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0

Finally I need to calculate the average of each column and replace the missing values from that column with the average:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 -0.5714 0 0 -0.6667 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 -0.2857 -1 1 -1 -0.6667 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 0.5714 -1 -1 0 1 0

This will be the final file. Does this make more sense or have I confused you more??
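
One way this last step might be sketched in awk, assuming the file has already been recoded to -1, 0, 1 with "." for missing values. The name impute_means and the four-significant-digit formatting are illustrative choices, not requirements. The file is read twice: once to accumulate per-column sums and counts, and once to fill the gaps.

```shell
# Hypothetical helper: impute_means FILE
# FILE is assumed to be already recoded (-1, 0, 1) with "." for missing values.
impute_means() {
  awk '
    # Pass 1: accumulate the sum and count of non-missing values per column.
    NR == FNR {
      if (FNR > 1)
        for (i = 2; i <= NF; i++)
          if ($i != ".") { sum[i] += $i; n[i]++ }
      next
    }
    # Pass 2: replace each "." with the column mean (4 significant digits).
    # A column with no observed values at all is left as ".".
    FNR > 1 {
      for (i = 2; i <= NF; i++)
        if ($i == ".")
          $i = (n[i] ? sprintf("%.4g", sum[i] / n[i]) : ".")
    }
    { print }
  ' "$1" "$1"
}

# Usage (file names are placeholders):
#   impute_means recoded.txt > final.txt
```

Only one sum and one count are kept per column, so memory use should stay small even with >64,000 markers.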

Thanks

steadyonabix · October 10, 2009, 3:39am

Thanks, that's much better, but there is still a little confusion over the conversion of letters to numbers.

In the first post you say you want to convert letter pairs to -10 0 or 10 but in your second -1 0 or 1.

Can you put a table of values together that shows the correspondence between letter pairs and numbers so we are clear on what you want to convert?

Cheers

doobedoo · October 10, 2009, 12:58pm

Hello again,
Again, I apologize for the confusion. I made a mistake in the first post: the letters should be recoded to -1, 0, 1. This is the tricky part. I need to recode the letters on a per column, alphabetical order basis. There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1

Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1. The problem comes when recoding CC and GG. As you can see, in some columns CC will come first in the alphabet and will be recoded as -1 (when the combo is CC, CG, GG). However, in other columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC). The same problem occurs with GG. Is there any solution to this issue? I hope I explained it better this time!!

Thank you so much for your patience!!

danmero · October 10, 2009, 1:42pm

doobedoo:

ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA . AT GT . AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 . AA TT GG . GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA . GG CC AG GG CT

Then I need to create an output file that has all of the letters recoded as -1, 0, or 1. This should be done in alphabetical order and on a per column basis so that:

ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 -1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 1 0 0 0 0 0 0
83847118 . -1 1 -1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . -1 -1 0 1 0

Your example is not consistent.

awk 'NR>1{for(i=2;i<=NF;i++){sub(/^00$/,".",$i);sub(/^(A[CGT]|C[GT]|GT)$/,"0",$i);sub(/^AA$/,"-1",$i);sub(/^TT$/,"1",$i)}}1' file
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 CC -1 CC CC
83846900 -1 -1 1 GG CC 0 CC 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 CC 0 0 0
83847085 0 CC 0 0 0 0 0 0
83847118 . -1 1 GG . GG CC 0
83847162 GG -1 1 0 0 0 0 0
83847165 -1 -1 . GG CC 0 GG 0

Now try to solve/elaborate on CC & GG problem.

doobedoo · October 11, 2009, 7:39pm

Ok great. Thank you so much! Now the problem with the GG and CC is that in either case they can be a -1 or a 1, depending on what has already been recoded. If a GG is in a column that already contains 1's then GG must = -1. If the GG is in a column that already contains -1's, then the GG must be a 1. This is also true for the CC columns. I have a total of >64,000 columns so I can not go through and list which column is which. Any suggestions?

steadyonabix · October 12, 2009, 2:34pm

Then, rather than listing all possible permutations manually, you need to reduce it to an algorithm. That is, after all, what a developer would do.
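
As a hedged sketch of what such an algorithm might look like: instead of listing combinations, look at the allele letters that actually occur in each column. The homozygote of the alphabetically first letter becomes -1, the homozygote of the last letter becomes 1, and any mixed pair becomes 0. This uses the raw letters rather than the "column already contains 1's" rule, because a column holding only CC and CG has no -1 or 1 to go by. Assumptions: each marker has at most two alleles, fields are whitespace-separated, there is one header line, and missing values are "00" or ".". The function name recode_genotypes is hypothetical. A column in which only one allele occurs is ambiguous, and this sketch arbitrarily codes its homozygote as -1.

```shell
# Hypothetical helper: recode_genotypes FILE
# Reads FILE twice: first to learn each column's alleles, then to recode.
recode_genotypes() {
  awk '
    # Pass 1: per column, remember the alphabetically first (lo) and
    # last (hi) allele letter seen in any non-missing genotype.
    NR == FNR {
      if (FNR > 1)
        for (i = 2; i <= NF; i++) {
          if ($i == "00" || $i == ".") continue
          for (k = 1; k <= 2; k++) {
            a = substr($i, k, 1)
            if (!(i in lo) || a < lo[i]) lo[i] = a
            if (!(i in hi) || a > hi[i]) hi[i] = a
          }
        }
      next
    }
    # Pass 2: the header is printed unchanged.
    FNR == 1 { print; next }
    {
      for (i = 2; i <= NF; i++) {          # column 1 (the ID) is never touched
        if ($i == "00" || $i == ".") { $i = "."; continue }
        a = substr($i, 1, 1); b = substr($i, 2, 1)
        if (a != b)          $i = 0        # mixed pair, e.g. AG or CT
        else if (a == lo[i]) $i = -1       # homozygote of the first allele
        else                 $i = 1        # homozygote of the last allele
      }
      print
    }
  ' "$1" "$1"
}

# Usage (file names are placeholders):
#   recode_genotypes genotypes.txt > recoded.txt
```

Only two letters are stored per column, so the >64,000 markers should not be a memory problem; the cost is reading the file twice.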