Hello all,
I have a data file that needs some serious work...I have no idea how to implement the changes that are needed!
The file is a genotypic file with >64,000 columns representing genetic markers, a header line, and >1100 rows that looks like this:
ID 1 2 3 4 ........ 64,000
AX65 AA CT TT CC ........ AT
DF00 AG CC AT CG ........ AA
HJ34 00 TT TT GG ........ AA
KL98 AA CC AA CG ......... 00
SE00 GG CT 00 GG ......... TT
The whole idea is to get each marker (column) recoded as either -10, 0 or 10 with the missing values (00) recoded as the average of each column. This will need to be accomplished in several steps.
*First, I need to recode the missing values that are currently coded as "00" to something else such as a "." HOWEVER I do not want anything in the ID column (first column) to be recoded.
*Second, I need to recode each column as -10, 0, or 10 depending on the alphabetical order. For example, in columns that contain AA, AG, and GG these will be recoded as -10, 0, and 10, respectively. Likewise, columns that contain CC, CG, and GG will be -10, 0, and 10 respectively.
**** There are several combinations of genotypes:
AA, AC, CC
AA, AG, GG
AA, AT, TT
CC, CG, GG
CC, CT, TT
GG, GT, TT
*Finally, I need to calculate the average of each marker (each column) and replace the missing values "." with this average value which will be different for every column
I am so sorry to have such a long grocery list of changes to implement, but like I said I have no idea how to do any of this...any help you can provide with any of these steps would be greatly appreciated!!
Thank you in advance,
Doob