[SOLVED] Converting data from one format to the other

newbie83 · December 11, 2012, 1:21pm

Hi All,

I need to convert an exel spreadsheet into a SAS dataset,
and the following format change is needed. Please help, this is too complex
for a biologist.

Let me describe the input.

1st row is generation.1st column in keyword 'generation', starting 2nd column there are 5 generations in my actual data,
I have shown 3 here, namely G1,G2 and PAR.

2nd row is family name,1st column is keyword 'gene', from 2nd column format is NameParent1-NameParent2-RandomNumber_Replicate.
If 'Parent' is present in the name, then format is NameParent-Parent-RandomNumber_Replicate and it should be included in the output.
The characters upto the second '-' is the family name, replicate (can be 1,2 or 3) is the number after ''.
Random number (between second '-' and '') can be ignored.

Starting 3rd row are gene names(column 1) and values(column 2 through 9).
There are 30000 genes in my actual data set.

The gene values need to repeat according to nested format shown below,
Only if the generation is PAR (stands for Parent), replace the generation 'PAR' by name of the parent
specified by the characters upto first '-'. For example if P1-Parent-9_1 is the column name and the
corresponding generation is PAR, the output should have 'P1' instead of 'PAR' in the generations column.

Sample input

generation G1    G1    G1    G2    G2    G2    PAR    PAR    PAR    PAR    PAR    PAR
gene    P1-P2-24_3    P1-H-56_2    P1-P2-84_1    P1-P2-34_3    P1-P2-33_1    P1-P2-99_2 P1-Parent-9_1 P1-Parent-43_2 P1-Parent-45_3 P2-Parent-62_1 P2-Parent-43_2 P2-Parent-11_3
gene1    0    0    0    0    0    0    0    0    0    3    3    7
gene2    2    1    1    2    6    4    6    7    8    67    6    66

Expected Output

family    rep    generation    gene     value        
P1-P2    3    G1        gene1    0
P1-P2    2    G1        gene1    0    
P1-P2    1    G1        gene1    0
P1-P2    3    G2        gene1    0
P1-P2    1    G2        gene1    0    
P1-P2    2    G2        gene1     0
P1-P2    1    P1        gene1    0
P1-P2    2    P1        gene1     0
P1-P2    3    P1        gene1    0
P1-P2    1    P2        gene1    3
P1-P2    2    P2        gene1    3
P1-P2    3    P2        gene1    7
P1-P2    3    G1        gene2    2
P1-P2    2    G1        gene2    1    
P1-P2    1    G1        gene2    1
P1-P2    3    G2        gene2    2
P1-P2    1    G2        gene2    6    
P1-P2    2    G2        gene2     4
P1-P2    1    P1        gene2     6
P1-P2    2    P1        gene2     7
P1-P2    3    P1        gene2    8
P1-P2    1    P2        gene2    67
P1-P2    2    P2        gene2    6
P1-P2    3    P2        gene2    66

Thanks

Corona688 · December 11, 2012, 1:23pm

Is this excel spreadsheet actually an excel spreadsheet, or a csv, or a flatfile?

If a flatfile, what separator does it use?

newbie83 · December 11, 2012, 1:24pm

I have excel spreadsheet but I can convert that into flat text file with tabbed separators.

Thanks

Corona688 · December 11, 2012, 1:40pm

Partly working, but one thing I don't understand, where do you get P1-P2 from P2-Parent-11_3 ?

newbie83 · December 11, 2012, 1:49pm

Just because both P1-Parent and P2-Parent belongs to the P1-P2 family, maybe the keyword 'P2' from P2-Parent-11_3 can be matched with 'P2' in 'P1-P2' ??

I have a different file for each family, so if you force both P1 and P2 to be a part of the P1-P2 family, that will be fine for now.

Corona688 · December 11, 2012, 1:52pm

How do they belong to the P1-P2 family?

rdrtx1 · December 11, 2012, 1:53pm

try:

awk '
BEGIN {
  OFS="\t";
  print "family","rep","generation","gene","value";
}
/generation/ {for (i=2; i<=NF; i++) gn=$i; next;}
$1 == "gene" {fm=$2; sub("-[^-]*$","",fm);
  par=fm;
  sub("-.*","", par);
  for (i=2; i<=NF; i++) {
    if ($i ~ /Par/) {gn=$i; sub("-.*","",gn);}
    sub(".*_","",$i);rp=$i
  }; next;
}
{ for (i=2; i<=NF; i++ ) {
   print fm, rp, gn, $1, $i;
  }
}
' OFS="\t" input

newbie83 · December 11, 2012, 2:05pm

Hi Corona688,

The only way to think of it is 'P1-Parent' is a part of the 'P1-P2' family because a family name 'P1-P2' is defined by two parents P1 and P2. So all the generations will have
'P1-P2' as family name if P1 and P2 are the parents. If P1 and P3 are parents, then family name is 'P1-P3'.

One way to do this will be to extract the characters from a G1 generation family name , for example extract P1-P2 from P1-P2-24_3 and hard-code into a particular file.
Since I have only 1 family per file, this will work fine.

Corona688 · December 11, 2012, 2:10pm

I remain confused, but rdtx's code seems to be working simpler and better than my code in any case.

newbie83 · December 11, 2012, 2:36pm

this works absolutely fine and quick. many thanks rdrtx1 and Corona688 .