Reformat text table

yifangt · January 9, 2011, 12:33am

Hello,

I have a challenge here to reformat a text table. The original table follows:

Item01: m1, m2, m3: A; m4, m5, m6: B; m7, m8: C; m9 m10: D
Item02: m1, m9, m10: A; m7, m5, m6: C; m2, m3, m4, m8: D
Item03: m1, m6, m7: A; m2: B; m3, m4: C; m5 m8 m9 m10: D
.
.
.

Please note:
1) m1, m2, ~ m10 show up each row, always!
2) m1~m10 may be in random order;
3) A, B, C, D may NOT always show up in each row;

I want the table reformatted to:

        m1   m2   m3   m4   m5   m6   m7   m8   m9   m10
Item01  A    A    A    B    B    B    C    C    D    D
Item02  A    D    D    D    C    C    C    D    A    A
Item03  A    B    C    C    D    A    A    D    D    D

.
.
.

which mean with a header as the first row (m1 ~ m10) and the Item-- as the first column. The reformatted one is a two dimension structure that is much easier to look at.

I have been struggling with it using PERL by myself for a long time. Could not figure it out. Any help is highly appreciated. Thanks in advance!

Yifangt

Scrutinizer · January 9, 2011, 5:05am

Try this:

awk -F'[ \t:;,]*' '{split($0,T); for(i=NF;i>=2;i--)if (T~/m[0-9]/){sub(/m/,x,T);$(T+1)=c}else c=T; NF=11}1 ' OFS="\t" infile

Item01  A       A       A       B       B       B       C       C       D       D
Item02  A       D       D       D       C       C       C       D       A       A
Item03  A       B       C       C       D       A       A       D       D       D

yifangt · January 9, 2011, 9:17am

Thanks Scrutinizer!

This is amazing and too complicated to me. Is it possible for you to explain it to me, as I can only catch part of your code?

Actually my data is much bigger than the sample and I ignored the header row and some of the columns. I thought of using perl to parse it, and combined each row with the same SNP name in one row.

1) Each row start with the SNP name that can be repeated for 4 times at most (they are neighbour rows). Some only once. The output is a combined single row for all the same SNP;
2) If the 1st column is the same then the 2nd, 4th and 5th are the same (for same SNP), which means the same SNP in different rows. This is the most different part from my first post;
3) There are 96 variants for each SNP. The variant not listed for a specific SNP indicates the SNP is missing for it and should be labeled as - or NA for consistency of the output format;

Sorry for not put the raw data first as I was trying perl script by using hash and I am a geneticist fond of programming. Anyway, thank you if you can have a look at this again.

SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Shahdara;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Uod-1;Uod-7;Cvi-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Est-1;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;NFA-8;NFA-10;Sq-1;Rennes-1;Rennes-11;Lz-0;Br-0;Ag-0;Mz-0;Fei-0    AT1G01280    112771
.
.
.

Thanks again!

Yifangt

m.d.ludwig · January 9, 2011, 10:51am

In your data sample:

yifangt:

SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;...;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;...;Mz-0;Fei-0    AT1G01280    112771
.
.
.

There are six columns of data but four column headers.
Are the first three data columns the "SNP-Name"?
And the last two the "value", the 'A', 'B', 'C', 'D' in your example?

---------- Post updated at 10:51 AM ---------- Previous update was at 10:35 AM ----------

My initial implementation to generate a CSV file:

use strict;
use warnings;

$\ = "\n";
$, = '';

my %H;
my %D;

<>; # toss the header

while (<>) {
    chomp;
    my ($snpname, $snpidx, $acgt, $chomosomelist, $locus, $location) = split;

    unless (defined $location) {
        print STDERR $ARGV, '(', $., '): malformed entry - ', $_;
        next;
    }

#> adjust these as required to get a proper "label" and "value"

    $snpname .= '-' . $snpidx;
    $locus   .= '(' . $location . ')';

    foreach my $c (split /;/, $chomosomelist) {
        $H{$c}++;
        $D{$snpname}->{$c} = $locus;
    }
}

sub csv {
    local $, = ',';
    print map { defined $_ ? '"' . $_ . '"' : '"NA"' } @_;
}

my @H = sort keys %H;

csv '', @H;

foreach my $snpname (sort keys %D) {
    my $X = $D{$snpname};
    csv $snpname, map { $X->{$_} } @H;
}

yifangt · January 9, 2011, 3:38pm

mixed up reply, deleted!

Chubler_XL · January 9, 2011, 7:00pm

A lot of the elegance of Scrutinizer's solution was that your variants were numbered in the test data with the column number required (and each data set was on 1 line).

I've created a solution that works with your modified spec, and for simplicity I've put the column variant names in an external file (var), 1 per line:

$ head var
Ag-0
An-1
Bay-0
Bil-5
Bil-7
Bor-1
Bor-4
Br-0
Bur-0
C24

$ cat process_snp 
awk '
BEGIN { printf "SNP chromosome" }
NR == FNR { vars[$0]=NR-1; var_cnt=NR; printf " "$0; next }
FNR == 1 { printf " Locus location\n"; next}
SNP && SNP != $1 {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) {
     printf " "(T=="" ? "-" : T);
     T="";
  }
  printf " %s %s\n", lo, ln
}
{
  SNP=$1; ch=$2; lo=$5; ln=$6
  split($4,V,";");
  for(var in V) T[vars[V[var]]]=$3;
}
END {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) printf " "(T=="" ? "-" : T)
  printf " %s %s\n", lo, ln
}' var datafile

Output is a follows:

$ ./process_snp
SNP chromosome Ag-0 An-1 Bay-0 Bil-5 Bil-7 Bor-1 Bor-4 Br-0 Bur-0 C24 CIBC17 CIBC5 CS22491 Col-0 Ct-1 Cvi-0 Eden-1 Eden-2 Edi-0 Ei-2 Est-1 Fab-2 Fab-4 Fei-0 Ga-0 Goettingen-22 Goettingen-7 Gu-0 Gy-0 HR-10 HR5 KZ1 KZ9 Kas-1 Kin-0 Knox-10 Knox-18 Kondara Ler-1 Ll-0 Lov-1 Lov-5 Lp2-2 Lp2-6 Lz-0 Mr-0 Mrk-0 Ms-0 Mt-0 Mz-0 NFA-10 NFA-8 Nd-1 Nok-3 Omo2-1 Omo2-3 Oy-0 Pna-10 Pna-17 Pro-0 Pu2-23 Pu2-7 RRS-10 RRS-7 Ra-0 Rennes-1 Rennes-11 Rmx-A02 Rmx-A180 Se-0 Shahdara Sorbo Spr1-2 Spr1-6 Sq-1 Sq-8 Tamm-2 Tamm-27 Ts-1 Ts-5 Tsu-1 Ull2-3 Ull2-5 Uod-1 Uod-7 Var2-1 Var2-6 Wa-1 Wei-0 Ws-0 Ws-2 Wt-5 Yo-0 Zdr-1 Zdr-6 Locus location
BKN000000001 1 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C T C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C AT1G01280 112482
BKN000000002 1 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G AT1G01280 112561
BKN000000003 1 G A A A A A A G A A A A A A A A A A A A A A A G A A A A A G G A A A A A A A A A A A A A G A A A A G G G A A A A A A G A A A A A A G G A A A A A A A G A A A A A A A A A A A A A A A A A A A A AT1G01280 112771