Shortest path for each query from a csv file

rushadrena · November 25, 2013, 3:02am

Hi all, I have this file

__DATA__
 child,    Parent,    probability,
 M7,    Q,    P,
 M7,      M28,     E,
 M28,     M6,      E,
 M6,      Q,      Pl,
 & several hundred lines.....

Legend: P (= Probable) > Pl (= Plausible) > E (= Equivocal). For each child, I want to trace it back to Q (Query), but I need only the shortest path that leads to Q, along with the probabilities on that path. For example, for the input data shown above, the output should be:

__OUTPUT_1__
 M7:  M7<-Q = P
 M28: M28<-M6<-Q = Pl.E
 M6:  M6<-Q = Pl

But as we can see from the second row of the input data, M7 has another, longer path back to Q: M7<-M28<-M6<-Q = Pl.E.E. The code should have an option either to drop the longer paths and show only the shortest one, or to show all of them, i.e.:

__OUTPUT_2__
 M7:  M7<-Q = P
 M7:  M7<-M28<-M6<-Q = Pl.E.E
 M28: M28<-M6<-Q = Pl.E
 M6:  M6<-Q = Pl

This second output prints a path back to Q for every input row, so N input rows give N output rows.

The code that I have is not working on the data. Please help.

#! /usr/bin/perl
 my %DEF = (
 I   => [qw( P Pl P.P P.Pl Pl.P Pl.Pl P.P.P P.P.Pl P.Pl.P P.Pl.Pl Pl.P.P Pl.P.Pl Pl.Pl.P Pl.Pl.Pl )],
 II  => [qw( E P.E Pl.E P.P.E P.Pl.E Pl.P.E Pl.Pl.E )],
 III => [qw( E.P E.Pl P.E.P P.E.Pl Pl.E.P Pl.E.Pl E.P.P E.P.Pl E.Pl.P E.Pl.Pl )],
 IV  => [qw( E.E P.E.E Pl.E.E E.P.E E.Pl.E E.E.P E.E.Pl E.E.E )] );
 my @rank = map @$_, @DEF{qw(I II III IV)};
 my %rank = map {$rank[$_-1] => $_} 1..@rank;
 my @group = map {($_) x @{$DEF{$_}}} qw(I II III IV);
 my %group = map {$rank[$_-1] => $group[$_-1]."_".$_} 1..@group;
 sub rank { $rank{$a->[2]} <=> $rank{$b->[2]} }
 my %T;

 sub oh { map values %$_, @_ }
 sub ab {   my ($b, $a) = @_;   [$b->[0], $a->[1], qq($a->[2].$b->[2]), qq($b->[3]<-$a->[3])] 
}
 sub xtend {
 my $a = shift;
 map {ab $_, $a} oh @{$T{$a->[0]}}{@_} }
 sub ins { $T{$_[3] //= $_[1]}{$_[2]}{$_[0]} = \@_ }

 ins split /,\s*/ for <DATA>;
 ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);
 ins @$_ for map {xtend $_, qw(P E Pl)} (oh oh oh \%T);

 for (sort {rank} grep {$_->[1] eq 'Q'} (oh oh oh \%T)) {
 printf "%-4s: %20s,  %-8s %6s\n",
     $_->[0], qq($_->[0]<-$_->[3]), $_->[2], $group{$_->[2]};
 }  

 __DATA__
 M7    Q    P
 M54    M7    Pl
 M213    M54    E
 M206    M54    E
 M194    M54    E
 M53    M7    Pl
 M186    M53    Pl
 M194    M53    Pl
 M187    M53    E
 M204    M53    E
 M201    M53    E
 M202    M53    E
 M179    M53    E
 M173    M53    E
 M205    M53    E
 M195    M53    E
 M196    M53    E
 M197    M53    E
 M198    M53    E
 M57    M7    E
 M44    M7    E
 M61    M7    E
 M13    M7    E
 M50    M7    E
 M158    M50    P
 M157    M50    P
 M153    M50    Pl
 M162    M50    E
 M164    M50    E
 M165    M50    E
 M147    M50    E
 M159    M50    E

Chubler_XL · November 25, 2013, 4:38pm

If you're happy with an awk solution, you could try:

awk -F', *' -v shortest=1 '
function follow(i,j,v,b,r)
{
  b=999
  if(N[i] == "Q") return C[i]"<-Q = "P[i];

  # Row i links child C[i] to parent N[i] with probability P[i].
  # U[N[i]] holds the row numbers where that parent is itself a child.
  for(j=split(U[N[i]],V);j;j--) {
     v=C[i] "<-" follow(V[j]) "." P[i]
     if(split(v,H,".") < b) {
        r=v
        b=split(v,H,".")
     }
  }
  return r;
}
{ U[$1] = (U[$1]?U[$1] " ":"")NR
  C[NR]=$1
  N[NR]=$2
  P[NR]=$3
}

END {
   for(i=1;i<=NR;i++) {
       if (shortest) {
          v=follow(i)
          if(!(C[i] in A) || split(v,H,".") < split(A[C[i]],H,"."))
             A[C[i]]=v
       } else printf "%s: %s\n",C[i],follow(i)
    }
    if (shortest)
       for(i=1;i<=NR;i++) {
             if(C[i] in A) {
                 printf "%s: %s\n",C[i],A[C[i]];
                 delete A[C[i]];
            }
        }
}' infile

Change shortest=1 above to shortest=0 for the full list.

Output(2) for your data:

M7: M7<-Q = P
M54: M54<-M7<-Q = P.Pl
M213: M213<-M54<-M7<-Q = P.Pl.E
M206: M206<-M54<-M7<-Q = P.Pl.E
M194: M194<-M54<-M7<-Q = P.Pl.E
M53: M53<-M7<-Q = P.Pl
M186: M186<-M53<-M7<-Q = P.Pl.Pl
M187: M187<-M53<-M7<-Q = P.Pl.E
M204: M204<-M53<-M7<-Q = P.Pl.E
M201: M201<-M53<-M7<-Q = P.Pl.E
M202: M202<-M53<-M7<-Q = P.Pl.E
M179: M179<-M53<-M7<-Q = P.Pl.E
M173: M173<-M53<-M7<-Q = P.Pl.E
M205: M205<-M53<-M7<-Q = P.Pl.E
M195: M195<-M53<-M7<-Q = P.Pl.E
M196: M196<-M53<-M7<-Q = P.Pl.E
M197: M197<-M53<-M7<-Q = P.Pl.E
M198: M198<-M53<-M7<-Q = P.Pl.E
M57: M57<-M7<-Q = P.E
M44: M44<-M7<-Q = P.E
M61: M61<-M7<-Q = P.E
M13: M13<-M7<-Q = P.E
M50: M50<-M7<-Q = P.E
M158: M158<-M50<-M7<-Q = P.E.P
M157: M157<-M50<-M7<-Q = P.E.P
M153: M153<-M50<-M7<-Q = P.E.Pl
M162: M162<-M50<-M7<-Q = P.E.E
M164: M164<-M50<-M7<-Q = P.E.E
M165: M165<-M50<-M7<-Q = P.E.E
M147: M147<-M50<-M7<-Q = P.E.E
M159: M159<-M50<-M7<-Q = P.E.E
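
For comparison, here is a minimal Python sketch of the same idea: for every input row, follow the parent links up to Q and keep the chain with the fewest hops. It is not a translation of the awk script; the function name trace_to_query, its parameters, and the cycle guard are assumptions made for illustration, and the rows are assumed to be already split into (child, parent, probability) tuples.

```python
from collections import defaultdict


def trace_to_query(rows, shortest=True, root="Q"):
    """rows: list of (child, parent, probability). Returns formatted lines."""
    parents = defaultdict(list)
    for child, parent, prob in rows:
        parents[child].append((parent, prob))

    def best_path(node, seen):
        # Shortest chain from node up to root as (nodes, probs), or None.
        if node == root:
            return [root], []
        best = None
        for parent, prob in parents.get(node, []):
            if parent in seen:  # assumption: skip cycles instead of looping
                continue
            sub = best_path(parent, seen | {parent})
            if sub is not None and (best is None or len(sub[0]) + 1 < len(best[0])):
                best = ([node] + sub[0], sub[1] + [prob])
        return best

    # One result per input row that can reach the root through its parent.
    results = []
    for child, parent, prob in rows:
        sub = best_path(parent, {child, parent})
        if sub is not None:
            results.append((child, [child] + sub[0], sub[1] + [prob]))

    if shortest:
        # Keep only the chain with the fewest hops per child (first one wins ties).
        kept = {}
        for child, nodes, probs in results:
            if child not in kept or len(nodes) < len(kept[child][0]):
                kept[child] = (nodes, probs)
        results = [(c, n, p) for c, (n, p) in kept.items()]

    return [f"{c}: {'<-'.join(n)} = {'.'.join(p)}" for c, n, p in results]


# The four sample rows from the first post.
rows = [("M7", "Q", "P"), ("M7", "M28", "E"), ("M28", "M6", "E"), ("M6", "Q", "Pl")]
for line in trace_to_query(rows, shortest=False):
    print(line)
```

With shortest=False this prints one line per input row, as in OUTPUT_2 of the first post; with shortest=True the longer M7 chain is dropped, as in OUTPUT_1.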

rushadrena · November 25, 2013, 11:12pm

Thanks Chubler,
But how do I run this script of yours? I mean, how do I give it a file to process from the command line? Could you please suggest something?

I tried this, but I am getting errors:

awk -f check_26nov_pred.awk input.csv
awk: check_26nov_pred.awk:1: awk -F', *' -v shortest=1 '
awk: check_26nov_pred.awk:1:       ^ invalid char ''' in expression

Chubler_XL · November 25, 2013, 11:17pm

If you save the script as a .awk file, you don't need the first line (awk -F', *' -v shortest=1 '), and the last line (}' infile) should be replaced with a plain }

Call it like this:

$ awk -F ', *' -v shortest=1 -f check_26nov_pred.awk input.csv

rushadrena · November 25, 2013, 11:54pm

I modified it as you suggested, so your code now starts with

function follow(i,j,v,b,r)
{
  b=999
  if(N[i] == "Q") return C[i]"<-Q = "P[i];
.........

And ends with

    if (shortest)
       for(i=1;i<=NR;i++) {
             if(C[i] in A) {
                 printf "%s: %s\n",C[i],A[C[i]];
                 delete A[C[i]];
            }
        }
}

When I run it like this:
awk -F ', *' -v shortest=0 -f check_26nov_pred.awk prediction/Sample_input.csv

on a sample file, Sample_input.csv, containing:

M7,    Q,    P,
M54,    M7,    Pl,
M213,    M54,    E,
M206,    M54,    E,
M194,    M54,    E,
M53,    M7,    Pl,
M186,    M53,    Pl,
M194,    M53,    Pl,
M187,    M53,    E,
M204,    M53,    E,
M201,    M53,    E,
M202,    M53,    E,
M179,    M53,    E,
M13,    M53,    E,
M157,    M53,    E,
M173,    M53,    E,
M205,    M53,    E,
M195,    M53,    E,

I get empty fields as output:

M7:
M54:
M213:
M206:
M194:
M53:
M186:
M194:
M187:
M204:
M201:
M202:
M179:
M13:
M157:
M173:
M205:
M195:

Chubler_XL · November 26, 2013, 3:58pm

Seems to be working fine for me. If you are running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.

Edit: I think those are tab characters after the commas in your data file. Try running the script like this:

$ awk -F ',\\s*' -v shortest=0 -f check_26nov_pred.awk prediction/Sample_input.csv
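
To see why the separator matters, here is a small Python sketch that uses re.split as a stand-in for awk's field splitting (an analogy only, not awk itself). The sample line with a tab after each comma is an assumption based on the guess above. With ", *" the tabs stay attached to the fields, so the parent never compares equal to "Q" and no path is found.

```python
import re

line = "M7,\tQ,\tP,"  # assumed input: a tab after each comma

# ", *" swallows only spaces, so each tab stays glued to the next field.
loose = re.split(r", *", line)

# ",\s*" swallows any whitespace after the comma, tabs included.
clean = re.split(r",\s*", line)

print(loose)                            # ['M7', '\tQ', '\tP', '']
print(clean)                            # ['M7', 'Q', 'P', '']
print(loose[1] == "Q", clean[1] == "Q") # False True
```

Note that \s in an awk regular expression is a GNU awk extension; on other awks a bracket expression such as ',[ \t]*' would presumably be the more portable spelling.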