Help reformatting column

torchij · February 5, 2017, 2:30pm

Hello UNIX experts,

I'm stumped finding a method to reformat a column. Input file is a two column tab-delimited file. Essentially, for every term that appears in column 2, I would like to summarize whether that term appears for every entry in column 1. In other words, make a header for each term that appears in column 2, and then mark whether that appears for each entry in column 1. Somewhat complicated to explain, so I'll give an example:

$ cat input.txt
sample1	workflow1;workflow2;workflow3;workflow4;workflow5
sample2	workflow2;workflow3;workflow1;workflow5
sample3	workflow3;workflow1;workflow4;workflow5;workflow2
sample4	workflow3;workflow1;workflow4;workflow5;workflow2
sample5	workflow3;workflow1;workflow4;workflow5;workflow2
sample6	workflow1
sample7	workflow8

$ cat output.txt
Sample	workflow1	workflow2	workflow3	workflow4	workflow5	workflow8	
sample1	X	X	X	X	X	NA
sample2	X	X	X	NA	X	NA
sample3	X	X	X	X	X	NA
sample4	X	X	X	X	X	NA
sample5	X	X	X	X	X	NA
sample6	X	NA	NA	NA	NA	NA
sample7	NA	NA	NA	NA	NA	X

# note: X is "yes, this workflow exists for this sample"

The values "workflow1", "workflow2" etc...may contain special characters such as underscores, hyphens, colons etc... (ie. workflow-1, work:flow1), so matches must be exact. I wrote a working bash code, but it is terribly slow and poorly written. Right now, a 2,000 line file takes 20 minutes because my code sucks and runs through loops! For reference, my strategy:

# for each sample and workflow in $file, check whether workflow exists by through grep, then output a file with X or NA.
for SAMPLE in `cat $file.samples`; # column 1 from $file
 do
  echo "Preparing $SAMPLE"
  for WORKFLOW in `cat $file.workflows`; # all the unique terms from column 2
   do
    CHECK=`grep -w "$SAMPLE" $file | grep -oP ";${WORKFLOW};" | sed "s/\;//g"`;
    if [ "${CHECK}" == "${WORKFLOW}" ]; then
     echo "X" > $file.$SAMPLE.$WORKFLOW.reply; else echo "NA" >$file.$SAMPLE.$WORKFLOW.reply
    fi
   done;
  echo $SAMPLE > $file.sample;
  paste $file.sample $file.$SAMPLE.*.reply >> $file.sum.txt;
  rm $file.sample $file.$SAMPLE.*.reply;
 done

Many many thanks in advance!!

Torch

Chubler_XL · February 5, 2017, 3:11pm

Try this awk script:

awk '
{
  values=split($2,V,";")
  sample[NR]=$1
  for(i=1;i<=values;i++) {
      if(!(V in CI)) {
         CI[V]
         CH[++col]=V
      }
      CNT[NR,V]
  }
}
END {
   printf "Sample"
   for(i=1;i in CH;i++) printf "\t%s", CH
   for(ln=1;ln<=NR;ln++) {
      printf "\n%s",sample[ln]
      for(i=1;i in CH;i++)
         printf "\t%s", ln SUBSEP CH in CNT ? "X" : "NA"
   }
   printf "\n"
}' input.txt

torchij · February 5, 2017, 3:29pm

chubler_xl:

Try this awk script:

awk '
{
  values=split($2,V,";")
  sample[NR]=$1
  for(i=1;i<=values;i++) {
   if(!(V in CI)) {
   CI[V]
   CH[++col]=V
   }
   CNT[NR,V]
  }
}
END {
   printf "Sample"
   for(i=1;i in CH;i++) printf "\t%s", CH
   for(ln=1;ln<=NR;ln++) {
   printf "\n%s",sample[ln]
   for(i=1;i in CH;i++)
   printf "\t%s", ln SUBSEP CH in CNT ? "X" : "NA"
   }
   printf "\n"
}' input.txt

A beautiful solution! Appears to work as expected, and is ~2000x faster then grep'ing to disk a million times.

Thanks very much!