Hello UNIX experts,
I'm stumped finding a method to reformat a column. Input file is a two column tab-delimited file. Essentially, for every term that appears in column 2, I would like to summarize whether that term appears for every entry in column 1. In other words, make a header for each term that appears in column 2, and then mark whether that appears for each entry in column 1. Somewhat complicated to explain, so I'll give an example:
$ cat input.txt
sample1 workflow1;workflow2;workflow3;workflow4;workflow5
sample2 workflow2;workflow3;workflow1;workflow5
sample3 workflow3;workflow1;workflow4;workflow5;workflow2
sample4 workflow3;workflow1;workflow4;workflow5;workflow2
sample5 workflow3;workflow1;workflow4;workflow5;workflow2
sample6 workflow1
sample7 workflow8
$ cat output.txt
Sample workflow1 workflow2 workflow3 workflow4 workflow5 workflow8
sample1 X X X X X NA
sample2 X X X NA X NA
sample3 X X X X X NA
sample4 X X X X X NA
sample5 X X X X X NA
sample6 X NA NA NA NA NA
sample7 NA NA NA NA NA X
# note: X is "yes, this workflow exists for this sample"
The values "workflow1", "workflow2" etc...may contain special characters such as underscores, hyphens, colons etc... (ie. workflow-1, work:flow1), so matches must be exact. I wrote a working bash code, but it is terribly slow and poorly written. Right now, a 2,000 line file takes 20 minutes because my code sucks and runs through loops! For reference, my strategy:
# for each sample and workflow in $file, check whether workflow exists by through grep, then output a file with X or NA.
for SAMPLE in `cat $file.samples`; # column 1 from $file
do
echo "Preparing $SAMPLE"
for WORKFLOW in `cat $file.workflows`; # all the unique terms from column 2
do
CHECK=`grep -w "$SAMPLE" $file | grep -oP ";${WORKFLOW};" | sed "s/\;//g"`;
if [ "${CHECK}" == "${WORKFLOW}" ]; then
echo "X" > $file.$SAMPLE.$WORKFLOW.reply; else echo "NA" >$file.$SAMPLE.$WORKFLOW.reply
fi
done;
echo $SAMPLE > $file.sample;
paste $file.sample $file.$SAMPLE.*.reply >> $file.sum.txt;
rm $file.sample $file.$SAMPLE.*.reply;
done
Many many thanks in advance!!
Torch