Sorting a .csv using awk or other

dunnybocter · April 18, 2011, 3:03pm

Hello all, I am new here and *relatively* new to Unix. I have a bit of an emergency. I have a three column file that I need to sort:

sample name, miRNA, reads per million (RPM)

There are multiple samples, and for each sample name there are multiple miRNAs and associated RPMs. Some of these miRNAs overlap between samples, but not all, so here is what the input file looks like

Sample 1, mirna1,RPM
Sample 1, mirna3,RPM
Sample 1, mirna4, RPM
Sample 2, mirna2,RPM
Sample2, mirna3, RPM
Sample2, mirna4, RPM

I need an output file that looks like this
miRNA1 miRNA2 .... miRNAx
Sample 1 RPM RPM
Sample 2 RPM RPM
Sample 3 RPM RPM
etc

Basically, each row should be a sample and each column should have the RPM corresponding the the miRNA for that sample. I also need to put a zero if there is no value for an miRNA in a specific sample. The output should be csv as well.

Any ideas or help would be greatly appreciated. I am trying to use awk but I'm lost and very much in a rush. MOST SINCERE THANKS IN ADVANCE!

rbatte1 · April 19, 2011, 12:52pm

Hello dunnyboctor

You could try with this shell script, but it may be rather slow if you have very large files. It might be rather awkward if the format is a bit flexible. I will have a try with a simple one and we can vary it if the input really is as you describe:-

#!/bin/ksh
inputfile=/tmp/myfile.input
outputfile=/tmp/myfile.output

for sample in `sort -n +1 $inputfile|cut -f1 -d","`
do
   echo "\n${sample} \c"              # Start new output record
   for mirna in `grep "^${sample}," $inputfile | cut -f2 -d "," | sort -n`
   do
      line=`grep "^${sample}, ${mirna}," $inputfile"`
      rpm="${line##*,}"                  # Get last field based on comma separator
      echo "${mirna} \c"                  # Append but do not end output record
   done
done >> $outputfile

The other option, if you know the number of "sample"s and the number of "mirna"s then you write a loop similar to the following:-

#!/bin/ksh
inputfile=/tmp/myfile.input
outputfile=/tmp/myfile.output
typeset -i max_sample max_mirna RPM
read max_sample?"How many samples? "
read max_mirna?"How many mirna? "

{
echo "Sample, \c"
mirna=1
while [ $mirna -le $max_mirna ]
do
   echo "miRNA${mirna}, \c"           # Build up title line
   ((mirna=$mirna+1))
done

# Now process the data records
sample=1
while [ $sample -le $max_sample ]
do
   echo "\n${sample},\c"              # Start new line
   mirna=1
   while [ $mirna -le $max_mirna ]
   do
      line=`grep "^Sample${sample}, mirna${mirna}, " $inputfile`
      RPM="${line##*,}"               # Get last field only
      echo "${RPM},\c"
      ((mirna=$mirna+1))
   done
   ((sample=$sample+1))
done ; echo                           # Complete last record
} > $outputfile

Give those a try and see if that helps. They may be a bit flakey if the input format varies, so do let us know if I need to handle that a bit better.

Robin
Liverpool/Blackburn
UK

shamrock · April 19, 2011, 3:44pm

Give this awk script a go...

awk -F, '{
   v = $2
   gsub(/[aA-zZ ]/, "", v)
   if (!x[$1])
      x[$1] = $1
   n = split(x[$1], a, " ") 
   a[v+2] = $NF
   x[$1] = ""
   for (i=1; i<=v+2; i++)
      x[$1] = sprintf("%s%s ", x[$1] ? x[$1] : "", a ? a : 0)
} END {for (i in x) print x}' file

dunnybocter · April 19, 2011, 4:04pm

Thank you so much for your replies! I found temporary route using an alternate program (pedsys and then some additional work in excel) but I will definitely start with these ideas and see where I get because in the future we will definitely need a quicker solution than what I resorted to yesterday! Besides that, I need to practice and play to learn
Again, THANK YOU!!

shamrock · April 19, 2011, 4:18pm

Did you try the awk script yet...it does what you are looking for.

dunnybocter · April 19, 2011, 4:56pm

Not yet, but likely tomorrow. Since I am most comfortable with awk I will probably start there. Thx again!

shamrock · April 20, 2011, 12:29pm

I updated the awk script also so that it now outputs in csv format instead of ssv...so give it a try.

awk -F, '{
   v = $2
   gsub(/[aA-zZ ]/, "", v)
   if (!x[$1])
      x[$1] = $1
   n = split(x[$1], a, " ")
   a[v+2] = $NF+0
   x[$1] = ""
   v = (v < n ? n : v+2)
   for (i=1; i<=v; i++)
      x[$1] = sprintf("%s%s%s", x[$1] ? x[$1] : "", a ? a : 0, i < v ? " " : "")
} END {
   for (i in x) {
      gsub(" ", ",", x)
      sub(",", " ", x)
      print x
   }
}' file