Awk based script to find the median of all individual columns in a data file

ks_reddy · June 8, 2012, 9:49am

Hi All,

I have some data like below.

Step1,Param1,Param2,Param3
1,2,3,4
2,3,4,5
2,4,5,6
3,0,1,2
3,0,0,0
3,2,1,3

........

so on

Where I need to find the median(arithmetic) of each column from Param1...to..Param3 for each set of Step1 values.
(Sort each specific column, if the total entries is even; my mean is sum of (Middle 2 values)/2 else the mean is the middle value itself.)

I got one script like below by searching in google

echo -e '6\n2\n4\n3\n1' | sort -n | awk '{arr[NR]=$1}
   END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}'

But this works on a single column of data.

I want someone to modify this or suggest some other code to make it work on all columns.

I want something like below

awk -F, -v VAR="$i" '$1==VAR{Some Condition}' input.csv

where I am scanning variable i (each unique value from column1) using for loop.

The Expected output is :

Median_of_Col2,Median_of_Col3,Median_of_Col4 etc.. (practically i have many columns in my input)

The output with the above input is

2,3,4
3.5,4.5,5.5
0,1,2

Thanks
Sidda

binlib · June 8, 2012, 12:34pm

With gawk version 4:

gawk4 -F, '{
  for (i = 1; i <= NF; ++i)
    a[NR] = $i
}
END {
  for (i = 1; i <= NF; ++i) {
    asort(a)
    printf(" %.1f", NR%2? a[(NR+1)/2] : (a[NR/2] + a[NR/2+1])/2)
  }
  print ""
}'

Corona688 · June 8, 2012, 12:59pm

The data must be put into order some way in order to get the median, and the sort command can only change the order of lines, it can't swap bits of lines around for you.

awk, unless you're in Linux, doesn't have an easy way to sort data inside itself.

Perhaps perl would be a better solution here.

$ cat data

Step1,Param1,Param2,Param3
1,2,3,4
2,3,4,5
2,4,5,6
3,0,1,2
3,0,0,0
3,2,1,3

$ cat median.pl

#!/usr/bin/perl

@arr=split(/,/, $header=<STDIN>);       # Count columns
my $dat=[], $row=0, $cols=scalar(@arr), $median, $prefix="";

while(@arr=split(/,/, <STDIN>))
{
        for($n=0; $n<=scalar(@arr); $n++)
        {       $dat[$n][$row]=$arr[$n];        }

        $row++;
}

# Sort each column separately, then get the median and print
for($n=0; $n<$cols; $n++)
{
        @l=sort {$a <=> $b} @{$dat[$n]};        # sort $dat[$n] into @l

        if(($row%2)==0) # Even number of rows
        {
                printf("%s%s", $prefix, ($l[($row/2)-1]+$l[($row/2)])/2);
        }
        else
        {       printf("%s%s", $prefix, $l[$row/2]);    }
        $prefix=",";
}

printf("\n");

exit 0;

$ ./median.pl < data

2.5,2,2,3.5

$

ks_reddy · June 8, 2012, 1:15pm

Hi binlib,

Thanks for your quick reply.
I am getting some syntax error in the code you code using gawk.
I am unable to figure it after few changes also.
Can you please try to look into it once more and correct the error.

Thanks
Sid

---------- Post updated at 10:45 PM ---------- Previous update was at 10:43 PM ----------

Hi Corona,

This perl script is not giving desired output.
I try to guess the median values how it calculated, but no clue.
What I need is for each unique value in Column1( Step) I need the median values for the remaining columns printed side by side.
I have given my required output in my post.
Can you look into your code again.

Thanks

Corona688 · June 8, 2012, 1:54pm

It works absolutely fine for me using the data you posted, as you can see from what I posted unless you disagree with the numbers it gives. You only gave the format of the data you wanted, you didn't show what values you should get from your input, so I've got no way to check the numbers until you do.

Please show the manner in which you used it, the manner in which it 'did not work', and an actual sample of the data it doesn't work with.

So duplicates do not count? That's different than just a median of values...

ks_reddy · June 9, 2012, 7:19am

Hi Corona,

Thanks for the reply.
For the input of
Step1,Param1,Param2,Param3 1,2,3,4 2,3,4,5 2,4,5,6 3,0,1,2 3,0,0,0 3,2,1,3
The output is computed like below.
In 1st column(Step1), for the data sample '1' the median for 2,3,4 columns are 2,3,4 itself (as the columns has only one value(odd number).
Similarly for the data sample '2' (repeated twice) the median is (3+4)/2, (4+5)/2, (5+6)/2 which is nothing but 3.5,4.5,5.5.
Similarly for the data sample '3' (repeated thrice) the median for 2,3,4 columns are 0,1,2 as the total number samples in each column(2,3,4) are 3. ( So when we sort each column the values are like this {0,0,2}, {0,1,1}, {0,2,3} so the median for each set is 0,1,2 respectively).
So the final output will be
2,3,4 # Median of columns 2,3,4 when 1st col rows=='1' 3.5,4.5,5.5 #Median of columns 2,3,4 when 1st col rows =='2' 0,1,2 #Median of columns 2,3,4 when 1st col rows=='3'
# So my result is nothing for median of columns 2,3,4 for each unique set of row values in column1.

I hope the median computation is clearly explained now.
Have a look and let me know.

Regards
Sid

---------- Post updated at 04:49 PM ---------- Previous update was at 05:38 AM ----------

Hi Binlib,

Your code is perfect except for two missing ';'s (semicolons),
one after the statement " a[i][NR] = $i "
and one more after the statement " asort(a[i]) "

Thanks a lot for your effort.

Sidda