Compute average ignoring outliers of different segments within a dat file using awk

malandisa · September 18, 2014, 11:54am

I have data files that look like this, say data.txt

0.00833 6.34
0.00833 6.95
0.00833 7.08
0.00833 8.07
0.00833 8.12
0.00833 8.26
0.00833 8.70
0.00833 9.36
0.01667 20.53
0.01667 6.35
0.01667 6.94
0.01667 7.07
0.01667 8.06
0.01667 8.10
0.01667 8.25
0.01667 8.71
0.01667 9.31
0.02500 20.19
0.02500 6.35
0.02500 6.92
0.02500 7.07
0.02500 8.08
0.02500 8.09
0.02500 8.24
0.02500 8.70
0.02500 9.26
0.03333 19.89
0.03333 6.33
0.03333 6.90
0.03333 7.07
0.03333 8.07
0.03333 8.09
0.03333 8.22
0.03333 8.70
0.03333 9.22
0.04167 19.65
0.04167 6.34
0.04167 6.87
0.04167 7.07
0.04167 8.03
0.04167 8.08
0.04167 8.19
0.04167 8.69
0.04167 9.19

As you can see the data has various segments based on column 1. I use the following code to compute the mean of each segment and output the value of column 1 for that segment and the mean of the values of column 2 and some other things just so I can check am doing the right thing.

awk '{if($1<0)$1=0}
{
    sum[$1]+=$2
    cnt[$1]++
}
END {
#     print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum, cnt, sum/cnt

}' data.txt  | sort -n -k1 > avgFile.txt

Unfortunately as you can see, my data has outliers in these segments. I need to remove these outliers before I compute the mean so that they don't mess up my results. I am using awk to process my data.

This is what I have been able to do so far, if I get one segment to a file say temp.txt I am able to use the following code to remove the outlier in that segment

awk 'BEGIN{CNT=0} {ROW[CNT]=$0;DATA[CNT]=$2; 
    TOTAL+=$2;CNT+=1;} END{for (i = 0;i < NR; i++){if ((sqrt((DATA-(TOTAL/NR))^2))<((TOTAL/NR)*30/100)) 
    {print ROW ;}}}' temp.txt

But I need to able to do this within the code that computes the average so that my mean value excludes this outlier.

Any assistance will be highly appreciated.

Malandisa

jim_mcnamara · September 18, 2014, 12:00pm

Do you mean use standard deviation to identify "outliers"? That is usually the accepted approach - 3 stddev from the mean.

malandisa · September 18, 2014, 12:08pm

Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.

shamrock · September 18, 2014, 2:12pm

This can be done with associative arrays in awk i.e. if you are familiar with them and if you're not then I'd suggest reading up on them...

---------- Post updated at 01:12 PM ---------- Previous update was at 12:22 PM ----------

Here is how you'd go about eliminating outliers from your data in order to compute the mean...

awk '{
    cnt[$1]++
    val[$1] = (val[$1] ? val[$1] "," $2 : $2)
    sum[$1] += $2
} END {
    for (i in val) {
        n = split(val, a, ",")
        for (k=1; k<=n; k++)
            if (!((sqrt((a[k] - (sum/cnt))^2)) < ((sum / cnt) * (30/100)))) {
                cnt--
                sum -= val
            }
    }
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum, cnt, sum / cnt | "sort -nk1"
}' data.txt

malandisa · September 19, 2014, 2:02am

Thank you for your suggestion, but interestingly it suggestion for a small file, when I run this on a big file it complains about attempted division by 0. Let me attach a large file and you see what I am talking about. However I am very grateful for your response, it gives me a starting point.

malandisa · September 19, 2014, 6:06am

Shamrock please help me learn something. What does this line

val[$1] = (val[$1] ? val[$1] "," $2 : $2)

do exactly in this code! Am sure once I understand this, I would be able to see where the problem could be

---------- Post updated at 05:58 AM ---------- Previous update was at 03:04 AM ----------

Okay I understand that this creates a string with the values in column 2 separated by a comma. so that it is latter split

---------- Post updated at 06:06 AM ---------- Previous update was at 05:58 AM ----------

and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?

sum -= val

sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.

shamrock · September 19, 2014, 4:33pm

malandisa:

and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
sum -= val
sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.

Good catch...it should be sum -= a[k] and I did it run the modified code on "temp.txt" and it gave me no such errors...

malandisa · September 19, 2014, 5:58pm

Thank you Shamrock, It works right now. This is what I needed and your advice to read up truely sent me reading a lot about arrays in awk and I must say that I have learned a lot from this.

Thanks!