As you can see the data has various segments based on column 1. I use the following code to compute the mean of each segment and output the value of column 1 for that segment and the mean of the values of column 2 and some other things just so I can check am doing the right thing.
awk '{if($1<0)$1=0}
{
sum[$1]+=$2
cnt[$1]++
}
END {
# print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
for (i in sum)
printf "%8.5f %6.2f %6d %6.3f\n", i, sum, cnt, sum/cnt
}' data.txt | sort -n -k1 > avgFile.txt
Unfortunately as you can see, my data has outliers in these segments. I need to remove these outliers before I compute the mean so that they don't mess up my results. I am using awk to process my data.
This is what I have been able to do so far, if I get one segment to a file say temp.txt I am able to use the following code to remove the outlier in that segment
Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.
This can be done with associative arrays in awk i.e. if you are familiar with them and if you're not then I'd suggest reading up on them...
---------- Post updated at 01:12 PM ---------- Previous update was at 12:22 PM ----------
Here is how you'd go about eliminating outliers from your data in order to compute the mean...
awk '{
cnt[$1]++
val[$1] = (val[$1] ? val[$1] "," $2 : $2)
sum[$1] += $2
} END {
for (i in val) {
n = split(val, a, ",")
for (k=1; k<=n; k++)
if (!((sqrt((a[k] - (sum/cnt))^2)) < ((sum / cnt) * (30/100)))) {
cnt--
sum -= val
}
}
for (i in sum)
printf "%8.5f %6.2f %6d %6.3f\n", i, sum, cnt, sum / cnt | "sort -nk1"
}' data.txt
Thank you for your suggestion, but interestingly it suggestion for a small file, when I run this on a big file it complains about attempted division by 0. Let me attach a large file and you see what I am talking about. However I am very grateful for your response, it gives me a starting point.
Shamrock please help me learn something. What does this line
val[$1] = (val[$1] ? val[$1] "," $2 : $2)
do exactly in this code! Am sure once I understand this, I would be able to see where the problem could be
---------- Post updated at 05:58 AM ---------- Previous update was at 03:04 AM ----------
Okay I understand that this creates a string with the values in column 2 separated by a comma. so that it is latter split
---------- Post updated at 06:06 AM ---------- Previous update was at 05:58 AM ----------
and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
sum -= val
sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.
Thank you Shamrock, It works right now. This is what I needed and your advice to read up truely sent me reading a lot about arrays in awk and I must say that I have learned a lot from this.