plz help me with this, I want to to extract the duplicate rows (column 1) in a file which at least repeat 4 times. then I want to summarize them by getting the max , mean, median and min. The file is sorted by column 1, all the repeated rows appear together.
If number of elements is odd, median is middle one , eg 4th element among 7 sorted numbers ... element number (n+1)/2
If number of elements is even, it is the average of middle 2, eg. average of 4th and 5th element for set of 8 sorted numbers...average of n/2 + 1 and n/2
This works good for all rows...but how do I print rows only which repeat at least 4 times?
I tried the following modification but it prints out gibberish..
sort file -k1,1 -k2,2n | awk ' {nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2} END { for (key in a) { split(a[key], b, "@") len = length(b) for (i=1;i<=len;i++) { avg = sum[key] / nbr[key]; if (nbr[key]%2) { median = b[(nbr[key]+1)/2] } else { median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2 } }
if (len >3) {
printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
}
} } '
Also my original files are quite large..eg 500mb each, each there a way to speed this up? Right now it takes forever to run
---------- Post updated at 11:54 AM ---------- Previous update was at 11:10 AM ----------
Update..this seems to run fine... but if anything can be done to speed up..please let me know..
sort testmed.txt -k1,1 -k2,2n | awk '
{nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}
END {
for (key in a) {
split(a[key], b, "@")
len = length(b)
for (i=1;i<=len;i++) {
avg = sum[key] / nbr[key];
if (nbr[key]%2) {
median = b[(nbr[key]+1)/2]
} else {
median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
}
}
if ( len > 3)
{
printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
}
}
}
'