the smallest number from 90% of highest numbers from all numbers in file

Apfik · May 21, 2011, 5:05pm

Hello All,
I am having a problem finding the smallest number among the highest 90% of all the numbers in a file. I have a file with thousands of lines and hundreds of columns.
I am mainly familiar with bash, but I am open to any suggestion which will lead to a solution.

To explain it differently: I have, for example, 1000 numbers between 0 and 10000. The results could be:

90% of numbers are bigger than 1000
80% of numbers are bigger than 2342
70% of numbers are bigger than 5674
etc.

I am looking for numbers like 1000, 2342, 5674 as in this example.

I am sure there is a statistical method for this, but I cannot remember or find what it is called. If I knew which method to use, I might find a way to calculate it too.

Thank you for your help.

Chirel · May 21, 2011, 6:09pm

Hi,

Could you please give us an input file and an example of the desired output?

Apfik · May 21, 2011, 6:55pm

Hi

The INPUT can look like this, but much bigger (in columns and rows). The numbers are not sorted in any way (although it may look like that here).

0.35156582 0.36767924 0.40942771 1.15580244 1.20877668
1.21842761 1.27427217 1.41896056 1.16207427 1.21533599
1.41774799 1.22634608 1.28255355 1.42818227 1.19181428
2.08513847 1.78348512 1.86522813 2.07701713 1.78747556

OUTPUT
Here there are 20 numbers and, for example, I would like to have 5 bands. Each band will hold 20% of the numbers, meaning:

100% of numbers are bigger than 0 (or the number sought)
80% of numbers are bigger than (number sought)
60% of numbers are bigger than (number sought)
40% of numbers are bigger than (number sought)
20% of numbers are bigger than (number sought)

I hope it is clearer now.
I am slowly finding a way around it, but it is not very elegant and I am creating a lot of rubbish along the way. I have to do this for tens of files with 50000 numbers in each. That is the reason why I am looking for an elegant and quick solution.

Thank you

Perderabo · May 21, 2011, 7:56pm

I don't see a quick solution. You need to put the numbers in a list, sort them, count them, then see what is at each 10% of the list.
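
The steps described above (collect, sort, count, read off each mark) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `read_numbers` and `band_thresholds` are made up for this sketch, the input is assumed to be whitespace-separated numbers, and each threshold is taken as the lowest element of its band, so a value equal to the threshold counts toward the percentage.

```python
def read_numbers(text):
    """Flatten whitespace-separated text (any number of rows/columns) into floats."""
    return [float(token) for token in text.split()]


def band_thresholds(values, bands=10):
    """Return (percent, value) pairs: about `percent` of the values are >= `value`.

    Hypothetical helper: sorts the data and reads off the first element
    of each band.
    """
    ordered = sorted(values)
    count = len(ordered)
    pairs = []
    for band in range(bands):
        index = band * count // bands            # first element of this band
        percent = 100 * (bands - band) // bands  # share of values at or above it
        pairs.append((percent, ordered[index]))
    return pairs


# The 20 sample values posted earlier in the thread.
SAMPLE = """
0.35156582 0.36767924 0.40942771 1.15580244 1.20877668
1.21842761 1.27427217 1.41896056 1.16207427 1.21533599
1.41774799 1.22634608 1.28255355 1.42818227 1.19181428
2.08513847 1.78348512 1.86522813 2.07701713 1.78747556
"""

for percent, value in band_thresholds(read_numbers(SAMPLE), bands=5):
    print(f"{percent}% of numbers are at least {value}")
```

For a real file, the text passed to `read_numbers` would simply be the file's contents. Sorting 50000 numbers this way should take well under a second, so the sort itself is unlikely to be the bottleneck.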

ananthap · May 22, 2011, 8:16am

For a real quick solution, I would
(1) Put the data one on a line.
(2) Sort.
(3) Pass it to 'awk' with the required percentile value as a parameter.
(4) Use pattern $1 < parameter.
(5) For each record make it the minimum if needed.
(6) On END, print the value.

bartus11 · May 22, 2011, 10:53am

Try this script:

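#!/usr/bin/perl
use strict;
use warnings;

# Read every whitespace-separated number from the file named on the command line.
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my @x;
while (<$in>) {
  push @x, split;   # splits on any run of whitespace; blank lines add nothing
}
close $in;

# Sort numerically, then print the first value of each 20% band.
@x = sort { $a <=> $b } @x;
my $n = @x;
for my $band (0 .. 4) {
  my $i = int($band * $n / 5);   # integer index, even if $n is not a multiple of 5
  printf "%d%% of numbers is bigger than %s\n", 100 - $band * 20, $x[$i];
}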

Run it like this: ./script.pl data_file

Perderabo · May 22, 2011, 11:29am

The OP does not know what the limits are... he or she needs to find them. Consider:

1 2 3 4 7 8 9
1 2 3 6 7 8 9

Now find the middle point. In the first list 4 is the midpoint, but in the second list it's 6. You don't know 4 or 6 ahead of time. The midpoint is the 50% point. Now imagine a much longer list where you need to find the data element at the 10%, 20%, 30%...90% points in the list.
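
As a small worked example of the two lists above, here is a Python sketch that looks up the element at a given percentage of the sorted list. The function name `element_at_percent` is hypothetical, and rounding the position down to a whole index is an assumption of this sketch, not something stated in the thread.

```python
def element_at_percent(values, percent):
    """Return the element found `percent` of the way through the sorted values."""
    ordered = sorted(values)
    index = percent * len(ordered) // 100   # integer position in the sorted list
    index = min(index, len(ordered) - 1)    # keep percent=100 inside the list
    return ordered[index]


first = [1, 2, 3, 4, 7, 8, 9]
second = [1, 2, 3, 6, 7, 8, 9]

print(element_at_percent(first, 50))   # 4
print(element_at_percent(second, 50))  # 6
```

The same call with 10, 20, 30 and so on gives the other points, and none of the values has to be known ahead of time.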

bartus11 · May 22, 2011, 12:00pm

perderabo:

The OP does not know what the limits are... he or she needs to find them. Consider:
1 2 3 4 7 8 9
1 2 3 6 7 8 9
Now find the middle point. In the first list 4 is the midpoint, but in the second list it's 6. You don't know 4 or 6 ahead of time. The midpoint is the 50% point. Now imagine a much longer list where you need to find the data element at the 10%, 20%, 30%...90% points in the list.

I hope that is a reply to ananthap's post, because if it is directed at me, then I completely don't get it.

Perderabo · May 22, 2011, 4:38pm

I apologise, bartus11. I'm just learning perl and I did not realise how close yours comes to being correct. Running it though, it does seem to need a little work.

$ ./sc nums
100% of numbers is bigger than 0.04942771
80% of numbers is bigger than 1.15580244
60% of numbers is bigger than 1.22634608
40% of numbers is bigger than 1.42818227
20% of numbers is bigger than 3.07701713
$
$
$
$
$ grep 0.04942771 nums
0.31556582 0.36677924 0.04942771 1.15580244 1.02877668
$

100% of the numbers are larger than 0.04942771; however one of the numbers is 0.04942771. I have to say that raises a flag. Maybe you have an off by 1 situation?
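
One way to check for this kind of off-by-one is to count how many values are strictly greater than a reported threshold and compare that with how many are greater than or equal to it. The Python sketch below does this for the smallest value. It assumes the 20 sample values from the original post rather than the nums file used above, whose full contents are not shown, and both function names are made up for illustration.

```python
def percent_strictly_greater(values, threshold):
    """Percentage of values that are strictly greater than the threshold."""
    above = sum(1 for v in values if v > threshold)
    return 100 * above / len(values)


def percent_at_least(values, threshold):
    """Percentage of values that are greater than or equal to the threshold."""
    at_or_above = sum(1 for v in values if v >= threshold)
    return 100 * at_or_above / len(values)


# Assumption: the 20 sample values from the original post.
values = [
    0.35156582, 0.36767924, 0.40942771, 1.15580244, 1.20877668,
    1.21842761, 1.27427217, 1.41896056, 1.16207427, 1.21533599,
    1.41774799, 1.22634608, 1.28255355, 1.42818227, 1.19181428,
    2.08513847, 1.78348512, 1.86522813, 2.07701713, 1.78747556,
]

lowest = min(values)
print(percent_strictly_greater(values, lowest))  # 95.0
print(percent_at_least(values, lowest))          # 100.0
```

So a threshold taken from the data itself only gives the round percentage if "bigger than" is read as "at least".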

bartus11 · May 22, 2011, 4:45pm

Well, maybe I should explain how I understood the OP's request: get all the numbers into one big set, and in this set find the numbers from which 100%, 80%, etc. of the remaining numbers are greater. So it is not calculating the percentages by lines. It reads all of the numbers in all of the lines first, and then does the calculation.

vgersh99 · May 22, 2011, 5:23pm

I think the OP is after calculating the percentiles.
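
If percentiles are indeed what is wanted, Python's standard library has `statistics.quantiles` (Python 3.8 and later), which returns the cut points between equal-sized groups. The sketch below is only an illustration, assuming the 20 sample values from the original post and five bands. Note that this function interpolates between neighbouring values, so its cut points will generally differ a little from values picked straight out of the sorted list.

```python
import statistics

# Assumption: the 20 sample values from the original post.
values = [
    0.35156582, 0.36767924, 0.40942771, 1.15580244, 1.20877668,
    1.21842761, 1.27427217, 1.41896056, 1.16207427, 1.21533599,
    1.41774799, 1.22634608, 1.28255355, 1.42818227, 1.19181428,
    2.08513847, 1.78348512, 1.86522813, 2.07701713, 1.78747556,
]

# n=5 asks for the 4 cut points that separate 5 equal-sized bands.
# method="inclusive" treats the data as the whole population, so every
# cut point stays between the smallest and largest value.
cut_points = statistics.quantiles(values, n=5, method="inclusive")

for share, cut in zip((80, 60, 40, 20), cut_points):
    print(f"roughly {share}% of numbers are above {cut:.8f}")
```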

ananthap · May 22, 2011, 10:54pm

The OP did say 1000 numbers. Anyway, wc or NR in awk will give the count in the file.