awk - calculation of probability density

Hi all!

I have the following problem: I would like to use awk to calculate the probability that a pair of numbers x and y appears. In other words, how frequently does each pair occur?

For a single integer x ranging, for example, from 1 to 100, the awk one-liner has the form:

awk 'BEGIN{for(i=1;i<=100;i++) h[i]=0}{h[$1]+=1}END{for(i=1;i<=100;i++) print i, h[i]/NR}' datafile

where datafile contains the values of x:

#x
2
65
100
...

My question is how to extend the above awk one-liner to a pair of numbers x and y. In this case the datafile looks as follows:

#x   #y
23     15
35     1
23     15
...


Thanks in advance.

something like this:

#  cat infile
23 15
35 1
23 15

#  awk '{h[$1" "$2]++}END{for (i in h){print i,h[i]/NR}}' infile
35 1 0.333333
23 15 0.666667

HTH
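One thing to watch for: the original datafile starts with a "#x   #y" header line, and NR counts that line too, so the frequencies would come out slightly too small. Below is a minimal sketch of one way around it, assuming header or comment lines start with "#" and that dividing by the number of data rows is what is wanted. The counter n and the variable name result are only illustrative, and the output is piped through sort because the order of "for (k in h)" is not specified.

```shell
# Sample data in the thread's format, including the header line.
printf '#x   #y\n23     15\n35     1\n23     15\n' > infile

result=$(awk '
  /^#/ || NF < 2 { next }      # skip header/comment lines and incomplete rows
  { h[$1 " " $2]++; n++ }      # key is the pair "x y"; n counts data rows only
  END { for (k in h) print k, h[k] / n }
' infile | sort -k1,1n -k2,2n)

printf '%s\n' "$result"
```

With the three data rows above, the pair 23 15 gets 2/3 and the pair 35 1 gets 1/3, the same as in the output shown earlier.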

How can your one-liner be extended to the case where the infile contains non-integer numbers?

I was trying with this:

awk '{h[int($1/10)" "int($2/10)]++}END{for (i in h){print i*10,h[i]/NR}}' infile

but it does not work.

It would work as is if you want values like 2.31313 and 2.31314 to count as separate groups, which may not be very useful, depending on the analysis you need to do. Otherwise you want to round to a fixed number of decimals, e.g. 2.31313 -> 2.31:

awk '{h[sprintf("%.2f",$1) " " sprintf("%.2f",$2)]++}END{for (i in h){print i,h[i]/NR}}' infile

sprintf("%.2f", number) rounds a real number to 2 decimals.
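As for the earlier attempt with int($1/10): the likely reason it does not work is the "i*10" in the END block. The key i is a string such as "2 1", and when awk uses it in arithmetic it converts only the leading number, so the second value is lost. A minimal sketch of a binned version follows, assuming a bin width of 10 (passed in as the made-up variable w), non-negative input values (int() truncates toward zero, so negative values would need extra care), and sample data invented for illustration.

```shell
# Hypothetical non-integer sample data.
printf '23.4 15.2\n35.9 1.7\n27.1 12.8\n' > infile_real

result=$(awk -v w=10 '
  {
    x = int($1 / w) * w        # lower edge of the x bin
    y = int($2 / w) * w        # lower edge of the y bin
    h[x " " y]++               # scale before building the key, not after
  }
  END { for (k in h) print k, h[k] / NR }
' infile_real | sort -k1,1n -k2,2n)

printf '%s\n' "$result"
```

Here 23.4 15.2 and 27.1 12.8 both land in the bin starting at 20 10, and 35.9 1.7 lands in 30 0.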

How can this formula be changed so that it returns not only the probability h[i] but also the pair of numbers to which h[i] corresponds?
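One possible reading of this: the loop variable in "for (i in h)" already holds the pair as a single string "x y", so printing it next to h[i] gives both. If x and y are needed as separate values, the key can be split again. A minimal sketch, assuming the key was built with a single space as separator; the array name p and the output layout are only illustrative.

```shell
printf '23 15\n35 1\n23 15\n' > infile

result=$(awk '
  { h[$1 " " $2]++ }
  END {
    for (k in h) {
      split(k, p, " ")         # p[1] is x, p[2] is y
      printf "x=%s y=%s p=%.4f\n", p[1], p[2], h[k] / NR
    }
  }
' infile | sort)

printf '%s\n' "$result"
```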