Hi all!
I have the following problem: I would like to use awk to calculate the probability of occurrence of a pair of numbers x and y. In other words, how frequently does each pair appear?
For a single integer x ranging, for example, from 1 to 100, the awk one-liner has the form:
awk 'BEGIN{for(i=1;i<=100;i++) h[i]=0}{h[$1]+=1}END{for(i=1;i<=100;i++) print i, h[i]/NR}' datafile
where datafile contains the numbers x:
#x
2
65
100
...
My question is how to extend the above awk one-liner to a pair of numbers x and y. In this case the datafile looks as follows:
#x #y
23 15
35 1
23 15
...
Thanks in advance.
something like this:
# cat infile
23 15
35 1
23 15
# awk '{h[$1" "$2]++}END{for (i in h){print i,h[i]/NR}}' infile
35 1 0.333333
23 15 0.666667
HTH
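One caveat: if the real datafile keeps the "#x #y" header line shown in the question, NR counts that line too and the probabilities come out too small. A minimal sketch that skips comment lines and keeps its own row counter; the function name pair_prob and the counter n are made up here for illustration:

```shell
# pair_prob is a hypothetical helper name; it reads the file given as $1
pair_prob() {
    awk '
        /^#/ || NF < 2 { next }      # skip header/comment lines and short lines
        { h[$1 " " $2]++; n++ }      # count each (x, y) pair and the rows actually used
        END { if (n) for (k in h) print k, h[k] / n }
    ' "$1" | sort -n                 # for-in order is unspecified in awk, so sort
}
```

Call it as pair_prob datafile. The sort is only there to make the output order predictable.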
How can your one-liner be extended to the case where the infile contains non-integer numbers?
I tried this:
awk '{h[int($1/10)" "int($2/10)]++}END{for (i in h){print i*10,h[i]/NR}}' infile
but it does not work.
The one-liner would work as is if you want to group by exact values, so that 2.31313 and 2.31314 count as different values, which may not be very useful; it depends on the analysis you need to do. Otherwise you want to round to a fixed number of decimals, e.g. 2.31313 -> 2.31
awk '{h[sprintf("%.2f",$1) " " sprintf("%.2f",$2)]++}END{for (i in h){print i,h[i]/NR}}' infile
sprintf("%.2f", number) rounds a real number to 2 decimals.
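As for the int($1/10) attempt above: the array key i is a string such as "2 1", so i*10 only converts the leading number and the second bin is lost. A minimal sketch that splits the key back into its two parts before scaling, assuming a bin width of 10 and non-negative data (int() truncates toward zero, so negative values would land in the wrong bin); bin_prob and w are names made up here:

```shell
# bin_prob is a hypothetical helper; $1 is the file, $2 an optional bin width (default 10)
bin_prob() {
    awk -v w="${2:-10}" '
        /^#/ || NF < 2 { next }                      # skip header/comment lines
        { h[int($1 / w) " " int($2 / w)]++; n++ }    # key is the pair of bin indices
        END {
            for (k in h) {
                split(k, b, " ")                     # b[1], b[2] are the two bin indices
                print b[1] * w, b[2] * w, h[k] / n   # lower bin edges, then probability
            }
        }
    ' "$1" | sort -n
}
```

With rows 23.4 15.9, 35.2 1.1 and 27.0 12.5 this puts the first and third row into the same bin (20, 10).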
How can this formula be changed so that it returns not only the probability h[i] but also the pair of numbers to which h[i] corresponds?
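One possible way, sketched under the assumption that infile has no header line (so NR is the number of data rows): index the array with h[$1,$2], which joins the two fields with awk's built-in SUBSEP, and split the key again in the END block to get x and y back as separate values. The name pair_table and the output layout are only an illustration:

```shell
# pair_table is a hypothetical helper name; it reads the file given as $1
pair_table() {
    awk '
        { h[$1, $2]++ }                      # key is $1 SUBSEP $2
        END {
            for (k in h) {
                split(k, p, SUBSEP)          # p[1] = x, p[2] = y
                printf "x=%s y=%s p=%.4f\n", p[1], p[2], h[k] / NR
            }
        }
    ' "$1" | sort                            # for-in order is unspecified, so sort
}
```

Once the key is split, p[1] and p[2] can be printed in any layout or used in further arithmetic.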