Get column average using ID

jwbucha · September 10, 2013, 12:13pm

I have a file that looks like this:

id window BV
1 1 0.5
1 2 0.2
1 3 0.1
2 1 0.5
2 2 0.1
2 3 0.2
3 1 0.4
3 2 0.6
3 3 0.8

Using awk, how would I get the average BV for window 1? Output like this:

window avgBV
1 0.47
2 0.23
3 0.37

Scrutinizer · September 10, 2013, 12:30pm

How does your output correspond to your input?

jwbucha · September 10, 2013, 12:32pm

The output is showing the average BV for window 1, 2 etc.

Scrutinizer · September 10, 2013, 12:38pm

So the second column? Your suggested output does not seem to correspond to averages..
The average for window 1 is 0.466667 for example ( ( 0.5 + 0.5 + 0.4 ) / 3 )...

jwbucha · September 10, 2013, 12:40pm

fixed

Scrutinizer · September 10, 2013, 1:37pm

Not entirely. Is this homework?

jwbucha · September 10, 2013, 1:45pm

Not homework, but research problem. I am a PhD student in quantitative animal genomics. I am a geneticist, but working on learning unix programming. I have large output files (500 mb) that are analyzing effects from 50,000 DNA markers in 2000 animals. The program divides the genome into 1,000,000 base pair segments called a 'window'. I need to extract the breeding value (BV) for each window averaged across the 2000 animals. The table above is an example of what my output looks like. I apologize if I am not meeting the formatting requirements for this forum, it is my first post. I have been unable to find an awk solution anywhere else, hence my post.

edit: I should clarify that even though I am a student this is not for a class, but real data that is part of my research.

Scrutinizer · September 10, 2013, 2:03pm

OK, try something like:

awk 'NR==1; NR>1{A[$2]+=$3; C[$2]++} END{for (i in A) print i,A/C}' OFMT='%.2f'  file

jwbucha · September 10, 2013, 2:12pm

OK that worked beautifully. Thank you very much.

Now, I am legitimately working on learning to use awk on the fly without pouring over forums to get what I need. You seem to have mastered this already. Would you be willing to break down the code you just wrote and tell me what each piece is doing? This is not something I see done very often , and most of the resources on unix programming do not go into enough detail on what is actually being issued. I am assuming others would benefit from this as well. Thank you in advance if you are willing to do that.

Scrutinizer · September 10, 2013, 5:48pm

Sure, no problem:

awk '                
  NR==1                                # If it is the first line in the file it is the header, perform the default action, i.e. print the line { print $0 }
  NR>1{                                # For Any subsequent line, for every line :
    A[$2]+=$3                          # Create an element in associative array A with the second field as the index and the third field to its value..
    C[$2]++                            # Increase the counter in associative array C for field $2 
  } 
  END{                                 # When all lines are processed
    for (i in A) print i,A/C     # Run throught the array elements and print the index and the total value / divided by the number of occurences , using OFMT for the format.
  }
' OFMT='%.2f' file                    # set the format to 2 decimals and specify the file name.

jwbucha · September 10, 2013, 7:23pm

Amazing. Thank you!