Get column average using ID

I have a file that looks like this:

id window BV
1 1 0.5
1 2 0.2
1 3 0.1
2 1 0.5
2 2 0.1
2 3 0.2
3 1 0.4
3 2 0.6
3 3 0.8

Using awk, how would I get the average BV for window 1? Output like this:

window avgBV
1 0.47
2 0.23
3 0.37

How does your output correspond to your input?

The output is showing the average BV for window 1, 2 etc.

So the second column? Your suggested output does not seem to correspond to averages..
The average for window 1 is 0.466667 for example ( ( 0.5 + 0.5 + 0.4 ) / 3 )...

fixed

Not entirely. Is this homework?

Not homework, but research problem. I am a PhD student in quantitative animal genomics. I am a geneticist, but working on learning unix programming. I have large output files (500 mb) that are analyzing effects from 50,000 DNA markers in 2000 animals. The program divides the genome into 1,000,000 base pair segments called a 'window'. I need to extract the breeding value (BV) for each window averaged across the 2000 animals. The table above is an example of what my output looks like. I apologize if I am not meeting the formatting requirements for this forum, it is my first post. I have been unable to find an awk solution anywhere else, hence my post.

edit: I should clarify that even though I am a student this is not for a class, but real data that is part of my research.

OK, try something like:

awk 'NR==1; NR>1{A[$2]+=$3; C[$2]++} END{for (i in A) print i,A/C}' OFMT='%.2f'  file
2 Likes

OK that worked beautifully. Thank you very much.

Now, I am legitimately working on learning to use awk on the fly without pouring over forums to get what I need. You seem to have mastered this already. Would you be willing to break down the code you just wrote and tell me what each piece is doing? This is not something I see done very often , and most of the resources on unix programming do not go into enough detail on what is actually being issued. I am assuming others would benefit from this as well. Thank you in advance if you are willing to do that.

Sure, no problem:

awk '                
  NR==1                                # If it is the first line in the file it is the header, perform the default action, i.e. print the line { print $0 }
  NR>1{                                # For Any subsequent line, for every line :
    A[$2]+=$3                          # Create an element in associative array A with the second field as the index and the third field to its value..
    C[$2]++                            # Increase the counter in associative array C for field $2 
  } 
  END{                                 # When all lines are processed
    for (i in A) print i,A/C     # Run throught the array elements and print the index and the total value / divided by the number of occurences , using OFMT for the format.
  }
' OFMT='%.2f' file                    # set the format to 2 decimals and specify the file name.
2 Likes

Amazing. Thank you!