Average, min and max in file with header, using awk

Hi,
I have a file which looks like this:

     FID        IID MISS_PHENO   N_MISS   N_GENO   F_MISS
  12AB43131   12AB43131          N    17774   906341  0.01961
  65HJ87451   65HJ87451          N    10149   906341   0.0112
  43JJ21345   43JJ21345          N     2826   906341 0.003118

I would like an awk script that extracts the average, min and max values from the last column. The following code prints the minimum value:

awk 'min=="" || $6 < min {min=$6} END{print min}' file 

Putting > instead of < naturally gives me the header instead. Adding NR>1 gives me the last number in the column, here 0.003118. If I remove the header, it works fine, but I would like to skip that step. The following code, however, works fine for finding the average:

awk 'NR>1{sum+=$6}END{print "Average missing = ",sum/(NR-1)}' missing_200213.imiss

More info on the file: it could be as long as 1 million lines in total. The numbers are usually between 0 and 1, i.e. lots of decimals. After the header, the column in question is purely numeric. There will always be a number, but it might be exactly 0 or 1.

Thank you!

Well, you're on the right track, just need to get your ducks in a row. Look at this and adapt it to your needs:

awk     'NR==1  {max=0;min=1}           # seed min/max: the data lie between 0 and 1
         NR>1   {sum+=$6                # accumulate for the average
                 if (min>$6) min=$6
                 if (max<$6) max=$6
                 cnt = NR               # remember the last line number for END
                }
         END    {print  sum/(cnt-1), min, max}
        ' file
0.0113093 0.003118 0.01961
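
By the way, the min=1 / max=0 seeds rely on your statement that the values lie between 0 and 1. If that ever changes, seeding from the first data row is safer; a sketch along the same lines:

awk     'NR==2  {min=max=$6}            # seed min and max from the first data row
         NR>1   {sum+=$6
                 if (min>$6) min=$6
                 if (max<$6) max=$6
                }
         END    {print  sum/(NR-1), min, max}
        ' file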

Some further reading on awk might be really helpful.

$  awk 'NR>1{a+=$NF;
  max=max>$NF?max:$NF;
  min=min>$NF||!min?$NF:min}
  END{print a/(NR-1),max,min}' file

0.0113093 0.01961 0.003118
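In case the ternaries are hard to read: max=max>$NF?max:$NF keeps the larger value, and min=min>$NF||!min?$NF:min takes the new value when it is smaller or when min is still unset (an uninitialized awk variable is 0, so !min is true on the first data row). Note that !min would also fire if the running minimum were exactly 0, which post #1 says can happen; a min=="" test avoids that. Spelled out with explicit ifs, just as a sketch:

$ awk 'NR>1{a+=$NF;                      # running sum of the last field
  if ($NF>max) max=$NF;                  # max starts at 0, so the first value wins
  if (min=="" || $NF<min) min=$NF}       # seed min on the first data row
  END{print a/(NR-1),max,min}' file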

Thank you both,
both scripts work, though as a newbie I understand the first one better. Does either of you know why it kept giving me the last line? What in the code I was using was telling awk to do that?

Not sure I understand. You were computing the minimum, and the last line's entry is the minimum - so outputting that line's value is exactly what you were asking for.
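
For instance, on the sample from post #1 (assuming the NR>1 variant looked something like this), the minimum and the last value coincide:

$ awk 'NR>1 && (min=="" || $6<min) {min=$6} END{print min}' file
0.003118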

Ah, I should specify:
When I ran the script I posted in the first post, it kept giving me the last value in the actual huge file (not the example I posted), whether or not it was the smallest number. So something in the script is apparently telling it to print the last line's value, but I can't see what it is. Not a big deal, I just want to understand awk a bit more.

Hmmm, feeling uneasy - how do you expect us to give you a meaningful answer if you are supplying incomplete | insufficient | misphrased | meaningless | wrong input data?
We don't even know which of the two scripts you posted in post #1 you were talking of. And you ran it on data we have never seen.

  1. I can not post an example with 1 million lines on here. The data is exactly the same in style of content; it just goes on for another 999,997 lines.
  2. I stated in the first post that writing NR>1 gave me the last number of the column. I'm sorry I didn't realize that in my example this number was in fact the smallest one, but in my original file it clearly is not, as I can see using the head command. Using tail, I accidentally discovered that it was the last entry in the column that was printed instead.
  3. As stated, both scripts work. I used them on a data set I have of about 300 rows, which I'm able to check using Excel, something I can't do with the largest data set, as Excel will most likely crash.

Have you tried my solution?

Please check below:

$ cat file
   FID        IID MISS_PHENO   N_MISS   N_GENO   F_MISS
  12AB43131   12AB43131          N    17774   906341  0.01961
  65HJ87451   65HJ87451          N    10149   906341   0.0112
  43JJ21345   43JJ21345          N     2826   906341 0.003118
  43JJ21345   43JJ21345          N     2826   906341 0.3119
  43JJ21345   43JJ21345          N     2826   906341 0.3118

$ awk 'NR>1{a+=$NF;
  max=max>$NF?max:$NF;
  min=min>$NF||!min?$NF:min}
  END{print a/(NR-1),max,min}' file
0.131526 0.3119 0.003118

The same result is obtained using RudiC's script.
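
For reference, that run looks something like this on the same test file (RudiC's script condensed to one line):

$ awk 'NR==1{max=0;min=1} NR>1{sum+=$6; if(min>$6)min=$6; if(max<$6)max=$6; cnt=NR} END{print sum/(cnt-1), min, max}' file
0.131526 0.003118 0.3119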

Yes, pamu, yours worked as well. I tested both on the 300-row file; they gave exactly the same output, which was verified by Excel.

I hope to be able to understand more of the content of your script some day, but right now I have a deadline, so I will have to get back to it later. Looks really interesting though!

I made some minor changes to RudiC's script to fix line breaks etc. Is that what you were referring to, pamu?

Yeah, sure. :)

You can use either script to get your desired output.

And I think you can use awk on large files. Unlike Excel, it won't crash. :D (probably)
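
If you ever want a second opinion without Excel, a sort-based cross-check is one option (a sketch; sort -n is enough here since the values are plain decimals without exponents). On the test file from my post above it prints the min and then the max:

$ tail -n +2 file | awk '{print $NF}' | sort -n | sed -n '1p;$p'
0.003118
0.3119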

Yes, this is why I posted the question in the first place. I did my entire pipeline of analysis (including much more than just this) using Excel for editing etc., but it just kept crashing, saving only a fraction of the file, and so on.

I'm redoing it with awk now, which runs really smoothly in comparison; I just need to get all the scripts right.

Thank you so much for all your help!