FID IID MISS_PHENO N_MISS N_GENO F_MISS
12AB43131 12AB43131 N 17774 906341 0.01961
65HJ87451 65HJ87451 N 10149 906341 0.0112
43JJ21345 43JJ21345 N 2826 906341 0.003118
I would like an awk script that extracts the average, min and max values from the last column. The following code prints the minimum value:
awk 'min=="" || $6 < min {min=$6} END{print min}' file
Putting > instead of < naturally gives me the header instead. Putting NR>1 gives me the last number of the column, here 0.003118. If I remove the header, it works fine, but I would like to skip that step. The following code, however, works fine for finding the average:
More info on file: It could be as much as 1 million lines in total. The numbers are usually between 0 and 1, ie lots of decimal numbers. After the header, the column in question is only numeric. There will always be a number, but it might be either 0 or 1.
Thank you both,
both of the scripts work, though I understand more of the first one as a newbie. Do any of you know why it kept giving me the last line? What in the code I was using was telling awk to do that?
Not sure I understand. You were computing the minimum, and the last line's entry is the minimum - so outputting that line's value is exactly what you were asking for.
Ah, I should specify:
When I tried running the script I posted in the first thread, it kept giving me the last value in the actual huge file (not the example I posted), no matter if it was the smallest number or not. So something in the script is apparently telling it to post the last line, but I can't see what it is. Not a big deal, just want to understand awk a bit more.
Hmmm, feeling uneasy - how do you expect us to give you a meaningful answer if you are supplying incomplete | insufficient | malphrased | meaningless | wrong input data?
We don't even know which of the two scripts you posted in post #1 you were talking of. And you ran it on data we have never seen.
I can not post an example with 1 million lines on here. The data is exactly the same in terms of style of content, it just goes on for another 999.997 lines.
I stated in the first post that writing NR>1 gave me the last number of the column. I'm sorry I didn't realize that in my example this number was in fact the smallest one, but in my original file it clearly is not, that I can see from using the head-command. Using tail, I accidentally discovered that it was the last entry in the column that was posted instead.
As stated, both scripts work. I used it on a data set I have of about 300 rows, which I'm able to check using excel, something I can't do with the largest data set, as excel will most likely crash.
Yes, pamu, yours worked as well. I tested both on the 300 row file, and they gave the exact same output, and was verified by excel.
I hope to be able to understand more of the content of your script some day, but right now I have a deadline, so I will have to get back to it later. Looks really interesting though!
I did some minor changes to RudiC's script to fix line breaks etc. Is that what you were referring to, pamu?
Yes, this is why I posted the question in the first place. I did my entire pipeline of analysis (including much more than just this) using excel to do editing etc, but it just kept crashing and saving just a fraction of the file etc.
Redoing with awk now, which runs really smoothly in comparison, just need to get all the scripts right.