Use awk to read multiple files twice

Hi folks
I have a situation where I am trying to use awk to compute the mean and standard deviation of a variable that spans multiple files. Each file has the same layout: 3 columns, with a comma as the delimiter.

File1 layout:

col1,col2,col3

0,0-1,0.2345
1,1-2,0.3456
1,1-2,0.4567
2,2-3,0.5678

What I need to do is first scan each file (I have at least 200 files) and estimate the global mean of the third column for each index value given in the first column, over all files, and then make a second pass to calculate the global standard deviation, again for each index value in the first column over all files, using the global mean I calculated previously.

I thought of using awk for this because my files are big and other scripting languages like Perl, or plain bash, are turning out to be too slow. I did a test and awk can read these huge files line by line really quickly, but I am stuck on how to implement the actual calculation in awk.

Any help will be very useful.
Thanks

I think it is a bit confusing.

Could you please give an example with (at least) 2 input files and show us the expected output file?

Thanks for clarifying a little more

Sure, I will try my best. Okay, here are two files, test1.txt and test2.txt:

test1.txt

0,0.0-0.1,0.00087
0,0.0-0.1,0.00089
1,0.1-0.2,0.00100
1,0.1-0.2,0.00074
1,0.1-0.2,0.00097
2,0.2-0.4,0.00208
2,0.2-0.4,0.00218
2,0.2-0.4,0.00227
3,0.9-1.0,0.00845
3,0.9-1.0,0.01016

test2.txt

0,0.0-0.1,0.00118
0,0.0-0.1,0.00131
0,0.0-0.1,0.00101
1,0.1-0.2,0.00015
1,0.1-0.2,0.00038
1,0.1-0.2,0.00122
2,0.2-0.4,0.00219
2,0.2-0.4,0.00214
2,0.2-0.4,0.00216
2,0.2-0.4,0.00199
3,0.9-1.0,0.01002
3,0.9-1.0,0.01070

The final output should be:

index    mean    std
0        m0      std0
1        m1      std1
2        m2      std2
3        m3      std3

where m0-m3 are the global mean values for each index and std0-std3 are the corresponding global standard deviations. The index values are the ones given in the first column of each file, and the third column is the one I have to find the global mean and std for.
Now, I can calculate just the mean over all files fine. But once I know the mean, the problem is how to force awk to rescan all the files and use this mean to calculate the standard deviation.

Here's my awk code for calculating the global mean:

#!/bin/awk -f
BEGIN{
    FS = ",";
    OFS = "\t";
    glbcnt[""]=0;
    glbacc[""]=0;
    glbprcn[""]=0;
}
{
    #print FILENAME;
    #if(FNR > 1){
        glbacc[$1] += $3;
        glbcnt[$1]++;
     #   }
}
END{
    for (i in glbcnt){
        if(i != ""){
            # global mean per index = accumulated sum / count
            glbacc[i] = glbacc[i]/glbcnt[i];
            print i, glbacc[i], glbcnt[i];
        }
    }
}

which I call like this:

awk -f test.awk test*.txt

where test.awk is my awk script and test*.txt are all my txt files holding the 3-column values.
Hope it's clearer now.

$ cat myawk
BEGIN{
    FS = ",";
    OFS = "\t";
    glbcnt[""]=0;
    glbacc[""]=0;
    glbprcn[""]=0;
}
{
k[$1]=$2
e[$1":"NR]=$3
n=NR
}
END{
    print "indx","range","deviation","mean","num of elements";
    for(i in k){
        for (j=0;++j<=n;){
            if (e[i":"j]=="") continue
            # accumulate the per-index sum and count of elements
            glbacc[i]+=e[i":"j]
            glbcnt[i]++
        }
    }
    for(o in k){
        for (p=0;++p<=n;){
            if (e[o":"p]=="") continue
            delta[o":"p]=(e[o":"p]-glbacc[o])
            sumdelta[o]+=(delta[o":"p]^2)
        }
    }
    for(d in glbacc){
        if(d=="") continue
        glbacc[d] = glbacc[d]/glbcnt[d];
        drift[d]=sqrt(sumdelta[d]/glbcnt[d]);
        print d,k[d],drift[d],glbacc[d],glbcnt[d];
    }

}
$ awk -f myawk t1 t2
indx    range    deviation    mean    num of elements
0    0.0-0.1    0.00421142    0.001052    5
1    0.1-0.2    0.0037352    0.000743333    6
2    0.2-0.4    0.012866    0.00214429    7
3    0.9-1.0    0.0295094    0.0098325    4
$

This code assumes that the deviation is calculated with the following pseudo-formula:

squareroot_of( sum_of( square_of(element - mean_of_elements) ) / number_of_elements )
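In conventional notation that is just the population standard deviation, i.e. the same formula written out:

\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}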

Feel free to adapt it to your needs.


Hi ctsgnb,
First of all, please accept my thanks for taking the time to offer a solution. I appreciate your effort.
Your solution looks good, and frankly I had thought along similar lines, but the one problem is that you are in effect storing all the values from column 3 in awk arrays and then going over the arrays twice. I will have at least 200+ files to process, each with at least 11 million records, so my main concern is that storing all these values in arrays will be a huge drain on memory. That is why I was looking for a way to do this without storing all the values in memory.
But again, thanks a lot for your response, and maybe there is just no such solution out there.

One way or another, you will have to do 2 passes, since the mean is needed to calculate the deviation.

If you can't do the 2 passes because of memory limitations, you can either split the task into smaller chunks that the memory can handle, and/or use a temporary file that you then scan to calculate your deviation.
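For instance, a rough sketch of how the temporary file could be used (pass1.awk, pass2.awk and means.txt are just illustrative names, assuming the layout of your example files with no header lines): the first pass writes one small line per index with its global mean, and the second pass loads that file before re-reading the data.

# pass1.awk -- write "index,mean" for every index value
BEGIN { FS = OFS = "," }
{ sum[$1] += $3; cnt[$1]++ }
END { for (i in cnt) print i, sum[i] / cnt[i] }

# pass2.awk -- load the means, then re-read the data and accumulate squared deviations
BEGIN { FS = ","; OFS = "\t" }
FNR == NR { mean[$1] = $2; next }   # the first file on the command line is the means file
{ d = $3 - mean[$1]; sq[$1] += d * d; n[$1]++ }
END {
    print "index", "mean", "std"
    for (i in n) print i, mean[i], sqrt(sq[i] / n[i])
}

called as:

awk -f pass1.awk test*.txt > means.txt
awk -f pass2.awk means.txt test*.txt

Only one sum/count (or mean/sum-of-squares) per index stays in memory, no matter how many records the files hold.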


The real point is: calculation and processing of such a data volume should be done at the database level, not at the scripting level.

Yeah, I think those are the only options. I was hoping there was some neat trick in awk that I don't know of that would let me scan the same input files twice, but it seems I was just imagining it.

This operation has to be done every day on a set of files that gets updated daily, and the files themselves are generated by a C program. There is a master program that takes care of everything overall, so there are other implications that make a database implementation, though logical, not the preferred choice.

Thanks again for the feedback.

Maybe you should consider using a RAM disk to store these huge, daily-updated data files...

You could feed the files twice, like so:

awk -f  test.awk test*.txt test*.txt

Then you can test whether the second round of files has started, for example like this:

FILENAME==ARGV[1] && NR>FNR{secondround=1}
secondround{print "Making SD calculations"}

ARGV[1] contains the first filename. If the current filename FILENAME equals ARGV[1] for the second time, you will have processed all the files once...
This way you would need to store a lot less information.

------------
Or you can do it like this, setting the variable halfway through the argument list. Then you do not have to do the filename test inside the script, which might be a speed advantage:

awk -f  test.awk test*.txt secondround=1 test*.txt

and inside the awk script:

secondround{print "Making SD calculations"}
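To make this concrete, here is a minimal sketch of the whole mean/SD calculation using that second form of the call (assuming the comma-separated layout of your example files with no header lines; the names sum, cnt, mean, sq are just illustrative):

BEGIN { FS = ","; OFS = "\t" }

# first round: only accumulate the per-index sum and count
!secondround { sum[$1] += $3; cnt[$1]++; next }

# first record of the second round: turn the sums into means, once
!meansdone { for (i in cnt) mean[i] = sum[i] / cnt[i]; meansdone = 1 }

# second round: accumulate squared deviations from the global means
{ d = $3 - mean[$1]; sq[$1] += d * d }

END {
    print "index", "mean", "std"
    for (i in cnt) print i, mean[i], sqrt(sq[i] / cnt[i])
}

called as before:

awk -f test.awk test*.txt secondround=1 test*.txt

Only a sum, a count and a sum of squared deviations per index value are kept in memory, however many records there are.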

Yup, your first proposal

FILENAME==ARGV[1] && NR>1{secondround=1} secondround{print "Making SD calculations"}

doesn't work, since as soon as NR>1 (even if we are still in the first round) it will set secondround to 1.

But the following should do the trick

awk '!f[FILENAME":"FNR]++ {
    print "firstround"
    next
}
{
    print "second round here"
}' tst* tst*

That's right, I meant to write: FILENAME==ARGV[1] && NR>FNR. Anyway, I much prefer my second option; it is the cleanest.
I think your suggestion would mean a lot of extra array elements, no? Why not leave out ":"FNR and use a variable?
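Something like this, for example (seen and secondround are just illustrative names, one possible way to read that suggestion):

FNR == 1 && seen[FILENAME]++ { secondround = 1 }
secondround { print "second round here"; next }
{ print "firstround" }

That stores one array element per file name instead of one per input line.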

@Scruti,

Yeah, I agree that setting the variable directly in the ARGV line is the best option. I guess we just cross-posted, so when I wrote my previous post I didn't see your latest update (#9) :wink:

Actually, I updated my post after you pointed out the mistake (see the edit comment)...

Thanks, both of you guys. This discussion has been really useful, especially this latest round of comments from you both, Scrutinizer and ctsgnb. I did not know that you could actually set an awk variable in the middle of the file list. Okay, this may be worth a try.
Thanks again for this.