Summing columns over group of lines

sdp · April 24, 2013, 7:38am

I have an input file that looks like:

ID1 V1 ID2 V2 P1 P2 P3 P4 ..... n no. of columns
1 1 1 1 1.0000 1.0000 1.0000 1.0000
1 1 1 2 0.9999 0.8888 0.7777 0.6666
1 2 1 1 0.8888 0.7777 0.6666 0.5555
1 2 1 2 0.7777 0.6666 0.5555 0.4444
2 1 1 1 0.6666 0.5555 0.4444 0.3333
2 1 1 2 0.5555 0.4444 0.3333 0.2222
2 2 1 1 0.4444 0.3333 0.2222 0.1111
2 2 1 2 0.3333 0.2222 0.1111 0.1234

For each group of four lines, I would like to add up the values in column 5 (P1), and likewise for every column after it. The output needs to look like:

ID1 ID2 P1 P2 P3 P4 .....n columns
1 1 3.6664 3.3331 ...... so on
2 1 1.9998 1.5554 ....... so on

Is there a way to do this using an awk script?

Don_Cragun · April 24, 2013, 8:38am

The short answer is yes.
But I'm not sure I understand your requirements. If you want help creating an awk script to perform this task, please answer the following questions:

1. Do you want input fields 2 and 4 to be removed from every input line?
2. If the value in input field 1 is not a constant in each set of four input lines, what happens?
   a. Is that set skipped? If so, should an error be printed?
   b. Should the value from the first line in the set be printed?
   c. Should the value from the last line in the set be printed?
3. If the value in input field 3 is not a constant in each set of four input lines, what happens?
   a. Is that set skipped? If so, should an error be printed?
   b. Should the value from the first line in the set be printed?
   c. Should the value from the last line in the set be printed?
4. Is the number of fields a constant for a given input file?

sdp · April 24, 2013, 8:48am

Hi Don, thanks for taking the time. Below are the answers to your questions:

1. Do you want input fields 2 and 4 to be removed from every input line?
Yes, they need to be removed.

2. If the value in input field 1 is not a constant in each set of four input lines, what happens?
   - Is that set skipped? If so, should an error be printed?
   - Should the value from the first line in the set be printed?
   - Should the value from the last line in the set be printed?
Values in fields 1 and 3 will remain constant for every set of 4 lines. Hence, for every group of four lines, I need the values in the first line for these fields.

3. If the value in input field 3 is not a constant in each set of four input lines, what happens?
   - Is that set skipped? If so, should an error be printed?
   - Should the value from the first line in the set be printed?
   - Should the value from the last line in the set be printed?
As stated above, the values in input fields 1 and 3 will remain constant for every set of four lines.

4. Is the number of fields a constant for a given input file?
Yes, the number of fields is constant for a given input file.

The following is the code that I am working with right now, though it isn't working and does not have all the features I need:

awk '{for(j=5;j<=NF;j++) {!(NR%4){sum+=$j}{printf("%04d ", sum/2)}} {print "\n"}}'

Thanks again!!

Don_Cragun · April 24, 2013, 10:26am

The following awk script is a little more complex than you requested. It allows processing of multiple input files, prints an end-of-file separator if more than one input file is given, and prints a warning if there are lines left at the end of a file that don't make up a complete 4-line set.

Since you said all values in a 4-line set are constant in fields 1 and 3, I used the values in the last line of the set instead of in the 1st line of the set (it saved me from needing to create two more variables). If you really need the 1st line's values instead of the last line's values, it won't be hard to change this script to do that.

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
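If you'd rather not hard-code that choice, something like the following could pick an awk for you. This is only a sketch: the function name pick_awk and the variable AWK are made up for illustration, and it just checks whether the paths named above exist before falling back to nawk and then to plain awk.

```shell
# Hypothetical helper: print the name or path of the awk to use.
pick_awk() {
    # Prefer the standards-conforming versions found on Solaris/SunOS.
    for candidate in /usr/xpg4/bin/awk /usr/xpg6/bin/awk; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Otherwise use nawk if it is on the PATH, else plain awk.
    if command -v nawk >/dev/null 2>&1; then
        printf '%s\n' nawk
    else
        printf '%s\n' awk
    fi
}

AWK=$(pick_awk)
```

You would then start the script with "$AWK" in place of awk.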

awk '
FNR == 1 {
        # Check for incomplete set at end of previous file.
        if(l) {
                printf("%d line(s) skipped at end of %s.\n", l, file)
                for(i = 5; i <= n; i++) s[i] = 0
        }
        # Print file trailer if more than 1 file has been seen.
        if(nf++) printf("================== End of data from file %s\n", file)
        # Process headers: Print output headers, determine field count.
        printf("%s %s ", $1, $3)
        for(i = 5; i <= NF; i++) printf("%s%s", $i, i == NF ? "\n" : " ")
        n = NF          # set number of fields for this file
        l = 0           # set number of lines in current set
        file = FILENAME # save filename for diagnostics
        next
}
{       # Add this line to the per-column sums (s[i] is the sum for field i).
        for(i = 5; i <= n; i++) s[i] += $i}
++l == 4 {
        # A complete set has been read: print its sums and clear them.
        l = 0
        printf("%d %d ", $1, $3)
        for(i = 5; i <= NF; i++) {
                printf("%6.4f%s", s[i], i == NF ? "\n" : " ")
                s[i] = 0
        }
}
END {   if(l) printf("%d line(s) skipped at end of %s.\n", l, file)
        if(nf > 1) printf("================== End of data from file %s\n", file)
}' input
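
If the first line's values of fields 1 and 3 are needed rather than the last line's, one way to do it is to save them when a new set starts. The following is only a minimal sketch under some assumptions: the function name sum_sets_first and the variables id1 and id2 are made up for illustration, every file is assumed to start with one header line, and the end-of-file separator and leftover-line warning from the script above are left out to keep it short.

```shell
# Hypothetical helper: sum fields 5..NF over each 4-line set, printing
# fields 1 and 3 as they appear on the FIRST line of the set.
sum_sets_first() {
    awk '
    FNR == 1 {
            # Header line: print the output header, reset state for this file.
            printf("%s %s", $1, $3)
            for(i = 5; i <= NF; i++) printf(" %s", $i)
            printf("\n")
            l = 0
            split("", s)    # portable way to empty the array of sums
            next
    }
    l == 0 {
            # First line of a new set: remember its ID fields.
            id1 = $1
            id2 = $3
    }
    {       for(i = 5; i <= NF; i++) s[i] += $i }
    ++l == 4 {
            # Set is complete: print saved IDs and sums, then clear the sums.
            l = 0
            printf("%s %s", id1, id2)
            for(i = 5; i <= NF; i++) {
                    printf(" %6.4f", s[i])
                    s[i] = 0
            }
            printf("\n")
    }' "$@"
}
```

Call it as sum_sets_first input, or pipe data into it with no arguments.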

sdp · April 24, 2013, 10:33am

Many thanks, Don!! This certainly is more complex than what I was trying to do, but it does the job perfectly.

Thanks again!!