Sum specified values (columns) per row

pathunkathunk · April 18, 2014, 6:59pm

Hello out there,

file.txt:

comp51820_c1_seq1	42	N	0:0:0:0:0:0	1:0:0:0:0:0	0:0:0:0:0:0	3:0:0:0:0:0	0:0:0:0:0:0
comp51820_c1_seq1	43	N	0:0:0:0:0:0	0:1:0:0:0:0	0:0:0:0:0:0	0:3:0:0:0:0	0:0:0:0:0:0
comp51820_c1_seq1	44	N	0:0:4:0:3:1	0:0:1:9:0:0	10:0:0:0:0:0	0:3:3:2:2:6	2:2:2:5:60:3
comp51820_c1_seq1	45	N	0:4:0:0:5:0	0:1:8:0:0:0	0:0:0:0:9:0	0:3:0:0:6:0	0:0:0:0:13:0

I would like to compare the sums of the colon-separated values in columns 4-8. Specifically, I would like to print the lines for which the sum of each column (4 through 8) is at least 8. Caveat: I need to ignore the fifth value in each column (this excludes line four in file.txt).

Desired output:

comp51820_c1_seq1	44	N	0:0:4:0:3:1	0:0:1:9:0:0	10:0:0:0:0:0	0:3:3:2:2:6	2:2:2:5:60:3

So far I have tried to use awk to sum columns, but I'm not sure how to compare multiple sums per line. Plus it seems there must be a better way but google hasn't given me an answer so far.

awk 'BEGIN { FS = ":| " } ; { col1+=$4+$5+$6+$7+$8+$9 }

vgersh99 · April 18, 2014, 7:31pm

awk -f path.awk myInputFile where path.awk is:

{
  f=0
  for(i=4;i<=NF;i++) {
    n=split($i,a,":")
    s=0
    for(j=1;j<n;j++) s+=a[j]
    if (s>=8) {
       f++
       break
    }
  }
}
f

pathunkathunk · April 18, 2014, 7:52pm

If I'm not mistaken, this seems to print all files with at least one value >=8. I get the output:

comp51820_c1_seq1	44	N	0:0:4:0:3:1	0:0:1:9:0:0	10:0:0:0:0:0	0:3:3:2:2:6	2:2:2:5:60:3
comp51820_c1_seq1	45	N	0:4:0:0:5:0	0:1:8:0:0:0	0:0:0:0:9:0	0:3:0:0:6:0	0:0:0:0:13:0

However I would like to print only lines where ALL columns 4-8 have sum >=8, EXCLUDING the fifth value.

vgersh99 · April 18, 2014, 8:20pm

ah, ok:

{  
    f=0   
    for(i=4;i<=NF;i++) { 
       n=split($i,a,":")
       s=0  
       for(j=1;j<n;j++) s+=a[j] 
       if (s>=8)  f++
   } 
}
 f==(NF-4)

pathunkathunk · April 18, 2014, 8:35pm

Thank you, this works. Would you mind also showing how I would do it if I did not want to exclude the fifth value? So just print lines where the sum of ALL values in each of columns 4-8 is at least 8? I can't tell which part of the code addresses this part of it.

vgersh99 · April 18, 2014, 8:57pm

for(j=1;j<=n;j++) s+=a[j]

pathunkathunk · April 18, 2014, 9:48pm

This works for the example file, but when I try it with my larger file there are some issues. For example if these lines are the input:

comp51820_c1_seq1	405	N	0:29:0:0:0:0	0:51:0:0:0:0	0:57:0:0:0:0	0:6:0:0:0:0	0:37:0:0:0:0
comp51820_c1_seq1	406	N	0:0:0:29:0:0	0:0:0:51:0:0	0:0:0:57:0:0	0:0:0:6:0:0	0:0:0:37:0:0
comp51820_c1_seq1	407	N	0:0:0:31:0:0	0:0:0:48:0:0	0:0:0:59:0:0	0:0:0:8:0:0	0:0:0:45:0:0
comp51820_c1_seq1	408	N	0:31:0:0:0:0	0:51:0:0:0:0	0:60:0:0:0:0	0:9:0:0:0:0	0:48:0:0:0:0
comp51820_c1_seq1	409	N	0:1:0:0:0:0	0:51:0:0:0:0	0:60:0:0:0:0	0:9:0:0:0:0	0:48:0:0:0:0

The output I get is:

-bash-4.1$ cat path.awk 
{   
	f=0   
	for(i=4;i<=NF;i++) {     
		n=split($i,a,":")     
		s=0     
		for(j=1;j<=n;j++) s+=a[j]     
		if (s>=8)        
		f++   } 
} 
f==(NF-4)
-bash-4.1$ 
-bash-4.1$ awk -f path.awk file.txt
comp51820_c1_seq1	405	N	0:29:0:0:0:0	0:51:0:0:0:0	0:57:0:0:0:0	0:6:0:0:0:0	0:37:0:0:0:0
comp51820_c1_seq1	406	N	0:0:0:29:0:0	0:0:0:51:0:0	0:0:0:57:0:0	0:0:0:6:0:0	0:0:0:37:0:0
comp51820_c1_seq1	409	N	0:1:0:0:0:0	0:51:0:0:0:0	0:60:0:0:0:0	0:9:0:0:0:0	0:48:0:0:0:0

This output is incorrect because the sum of $7 is < 8 in the first two lines, and the sum of $4 < 8 in the third line.

Expected output is:

comp51820_c1_seq1	407	N	0:0:0:31:0:0	0:0:0:48:0:0	0:0:0:59:0:0	0:0:0:8:0:0	0:0:0:45:0:0
comp51820_c1_seq1	408	N	0:31:0:0:0:0	0:51:0:0:0:0	0:60:0:0:0:0	0:9:0:0:0:0	0:48:0:0:0:0

The latest version also don't seem to work with the original example file:

-bash-4.1$ cat file2.txt 
comp51820_c1_seq1	42	N	0:0:0:0:0:0	1:0:0:0:0:0	0:0:0:0:0:0	3:0:0:0:0:0	0:0:0:0:0:0
comp51820_c1_seq1	43	N	0:0:0:0:0:0	0:1:0:0:0:0	0:0:0:0:0:0	0:3:0:0:0:0	0:0:0:0:0:0
comp51820_c1_seq1	44	N	0:0:4:0:3:1	0:0:1:9:0:0	10:0:0:0:0:0	0:3:3:2:2:6	2:2:2:5:60:3
comp51820_c1_seq1	45	N	0:4:0:0:5:0	0:1:8:0:0:0	0:0:0:0:9:0	0:3:0:0:6:0	0:0:0:0:13:0
-bash-4.1$ 
-bash-4.1$ awk -f path.awk file2.txt
-bash-4.1$ awk -f path2.awk file2.txt
comp51820_c1_seq1	44	N	0:0:4:0:3:1	0:0:1:9:0:0	10:0:0:0:0:0	0:3:3:2:2:6	2:2:2:5:60:3
-bash-4.1$ 
-bash-4.1$ cat path2.awk 
{   
	f=0   
	for(i=4;i<=NF;i++) {     
		n=split($i,a,":")     
		s=0     
		for(j=1;j<n;j++) s+=a[j]     
		if (s>=8)        
		f++   } 
} 
f==(NF-4)
-bash-4.1$ 
-bash-4.1$ cat path.awk 
{   
	f=0   
	for(i=4;i<=NF;i++) {     
		n=split($i,a,":")     
		s=0     
		for(j=1;j<=n;j++) s+=a[j]     
		if (s>=8)        
		f++   } 
} 
f==(NF-4)

itkamaraj · April 18, 2014, 11:42pm

$ cat test.txt
comp51820_c1_seq1    405    N    0:29:0:0:0:0    0:51:0:0:0:0    0:57:0:0:0:0    0:6:0:0:0:0    0:37:0:0:0:0
comp51820_c1_seq1    406    N    0:0:0:29:0:0    0:0:0:51:0:0    0:0:0:57:0:0    0:0:0:6:0:0    0:0:0:37:0:0
comp51820_c1_seq1    407    N    0:0:0:31:0:0    0:0:0:48:0:0    0:0:0:59:0:0    0:0:0:8:0:0    0:0:0:45:0:0
comp51820_c1_seq1    408    N    0:31:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0
comp51820_c1_seq1    409    N    0:1:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0

$ awk '{for(i=4;i<=NF;i++){n=split($i,arr,":");for(j=1;j<=n;j++)tot+=arr[j];if(tot==8){print;next}tot=0;}}' test.txt
comp51820_c1_seq1    407    N    0:0:0:31:0:0    0:0:0:48:0:0    0:0:0:59:0:0    0:0:0:8:0:0    0:0:0:45:0:0

pathunkathunk · April 18, 2014, 11:54pm

This works to exclude the correct lines, but it also excludes a line that should be accepted (since the five sums (one from each column) are all >= 8). Desired output is:

comp51820_c1_seq1	407	N	0:0:0:31:0:0	0:0:0:48:0:0	0:0:0:59:0:0	0:0:0:8:0:0	0:0:0:45:0:0
comp51820_c1_seq1	408	N	0:31:0:0:0:0	0:51:0:0:0:0	0:60:0:0:0:0	0:9:0:0:0:0	0:48:0:0:0:0

I'm trying to understand why the first line above is excluded from the output...

itkamaraj · April 19, 2014, 12:30am

sorry.. i didnt understand your question...

so, you mean.. if any of the column sum is less than 8, then it should not print the line ?

---------- Post updated at 12:30 PM ---------- Previous update was at 12:17 PM ----------

so.. something like this ?

$ awk '{for(i=4;i<=NF;i++){n=split($i,arr,":");tot=0;for(j=1;j<=n;j++)tot+=arr[j];if(tot>=8){flag+=1;}}if(flag==5)print;flag=0}' test.txt
comp51820_c1_seq1    407    N    0:0:0:31:0:0    0:0:0:48:0:0    0:0:0:59:0:0    0:0:0:8:0:0    0:0:0:45:0:0
comp51820_c1_seq1    408    N    0:31:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0

pathunkathunk · April 19, 2014, 12:33am

Yes that's right. That's a better way of saying it.
So if input is:

comp51820_c1_seq1    405    N    0:29:0:0:0:0    0:51:0:0:0:0    0:57:0:0:0:0    0:6:0:0:0:0    0:37:0:0:0:0
comp51820_c1_seq1    406    N    0:0:0:29:0:0    0:0:0:51:0:0    0:0:0:57:0:0    0:0:0:6:0:0    0:0:0:37:0:0
comp51820_c1_seq1    407    N    0:0:0:31:0:0    0:0:0:48:0:0    0:0:0:59:0:0    0:0:0:8:0:0    0:0:0:45:0:0
comp51820_c1_seq1    408    N    0:31:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0
comp51820_c1_seq1    409    N    0:1:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0

Then output is:

comp51820_c1_seq1    407    N    0:0:0:31:0:0    0:0:0:48:0:0    0:0:0:59:0:0    0:0:0:8:0:0    0:0:0:45:0:0
comp51820_c1_seq1    408    N    0:31:0:0:0:0    0:51:0:0:0:0    0:60:0:0:0:0    0:9:0:0:0:0    0:48:0:0:0:0

itkamaraj · April 19, 2014, 12:40am

read my above post

pathunkathunk · April 19, 2014, 1:01am

That does the job, thank you!

(I was reading quickly and skipped over your solution thinking it was just a quote from the previous message.)

RudiC · April 19, 2014, 6:27am

Didn't you say you need to exempt the fifth value in any column? Try (based on itkamaraj's proposal):

awk     '       {for (i=4; i<=NF; i++){
                          n = split ($i, TMP, ":")
                          tot = -TMP[5] 
                          for (j=1; j<=n; j++) tot += TMP[j]
                          if (tot<8) break
                        }
                        if (i>NF) print
                }
        ' file

vgersh99 · April 19, 2014, 1:19pm

sorry, a tiny boo-boo:

{
        f=0
        for(i=4;i<=NF;i++) {
                n=split($i,a,":")
                s=0
                for(j=1;j<n;j++) s+=a[j]
                  if (s>=8)
                    f++
        }
}
f==(NF-3)

EAGL · April 28, 2014, 2:52am

Hello Vgersh,

PLease don't mind me asking about the "n" in this split operation, I would like to learn it,

n=split($i,a,":")
i is 4,5,6,7,8 in our case (first case i used),
for $4, $5,...$8 you split and set the sub fields to array "a" ,
what would be the values of n?

n= a[1], a[2],...,a[6] ???

For $4, when i = 4, i used 3rd Line and 4th Column then is n like the following?

A[1]= A[2]=A[4]=0
A[3]=4
A[5]=3
A[6]=1

if so how you can compare j and n in

for(j=1;j<n;j++)

this part confused me, I appreciate if you could explain it a bit,

KR,
EAGLE

vgersh99 · April 29, 2014, 4:07pm

n=split($i,a,":") for field 0:29:0:0:0:0 results in

n=6  # 6 elements from string '0:29:0:0:0:0' separated by ':'
a[1]=0
a[2]=29
a[3]=0
a[4]=0
a[5]=0
a[6]=0

'j' is just a temporary iterator that goes from 1 to 'n' (the number of elements in the split string saved in array 'a').
Not sure if addressed the question though ....