awk calculation wrong field output

cmccabe · May 19, 2016, 4:54pm

The awk below is close but I can't seem to fix it to produce the desired output. Thank you :).

current awk with output

awk '{c1[$3]++; c2[$3]+=($2)}
     END{for (e in c1) print e, c1[e], c2[e]}' input
EFCAB5 2 50
USH2A 2 19

desired output ($1 and $3 values from above)

EFCAB5 50
USH2A 19

Corona688 · May 19, 2016, 5:27pm

What's your input?

cmccabe · May 19, 2016, 5:48pm

I apologize for that, it is just a text file in the below format:

4 fields ($1=location, $2=count, $3=id,$4=length)

chr1:123-456 2 EFCAB5 25
chr1:124-457 5 EFCAB5 25
chr2:1234-5678 3 USH2A 15
chr2:1235-5679 2 USH2A 4

The two fields in the desired output are the id ($3) and the sum of the lengths ($4) over all lines with that id. Thank you :).

Corona688 · May 19, 2016, 6:36pm

awk '{ A[$3] += $4 } END { for(X in A) print X, A[X] }' inputfile

Aia · May 19, 2016, 6:41pm

Would that do?

$ awk '{A[$3] += $4} END{for (i in A) print i, A[i]}' cmccabe.fie
EFCAB5 50
USH2A 19

cmccabe · May 19, 2016, 6:51pm

So the id is read into the A array and a loop captures the matching lengths. I am not sure what X or i is, though. Sorry, I'm a scientist trying to learn. Thank you :).

Corona688 · May 19, 2016, 7:03pm

'for(X in A)' is a loop over every array index X in array A. The loop doesn't capture anything; the totals are built up as each line is read, and the loop just prints them after all the lines have been read.
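
To make the two phases visible, here is a rough sketch that prints the running total after each line and only loops over the array at the end. The sample rows are the ones posted above; the trace message and the counter n are only there for illustration and are not part of the original one-liner.

```shell
# Sample rows copied from the thread: location, count, id, length
trace=$(printf '%s\n' \
  'chr1:123-456 2 EFCAB5 25' \
  'chr1:124-457 5 EFCAB5 25' \
  'chr2:1234-5678 3 USH2A 15' \
  'chr2:1235-5679 2 USH2A 4' |
awk '
  # Main block: runs once per input line and adds the length to the total for that id
  { A[$3] += $4; print "after line " NR ": A[" $3 "] = " A[$3] }
  # END block: runs once after the last line; X takes each index of A exactly once
  END { for (X in A) n++; print n " distinct ids" }
')
printf '%s\n' "$trace"
```

The first four lines of output show the totals growing as the file is read; the last line shows that the loop visited one index per id.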

Aia · May 19, 2016, 8:29pm

A is an associative array, meaning that the index is not an ordered sequence of numbers but a string. To get a value back you need to know its "index". for(i in A) is the way that AWK iterates over the A array, setting i to one "index" at each iteration. The disadvantage is that the order may not match the order in which the ids appeared in the file, nor is it guaranteed to be the same each time you run the program.
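
If the order of the output matters, one common workaround is to pipe the result through sort. This is only a sketch: the file name input.txt is made up for the example, and sorting by id is an assumption about what order is wanted.

```shell
# Sample input copied from the thread: location, count, id, length
printf '%s\n' \
  'chr1:123-456 2 EFCAB5 25' \
  'chr1:124-457 5 EFCAB5 25' \
  'chr2:1234-5678 3 USH2A 15' \
  'chr2:1235-5679 2 USH2A 4' > input.txt

# Sum field 4 per id (field 3), then sort so the order is repeatable
result=$(awk '{ A[$3] += $4 } END { for (i in A) print i, A[i] }' input.txt | sort)
printf '%s\n' "$result"
```

Whatever order awk walks the array in, sort puts the lines in the same order on every run.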

cmccabe · May 21, 2016, 7:18am

Thank you both for your help and explanations, I really appreciate it