awk calculation wrong field output

The awk below is close but I can't seem to fix it to produce the desired output. Thank you :).

current awk with output

awk '{c1[$3]++; c2[$3]+=($2)}                                                                 
     END{for (e in c1) print e, c1[e], c2[e]}' input
EFCAB5 2 50
USH2A 2 19

desired output ($1 and $3 values from above)

EFCAB5 50
USH2A 19 

What's your input?

I apologize for that, it is just a text file in the below format:

4 fields ($1=location, $2=count, $3=id, $4=length)

chr1:123-456 2 EFCAB5 25
chr1:124-457 5 EFCAB5 25
chr2:1234-5678 3 USH2A 15
chr2:1235-5679 2 USH2A 4

The fields wanted in the desired output are the id ($3) and the sum of the matching lengths ($4). Thank you :).

awk '{ A[$3] += $4 } END { for(X in A) print X, A[X] }' inputfile

Would that do?

$ awk '{A[$3] += $4} END{for (i in A) print i, A[i]}' cmccabe.fie
EFCAB5 50
USH2A 19

So the id is read into the A array and a loop captures the matching lengths? I am not sure what X or i is. Sorry, scientist trying to learn. Thank you :).

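'for(X in A)' is a loop over every index X in array A. X (or i in the other version) is just a variable name that holds the current index, here an id such as EFCAB5. The loop isn't for capturing anything: the sums are built up as each line is read, and the loop only prints them in END, after all lines have been read.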
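
To see that the summing happens line by line rather than in the loop, here is a small sketch that prints the running total as each line is read. It uses the sample data from the question fed on stdin; the variable names input and trace are only for this illustration.

```shell
# Sample data from the question, kept in a variable so no file is needed.
input='chr1:123-456 2 EFCAB5 25
chr1:124-457 5 EFCAB5 25
chr2:1234-5678 3 USH2A 15
chr2:1235-5679 2 USH2A 4'

# For every input line: add $4 to the total stored under the id in $3,
# then show what that array element holds so far.
trace=$(printf '%s\n' "$input" | awk '
    { A[$3] += $4
      print "after line " NR ": A[" $3 "] = " A[$3] }')

printf '%s\n' "$trace"
```

By the time END runs, A["EFCAB5"] is 50 and A["USH2A"] is 19, so the for loop only has to print them.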


A is an associative array, meaning that its index is not an ordered sequence of numbers but a string. To get a value back you need to know its "index". for(i in A) is the way awk iterates over the array A, setting i to one "index" on each iteration. The disadvantage is that the order may not be the order in which the ids were read from the file, nor is it guaranteed to be the same order each time you run the program.
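
If the output order matters, one simple option (not the only one) is to pipe the result through sort, so the order no longer depends on how awk walks the array. A minimal sketch, again with the sample data from the question on stdin; input and result are names chosen just for this example.

```shell
# Sample data from the question, kept in a variable so no file is needed.
input='chr1:123-456 2 EFCAB5 25
chr1:124-457 5 EFCAB5 25
chr2:1234-5678 3 USH2A 15
chr2:1235-5679 2 USH2A 4'

# Sum $4 per id in $3, then sort the lines by id so the order is stable
# regardless of the order in which for (i in A) visits the indices.
result=$(printf '%s\n' "$input" |
    awk '{ A[$3] += $4 } END { for (i in A) print i, A[i] }' |
    sort)

printf '%s\n' "$result"
```

With a real file, replace the printf with the file name as the last argument to awk, as in the commands above.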


Thank you both for your help and explanations, I really appreciate it :)