awk split and awk calculation in the same command

I am trying to run the awk below. When I split the input and then run a second awk that uses the split output as its input to perform a calculation, there are no issues. When I try to combine the two into one command, the output is not correct. Is the split not working, or did I do something wrong? Thank you :).

input

 
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    1    15
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    2    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    3    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    4    14
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    5    17

the split command

awk '{split($5,a,"-"); print $1,$2,$3,$4,a[1]}' input > split

split (splits $5 on the - and prints $1,$2,$3,$4 and the first piece, a[1])

chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN

If I use that file (split) as the input to the below awk, the output is correct.

output ($5, the count of lines that have the same $5, and the sum of $3-$2)

AGRN 5 1100
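The second awk is not shown in the post. Judging from the combined command further down, it presumably looks something like the sketch below; this is an assumption, not the poster's actual command. The temporary directory is only there so the example runs on its own.

```shell
# Assumed reconstruction of the second step: count lines per $5 and sum $3-$2.
tmp=$(mktemp -d)

# The "split" file exactly as shown above.
cat > "$tmp/split" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
EOF

# c1 counts the lines for each $5, c2 adds up the interval lengths.
result=$(awk '{c1[$5]++; c2[$5]+=($3-$2)}
     END{for (e in c1) print e, c1[e], c2[e]}' "$tmp/split")
echo "$result"
```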

If I try to perform the split and run the calculation in the same awk, I get the below output:

awk '{split($5,a,"-"); print $1,$2,$3,$4,a[1]} {c1[a1]++; c2[a1]+=($3-$2)}
     END{for (e in c1) print e, c1[e], c2[e]}' split

output

chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
 5 1100 

use a[1] instead of a1


Try c1[a[1]] instead of c1[a1]
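A small sketch of why the key came out empty: a1 is just an unset scalar variable, so awk uses the empty string as the array index, while a[1] is the first piece produced by split(). The brackets in the print are added here only to make the empty key visible.

```shell
# One sample row from the original input.
line='chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15'

# a1 is never assigned, so c[a1] is really c[""].
wrong=$(printf '%s\n' "$line" |
  awk '{split($5,a,"-"); c[a1]++} END{for (e in c) print "[" e "]", c[e]}')

# a[1] holds the text of $5 before the first "-".
right=$(printf '%s\n' "$line" |
  awk '{split($5,a,"-"); c[a[1]]++} END{for (e in c) print "[" e "]", c[e]}')

echo "$wrong"   # [] 1
echo "$right"   # [AGRN] 1
```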


Thank you both :).

The below awk is producing different results now that some lines of my input have changed from the old input.
I cannot seem to fix this and need some expert help :). Thank you :).

awk

awk '{split($5,a,"-"); print $1,$2,$3,$4,a[1]} {c1[a[1]]++; c2[a[1]]+=($3-$2)}
     END{for (e in c1) print e, c1[e], c2[e]}' input 

old input

chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    1    15
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    2    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    3    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    4    14
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    5    17 

output -- this is correct

chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
chr1 955543 955763 chr1:955543-955763 AGRN
AGRN 5 1100 

Using the awk on the attached file:

out.txt (attachment) -- does not count $5 or sum $3-$2 as it did before

chr7 121738788 121738930 chr7:121738788-121738930 AASS
chr7 121738788 121738930 chr7:121738788-121738930 AASS
chr7 121738788 121738930 chr7:121738788-121738930 AASS
chr7 121741414 121741502 chr7:121741414-121741502 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS
chr7 121769404 121769601 chr7:121769404-121769601 AASS

Seems to work for me - when applied to a small subset of your input.

.
.
.
chr17 67160204 67160305 chr17:67160204-67160305 ABCA10
ABCA1 1 144
AASS 19 3469
ABCA10 14 1414

On a small subset the awk seems to work, but with the actual input (attached, which is much larger) the awk prints the split lines first and then the calculation below them. Do I need to perform the split separately so that it does not appear in the output? Or can I print the results of the split to one file and then use that file to output the calculations? Thank you :).

. . . 
chr17 67160204 67160305 chr17:67160204-67160305 ABCA10  -- up to here is split 
ABCA1 1 144 
AASS 19 3469 
ABCA10 14 1414

Works for me as well with your attached input file; the result is exactly equivalent to your correct output in post#5. The appended calculated list has 1387 entries, such as:

.
.
.
EMX1 20 10800
DYNC1H1 7 756
TPRN 673 1093315
SSPN 42 12558
PCDHA1;PCDHA10;PCDHA11;PCDHA12;PCDHA13;PCDHA2;PCDHA3;PCDHA4;PCDHA5;PCDHA6;PCDHA7;PCDHA8;PCDHA9;PCDHAC1;PCDHAC2 3 993
NR2F1 183 88389
FOXO1 180 117000
EVC2 26 8970
.
.
.

So - what exactly don't you like?

What I call the split is:

.
.
.
chr17 67160204 67160305 chr17:67160204-67160305 ABCA10 

and the calculation using the fields in the split

ABCA10 14 1414

Can the split portion be put in another file so that all that is output is the calculation? Thank you :).
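One possible way to do this, as a sketch: awk can redirect an individual print to a file with print ... > file, so the split lines go to one file and only the END summary reaches standard output. The names split.txt and summary.txt, the variable out, and the temporary directory are made up for this illustration.

```shell
tmp=$(mktemp -d)

# Sample rows copied from the old input.
cat > "$tmp/input" <<'EOF'
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    1    15
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    2    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    3    16
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    4    14
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    5    17
EOF

# The per-line print is redirected to the file named in "out";
# the END block still prints to standard output, captured here in summary.txt.
awk -v out="$tmp/split.txt" '
     {split($5,a,"-"); print $1,$2,$3,$4,a[1] > out}
     {c1[a[1]]++; c2[a[1]]+=($3-$2)}
     END{for (e in c1) print e, c1[e], c2[e]}' "$tmp/input" > "$tmp/summary.txt"

cat "$tmp/summary.txt"
```

If the split lines are not needed at all, dropping the print from the first block should leave only the calculation in the output.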