I am trying to output a tab-delimited result that uses the data from a tab-delimited file to combine and subtract specific lines.
If $4 matches across lines, the first $6 value in a sequential run is added to $2, unless that value is 1, in which case the original $2 is used (as in line 1). This is the new, adjusted $2 value.
The last $6 value in the same sequential run is added to $2, and this is the new, adjusted $3 value.
The new $2 and $3 values are combined with $1 in the format $1:$2-$3, and the $5 value is printed on the same line.
The awk command below works great as long as the $4 values are unique, but that is not always the case. I cannot seem to add a condition that checks $6 and starts a new output line whenever the numbers stop being sequential (1 2 is sequential, but there is a break before 92 93 94).
Maybe there is another way, but hopefully this helps. Thank you.
chrX 110956442 110956535 chrX:110956442-110956535 ALG13 1 19
chrX 110956442 110956535 chrX:110956442-110956535 ALG13 2 19
chrX 110956442 110956535 chrX:110956442-110956535 ALG13 92 18
chrX 110956442 110956535 chrX:110956442-110956535 ALG13 93 18
chrX 110956442 110956535 chrX:110956442-110956535 ALG13 94 18
chrX 110961329 110961512 chrX:110961329-110961512 ALG13 2 1
chrX 110961329 110961512 chrX:110961329-110961512 ALG13 3 1
chr15 25031028 25031925 chrX:25031028-25031925 ARX 651 3
desired output
chrX:110956442-110956444 ALG13
chrX:110956534-110956536 ALG13
chrX:110961331-110961332 ALG13
chr15:25031679-25031679 ARX
awk
awk 'FNR==NR {S[$4]++; next}    # pass 1: count the lines for each $4
($4 in S) {                     # pass 2: act only on the first line of each $4
  if (S[$4]>1)    print $1 OFS $2 OFS $2+S[$4] OFS $5
  else if ($6==1) print $1 OFS $2 OFS $2 OFS $5
  else            print $1 OFS $2+$6 OFS $2+$6 OFS $5
  delete S[$4] }' file file
current output
chrX 110956442 110956449 ALG13
chrX 110961329 110961334 ALG13
chr15 25031028 25031031 ARX
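
For reference, the rules described above could be sketched in a single pass as follows. This is only a rough sketch under a few assumptions, not a tested solution: the function name merge_runs and the helper emit are made up for illustration, rows with the same $4 are assumed to be adjacent and already sorted by $6, and a run that consists of a single line with $6 equal to 1 is assumed to end at $2+1 (the command above prints $2 for both ends in that case, so this may need adjusting).

```shell
# merge_runs (hypothetical name): reads the input rows on stdin and prints
# one "$1:start-end<TAB>$5" line per run of consecutive $6 values within one $4.
merge_runs() {
  awk 'BEGIN { OFS = "\t" }
  # emit (hypothetical helper): print the run collected so far, if any.
  function emit() {
    if (n) print chr ":" (first == 1 ? base : base + first) "-" (base + last), gene
  }
  {
    # A new run starts when $4 changes or $6 is not the previous $6 plus 1.
    if (n == 0 || $4 != key || $6 != last + 1) {
      emit()
      key = $4; chr = $1; base = $2; gene = $5; first = $6; n = 0
    }
    last = $6; n++
  }
  END { emit() }'
}
```

It would be called as merge_runs < file. The default field separator is left in place so that both tab- and space-separated input are split the same way; only the output separator is set to a tab.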