awk to adjust coordinates in field based on sequential numbers in another field

cmccabe · January 27, 2017, 11:58am

I am trying to output a tab-delimited result that uses the data from a tab-delimited file to combine and subtract specific lines.

For lines that share the same $4, the first $6 value of each sequential run is added to $2, unless that value is 1, in which case the original $2 is used (as in line 1). This is the new, adjusted $2 value.

The last $6 value of the same sequential run is added to $2, and this is the new, adjusted $3 value.

The new $2 and $3 values are combined with $1 in the format $1:$2-$3, and the $5 value is printed after it on the same line.

The awk command below works great as long as the $4 values are unique, but that is not always the case. I cannot seem to add a condition that checks $6 and starts a new output line whenever the numbers are not sequential (1 2 is sequential, but then there is a break before 92 93 94).

Maybe there is another way, but hopefully this helps. Thank you.

chrX    110956442   110956535   chrX:110956442-110956535    ALG13   1   19
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   2   19
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   92  18
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   93  18
chrX    110956442   110956535   chrX:110956442-110956535    ALG13   94  18
chrX    110961329   110961512   chrX:110961329-110961512    ALG13   2   1
chrX    110961329   110961512   chrX:110961329-110961512    ALG13   3   1
chr15    25031028    25031925    chrX:25031028-25031925  ARX 651 3

desired output

chrX:110956442-110956444    ALG13
chrX:110956534-110956536    ALG13
chrX:110961331-110961332    ALG13
chr15:25031679-25031679  ARX
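
As a sanity check on the rules above, the arithmetic for each run can be written out by hand. This is only a sketch: the $2 values and the first and last $6 of each run are copied from the sample input, and the function name `range` is made up for illustration.

```shell
# range START FIRST LAST: apply the rules from the post to one run of
# sequential $6 values (START is $2, FIRST and LAST are the ends of the run).
range() {
  awk -v s="$1" -v f="$2" -v l="$3" 'BEGIN {
    # a first value of 1 leaves $2 unchanged; otherwise it is added to $2
    print (f == 1 ? s : s + f) "-" (s + l)
  }'
}

range 110956442 1 2      # run 1 2       -> 110956442-110956444
range 110956442 92 94    # run 92 93 94  -> 110956534-110956536
range 110961329 2 3      # run 2 3       -> 110961331-110961332
range 25031028 651 651   # single 651    -> 25031679-25031679
```

These four ranges are the ones listed in the desired output.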

awk

awk 'FNR==NR {S[$4]++;next} ($4 in S){if(S[$4]>1){print $1 OFS $2 OFS $2+S[$4] OFS $5;} 
else {if($6==1){print $1 OFS $2 OFS $2 OFS $5}
else {print $1 OFS $2+$6 OFS $2+$6 OFS $5}};delete S[$4]}' file file

current output

chrX 110956442 110956449 ALG13
chrX 110961329 110961334 ALG13
chr15 25031028 25031031 ARX

MadeInGermany · January 27, 2017, 4:59pm

If I run your awk on your input then I do not get your output.
But maybe I have understood your description.
If your file is sorted by $4 and $6 (so $6 sequences are in adjacent lines),
then the following can do it:

awk '
# print from stored values
function prt(){
  print p1 ":" (p6start==1 ? p2 : p2+p6start) "-" p2+p6, p5
}
($4!=p4 || $6!=p6+1) {
# new sequence, print the previous sequence
  if (NR>1) prt()
  p6start=$6  
}
{
# store the values that we need later
  p1=$1
  p2=$2
  p4=$4
  p5=$5
  p6=$6
}
# print the last sequence (skipped if the input was empty)
END { if (NR) prt() }
' file

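One difficulty is the "late" end-of-sequence detection: a sequence is only known to have ended once the next line has been read. This is solved by storing the previous values and printing them from a function, which is also called in an END section for the last sequence.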
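
For anyone who wants to try this on the sample from the first post, here is a minimal, self-contained sketch. It assumes a POSIX shell and awk; the file name `sample.tsv` and the function name `collapse_runs` are made up for illustration. It wraps the same logic with explicit tab separators (`-F'\t'` and `OFS`), since the question asks for tab-delimited output, and it skips the final print on empty input.

```shell
#!/bin/sh
# Recreate the sample input from the question as a tab-delimited file.
# printf reuses the format string for every group of seven arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chrX  110956442 110956535 chrX:110956442-110956535 ALG13 1   19 \
  chrX  110956442 110956535 chrX:110956442-110956535 ALG13 2   19 \
  chrX  110956442 110956535 chrX:110956442-110956535 ALG13 92  18 \
  chrX  110956442 110956535 chrX:110956442-110956535 ALG13 93  18 \
  chrX  110956442 110956535 chrX:110956442-110956535 ALG13 94  18 \
  chrX  110961329 110961512 chrX:110961329-110961512 ALG13 2   1 \
  chrX  110961329 110961512 chrX:110961329-110961512 ALG13 3   1 \
  chr15 25031028  25031925  chrX:25031028-25031925   ARX   651 3 \
  > sample.tsv

# collapse_runs FILE: print one line per run of sequential $6 values within
# the same $4. Assumes FILE is already sorted by $4 and then $6.
collapse_runs() {
  awk -F'\t' -v OFS='\t' '
    # print the run that was stored from the previous lines
    function prt() {
      print p1 ":" (p6start == 1 ? p2 : p2 + p6start) "-" p2 + p6, p5
    }
    # a different $4 or a gap in $6 starts a new run
    $4 != p4 || $6 != p6 + 1 {
      if (NR > 1) prt()
      p6start = $6
    }
    # remember the current line for the next comparison
    { p1 = $1; p2 = $2; p4 = $4; p5 = $5; p6 = $6 }
    # flush the last run, unless there was no input at all
    END { if (NR) prt() }
  ' "$1"
}

collapse_runs sample.tsv
```

Run as is, this prints the four lines from the desired output in the first post, with a tab between the range and the gene name.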

Scrutinizer · January 28, 2017, 3:11am

Are you sure the output should not be:

chrX:110956442-110956444    ALG13
chrX:110956532-110956535    ALG13
chrX:110961330-110961332    ALG13
chr15:25031678-25031678  ARX

That would make more sense to me, but maybe I'm wrong.

cmccabe · January 30, 2017, 7:39am

Thank you very much for your help and for catching the error in the output; this is why the computer does the math :).