Finding contiguous numbers in a list but with a gap number tolerance

Dear all,
I have an input file like this:
input

scaffold_0      10558458        10558459        1.8
scaffold_0      10558464        10558465        1.75
scaffold_0      10558467        10558468        1.8
scaffold_0      10558468        10558469        1.71428571428571
scaffold_0      10558469        10558470        1.71428571428571
scaffold_0      10558470        10558471        1.71428571428571
scaffold_0      10558471        10558472        1.59090909090909
scaffold_0      10558472        10558473        1.66666666666667
scaffold_0      10558473        10558474        1.75
scaffold_0      10558474        10558475        1.75
scaffold_0      10558475        10558476        1.7
scaffold_0      10558476        10558477        1.7
scaffold_0      10558477        10558478        1.7
scaffold_0      10558478        10558479        1.7
scaffold_0      10558479        10558480        1.7
scaffold_0      10558480        10558481        1.61904761904762
scaffold_0      10577262        10577263        1.6

I would like to retrieve the lines whose numbers in the second column are contiguous. In this example, I would get:

output

scaffold_0      10558467        10558468        1.8
scaffold_0      10558468        10558469        1.71428571428571
scaffold_0      10558469        10558470        1.71428571428571
scaffold_0      10558470        10558471        1.71428571428571
scaffold_0      10558471        10558472        1.59090909090909
scaffold_0      10558472        10558473        1.66666666666667
scaffold_0      10558473        10558474        1.75
scaffold_0      10558474        10558475        1.75
scaffold_0      10558475        10558476        1.7
scaffold_0      10558476        10558477        1.7
scaffold_0      10558477        10558478        1.7
scaffold_0      10558478        10558479        1.7
scaffold_0      10558479        10558480        1.7
scaffold_0      10558480        10558481        1.61904761904762

Note that the line "scaffold_0 10558464 10558465 1.75" is not included, because the numbers 10558465 and 10558466 are missing. However, I would like to allow a tolerance of up to five numbers, which would include that line and any others separated by a gap of up to 5.

Could anybody help me?

Cheers.

Try something like this

sort -nk2 file | awk '{if($2 == s){print P;k=1}else if(1==k){print P;k=0;}}  
  {s=$3;P=$0}'

To set the gap tolerance, use:

sort -nk2 file | awk -v tol="5" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}  
  {s=$3;P=$0}'
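
To see what the tolerance does, here is a small sketch run on five of the lines from the question. The file name sample.txt and the function name contiguous_runs are made up for this example; the awk body is the one above, only spread out and commented.

```shell
# Hypothetical sample file: five lines taken from the input in the question.
cat > sample.txt <<'EOF'
scaffold_0 10558458 10558459 1.8
scaffold_0 10558464 10558465 1.75
scaffold_0 10558467 10558468 1.8
scaffold_0 10558468 10558469 1.71428571428571
scaffold_0 10577262 10577263 1.6
EOF

# contiguous_runs TOL FILE (hypothetical name): the awk above, tolerance as a parameter.
# s = end ($3) of the previous line, P = previous line, k = 1 while inside a run.
contiguous_runs() {
    sort -nk2 "$2" | awk -v tol="$1" '
        { if (($2 - s) <= tol) { print P; k = 1 }     # close enough: emit the previous line
          else if (k == 1)     { print P; k = 0 } }   # run just ended: emit its last line
        { s = $3; P = $0 }'
}

contiguous_runs 0 sample.txt   # the two adjacent lines (10558467 and 10558468)
contiguous_runs 2 sample.txt   # additionally picks up 10558464
contiguous_runs 5 sample.txt   # additionally picks up 10558458
```

Each line is printed one step late, when the following line is read. So if the input ends in the middle of a run, the last line of that run is not printed, because nothing follows it to trigger the print.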

Hi Pamu, thanks for your script. I'm very happy with the results.
I just have one problem I would like your help with, because I'm a newbie at these things.
My input is:

input

scaffold_0    1    2    1.6
scaffold_0    2    3    1.6
scaffold_0    100    101    1.6
scaffold_0    104    105    1.6
scaffold_100    1    2    1.6
scaffold_100    1000    1001    1.6
scaffold_65    543    544    1.6
scaffold_10    1    2    1.6
scaffold_10    200    201    1.6
scaffold_10    1000    1001    1.6

Running the following script

script

#!/bin/bash
cat imput |cut -f1 |sort |uniq >scaffolds
wait
while read line
do
one_position=`grep -w -c "$line" teste`
wait
    if [ "$one_position" -ne "0" ] # -ne: not equal
      then
        cat imput |grep -w "$line" |sed 's/scaffold_/scaffold_ /g' |sort -nk2 -nk3 |sed 's/scaffold_ /scaffold_/g' | awk -v tol="1" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}
        {s=$3;P=$0}'
    fi
done < scaffolds

I get this output:

output

scaffold_0    1    2    1.6
scaffold_0    2    3    1.6
scaffold_10    1    2    1.6
scaffold_100    1    2    1.6

However, I suppose that with an awk tolerance of 1 (-v tol=1) I should get just:

desired output

scaffold_0    1    2    1.6
scaffold_0    2    3    1.6

How could I fix this script so that it produces only the desired output above? Could you also explain this awk command?

Cheers.

Using grep is not a good way to do this.

Try this instead (replace your whole script with it):

 awk '!X[$1]++{print $1}' file > scaffolds
  while read line
  do
  awk -v var="$line" '$1 == var' file| sort -nk2 | awk -v tol="1" '{if(($2-s)<=tol && s){print P;k=1}else if(1==k){print P;k=0;}else{k=0}}  
  {s=$3;P=$0}END{if(1==k){print P}}'
  done<scaffolds
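
For reference, the filter can be read as a small state machine: s holds the end coordinate ($3) of the previous line, P holds the previous line, and k is 1 while we are inside a run. The "&& s" guard is what removes the stray lines from the earlier output: in the first script s was still empty on the first line of each scaffold, so the test 1 - 0 <= 1 was true and that line was treated as the start of a run, then printed when the next line broke the "run". Below is a sketch of the same logic done in a single pass, without the while loop. The function name runs_per_scaffold and the longer variable names are mine, not part of the script above, and it has only been checked against the small input in this thread.

```shell
# runs_per_scaffold TOL FILE (hypothetical helper name)
# Prints lines whose start ($2) is within TOL of the previous line's end ($3),
# restarting the comparison whenever the scaffold name ($1) changes.
runs_per_scaffold() {
    sort -k1,1 -k2,2n "$2" | awk -v tol="$1" '
        {
            # Same scaffold as the previous line and close enough to its end?
            near = ($1 == scaf && ($2 - prev_end) <= tol)
            if (near)        { print prev_line; in_run = 1 }   # previous line belongs to a run
            else if (in_run) { print prev_line; in_run = 0 }   # previous line closed a run
            scaf = $1; prev_end = $3; prev_line = $0
        }
        END { if (in_run) print prev_line }                    # run still open at end of input
    '
}

# Example call: runs_per_scaffold 1 file
```

Comparing $1 with the previous scaffold name takes over the job of the "&& s" guard, since the first line of every scaffold can never continue a run. With a tolerance of 1 on the ten-line input above, this prints only the two scaffold_0 lines at positions 1 and 2.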

I am working on a single awk one-liner and will post it when I get time.

pamu


Hey Pamu, great script using awk!!! It worked very well!!! Could you help me one more time?
After running your script I have a file that I would like to split into separate files, keeping consecutive lines together as long as the second column of the next line minus the second column of the current line is <= 100.

input

scaffold_100    1    2    10.6
scaffold_100    2    3    4.6
scaffold_100    102    103    5.6
scaffold_100    103    104    6.6
scaffold_100    1000    1001    6.6
scaffold_100    1001    1002    9.6
scaffold_100    3000    3001    6.6
scaffold_100    3002    3003    9.6
scaffold_100    3100    3101    6.6

output one

scaffold_100    1    2    10.6
scaffold_100    2    3    4.6
scaffold_100    102    103    5.6
scaffold_100    103    104    6.6

output two

scaffold_100    1000    1001    6.6
scaffold_100    1001    1002    9.6

output three

scaffold_100    3000    3001    6.6
scaffold_100    3002    3003    9.6
scaffold_100    3100    3101    6.6

I tried a lot of things using while loops and a bunch of variables, but I did not get the correct outputs. Do you know how to do this in a smarter way, using split, csplit, awk or whatever you want?

I'm very grateful for your help and time.

Cheers.

Try:

awk -v tol="100" '{if(($2-s)>tol || NR==1){fn="out"++a}}
{s=$2;print > fn}' file

This creates three files: out1, out2 and out3.
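
A slightly more defensive variant of the same idea, as a sketch: it puts the counter in parentheses so the string concatenation is unambiguous, and closes each output file before opening the next, so that an input with many chunks does not run into the limit on open files. The function name split_by_gap and the prefix argument are assumptions for illustration only.

```shell
# split_by_gap TOL FILE PREFIX (hypothetical helper)
# Starts a new output file whenever the start ($2) is more than TOL
# above the start of the previous line.
split_by_gap() {
    awk -v tol="$1" -v prefix="$3" '
        NR == 1 || ($2 - prev) > tol {
            if (fn != "") close(fn)   # finished with the previous chunk
            fn = prefix (++n)         # PREFIX1, PREFIX2, ...
        }
        { prev = $2; print > fn }
    ' "$2"
}

# Example call: split_by_gap 100 file out
```

Like the command above, this assumes the input is already sorted on the second column and contains a single scaffold.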