valente
November 7, 2012, 11:27pm
1
Dear all,
I have an input file like this:
input
scaffold_0 10558458 10558459 1.8
scaffold_0 10558464 10558465 1.75
scaffold_0 10558467 10558468 1.8
scaffold_0 10558468 10558469 1.71428571428571
scaffold_0 10558469 10558470 1.71428571428571
scaffold_0 10558470 10558471 1.71428571428571
scaffold_0 10558471 10558472 1.59090909090909
scaffold_0 10558472 10558473 1.66666666666667
scaffold_0 10558473 10558474 1.75
scaffold_0 10558474 10558475 1.75
scaffold_0 10558475 10558476 1.7
scaffold_0 10558476 10558477 1.7
scaffold_0 10558477 10558478 1.7
scaffold_0 10558478 10558479 1.7
scaffold_0 10558479 10558480 1.7
scaffold_0 10558480 10558481 1.61904761904762
scaffold_0 10577262 10577263 1.6
I would like to retrieve the lines whose numbers in the second column are contiguous. In this example, I would get:
output
scaffold_0 10558467 10558468 1.8
scaffold_0 10558468 10558469 1.71428571428571
scaffold_0 10558469 10558470 1.71428571428571
scaffold_0 10558470 10558471 1.71428571428571
scaffold_0 10558471 10558472 1.59090909090909
scaffold_0 10558472 10558473 1.66666666666667
scaffold_0 10558473 10558474 1.75
scaffold_0 10558474 10558475 1.75
scaffold_0 10558475 10558476 1.7
scaffold_0 10558476 10558477 1.7
scaffold_0 10558477 10558478 1.7
scaffold_0 10558478 10558479 1.7
scaffold_0 10558479 10558480 1.7
scaffold_0 10558480 10558481 1.61904761904762
Note that the line "scaffold_0 10558464 10558465 1.75" is not included, because the numbers 10558465 and 10558466 are missing. However, I would like to allow a tolerance of up to five numbers, which would include that line and any others separated by a gap of up to 5 numbers.
Could anybody help me?
Cheers.
pamu
November 8, 2012, 2:10am
2
Try something like this:
sort -nk2 file | awk '{if($2 == s){print P;k=1}else if(1==k){print P;k=0;}}
{s=$3;P=$0}'
To set a tolerance for the gap, use:
sort -nk2 file | awk -v tol="5" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}
{s=$3;P=$0}'
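To see what the tolerance is actually compared against, it can help to print the gap for each line: the start (column 2) minus the end of the previous line (column 3). The sketch below is an illustration only. The names `show_gaps` and `prev_end` are invented here, and it assumes the same whitespace-separated layout as the sample input.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print each line followed by the
# gap between its start (column 2) and the previous line's end (column 3).
show_gaps() {
    sort -k2,2n "$1" | awk '
        NR == 1 { print $0, "-" }            # first line has no predecessor
        NR > 1  { print $0, $2 - prev_end }  # this is the value compared with tol
        { prev_end = $3 }
    '
}

# Small sample taken from the input in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
scaffold_0 10558458 10558459 1.8
scaffold_0 10558464 10558465 1.75
scaffold_0 10558467 10558468 1.8
scaffold_0 10558468 10558469 1.71428571428571
EOF

show_gaps "$sample"
```

On this sample the added column reads -, 5, 2 and 0, so a tolerance of 5 would also keep the first two lines.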
Hi Pamu, thanks for your script. I'm very happy with the results.
I just have one problem that I would like your help with, because I'm a newbie at these things.
My input file (named imput in the script below) is:
input
scaffold_0 1 2 1.6
scaffold_0 2 3 1.6
scaffold_0 100 101 1.6
scaffold_0 104 105 1.6
scaffold_100 1 2 1.6
scaffold_100 1000 1001 1.6
scaffold_65 543 544 1.6
scaffold_10 1 2 1.6
scaffold_10 200 201 1.6
scaffold_10 1000 1001 1.6
Running the following script:
script
#!/bin/bash
cat imput |cut -f1 |sort |uniq >scaffolds
wait
while read line
do
one_position=`grep -w -c "$line" teste`
wait
if [ "$one_position" -ne "0" ] # -ne: not equal
then
cat imput |grep -w "$line" |sed 's/scaffold_/scaffold_ /g' |sort -nk2 -nk3 |sed 's/scaffold_ /scaffold_/g' | awk -v tol="1" '{if(($2-s)<=tol){print P;k=1}else if(1==k){print P;k=0;}}
{s=$3;P=$0}'
fi
done < scaffolds
I get this output:
output
scaffold_0 1 2 1.6
scaffold_0 2 3 1.6
scaffold_10 1 2 1.6
scaffold_100 1 2 1.6
However, I suppose that with an awk tolerance of 1 (-v tol=1) I should get just:
desired output
scaffold_0 1 2 1.6
scaffold_0 2 3 1.6
How could I fix this script so that I get only the desired output above? Could you also explain this awk command?
Cheers.
pamu
November 9, 2012, 1:16am
4
Using grep is not a good way to do this job.
Try this instead; it replaces your whole script :)
awk '!X[$1]++{print $1}' file > scaffolds
while read line
do
awk -v var="$line" '$1 == var' file| sort -nk2 | awk -v tol="1" '{if(($2-s)<=tol && s){print P;k=1}else if(1==k){print P;k=0;}else{k=0}}
{s=$3;P=$0}END{if(1==k){print P}}'
done<scaffolds
I am working on a single awk one-liner and will post it when I get time.
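A single-pass version could look roughly like the sketch below. It is not the one-liner promised above, just one possible way to do the per-scaffold filtering without the shell loop, with comments explaining each step. The function name `filter_runs` and the awk variable names are made up for illustration, and it assumes whitespace-separated columns as in the samples.

```shell
#!/bin/sh
# Hypothetical single-pass filter: keep only lines that are within TOL of a
# neighbouring line on the same scaffold.
filter_runs() {   # usage: filter_runs FILE TOL
    LC_ALL=C sort -k1,1 -k2,2n "$1" | awk -v tol="$2" '
        {
            # Is this line within tol of the previous line on the same scaffold?
            linked = (NR > 1 && $1 == name && ($2 - prev_end) <= tol)
            # The previous line belongs to a run if it was linked to its own
            # predecessor (in_run) or is linked to the current line.
            if (NR > 1 && (in_run || linked)) print prev
            in_run = linked
            name = $1; prev_end = $3; prev = $0
        }
        END { if (in_run) print prev }   # flush the last line of a final run
    '
}

# Sample input from the question.
sample2=$(mktemp)
cat > "$sample2" <<'EOF'
scaffold_0 1 2 1.6
scaffold_0 2 3 1.6
scaffold_0 100 101 1.6
scaffold_0 104 105 1.6
scaffold_100 1 2 1.6
scaffold_100 1000 1001 1.6
scaffold_65 543 544 1.6
scaffold_10 1 2 1.6
scaffold_10 200 201 1.6
scaffold_10 1000 1001 1.6
EOF

filter_runs "$sample2" 1
```

Sorting on the scaffold name first keeps each scaffold's lines together, so comparing `$1` with the previous name is enough to stop a run from crossing into the next scaffold.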
pamu
valente
November 9, 2012, 10:41pm
5
Hey Pamu, great script using awk!!! It worked very well!!! Could you help me one more time?
After running your script I have a file that I would like to split. Consecutive lines should stay in the same output file as long as the second column of the next line minus the second column of the current line is <= 100.
input
scaffold_100 1 2 10.6
scaffold_100 2 3 4.6
scaffold_100 102 103 5.6
scaffold_100 103 104 6.6
scaffold_100 1000 1001 6.6
scaffold_100 1001 1002 9.6
scaffold_100 3000 3001 6.6
scaffold_100 3002 3003 9.6
scaffold_100 3100 3101 6.6
output one
scaffold_100 1 2 10.6
scaffold_100 2 3 4.6
scaffold_100 102 103 5.6
scaffold_100 103 104 6.6
output two
scaffold_100 1000 1001 6.6
scaffold_100 1001 1002 9.6
output three
scaffold_100 3000 3001 6.6
scaffold_100 3002 3003 9.6
scaffold_100 3100 3101 6.6
I tried a lot of things using while loops and a bunch of variables, but I did not get the correct outputs. Do you know how to do this in a smarter way, using split, csplit, awk or whatever you want?
I'm very grateful for your help and time.
Cheers.
pamu
November 10, 2012, 6:14am
6
Try:
awk -v tol="100" '{if(($2-s)>tol || NR==1){fn="out"++a}}
{s=$2;print > fn}' file
It will create three files: out1, out2 and out3.
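If the file holds more than one scaffold, or produces many pieces, a slightly longer variant may be safer. The sketch below is an assumption-laden illustration, not part of the answer above: `split_runs` and its `prefix` argument are hypothetical names, it also starts a new file whenever the scaffold name changes, and it closes each output file before opening the next.

```shell
#!/bin/sh
# Hypothetical splitter: start a new output file when the gap between
# consecutive starts (column 2) exceeds TOL or the scaffold name changes.
split_runs() {   # usage: split_runs FILE TOL PREFIX
    awk -v tol="$2" -v prefix="$3" '
        NR == 1 || $1 != name || ($2 - prev_start) > tol {
            if (fn != "") close(fn)   # avoid running out of open files
            fn = prefix (++n)         # prefix1, prefix2, ...
        }
        { print > fn; name = $1; prev_start = $2 }
    ' "$1"
}

# Sample input from the question, written to a scratch directory.
dir=$(mktemp -d)
cat > "$dir/in.txt" <<'EOF'
scaffold_100 1 2 10.6
scaffold_100 2 3 4.6
scaffold_100 102 103 5.6
scaffold_100 103 104 6.6
scaffold_100 1000 1001 6.6
scaffold_100 1001 1002 9.6
scaffold_100 3000 3001 6.6
scaffold_100 3002 3003 9.6
scaffold_100 3100 3101 6.6
EOF

split_runs "$dir/in.txt" 100 "$dir/out"
wc -l "$dir"/out*
```

The input is assumed to be already sorted by scaffold and start position, as it is after the filtering step earlier in the thread.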