raj_k
January 7, 2014, 5:57am
1
I have a file like this:
chr start end
chr15 99874874 99875874 chr15 99875173 99876173 aa1
chr15 99874923 99875923 chr15 99875173 99876173 aa1
chr15 99874962 99875962 chr15 99875173 99876173 aa1
chr1 10834962 10835962 chr3 5674767 5675545 ahc1
What I want to do is this: for the same chromosome (column 1), if the start position falls within 1000 bp of the next entries and columns 4, 5, 6 and 7 are the same, I want to remove those entries and keep only the first entry.
For example, here:
chr15 99874874 99875874 chr15 99875173 99876173 aa1
chr15 99874923 99875923 chr15 99875173 99876173 aa1
chr15 99874962 99875962 chr15 99875173 99876173 aa1
the start position (second column) varies by only a few bp and columns 4, 5, 6 and 7 are the same, so I want to retain only:
chr15 99874874 99875874 chr15 99875173 99876173 aa1
chr1 10834962 10835962 chr3 5674767 5675545 ahc1
Try (not tested):
$ awk 'p && $2-p<=1000 && !x[$4$5$6$7]++{print last}{p=$2;last=$0}' file
raj_k
January 7, 2014, 12:30pm
3
Hi,
It's giving output something like this:
chr15 99874874 99875874 chr15 99875173 99876173 aa1
chr15 99874962 99875962 chr15 99875173 99876173 aa1
But this is not the desired output that I described.
This one only checks between adjacent lines:
awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}' file
awk '
NR==1{
    next
}
function out(){
    if(p && $2-p<=1000 && c==0)
        print last
}
{
    out()
}
{
    last=$0
    c=x[$4$5$6$7]++
    p=$2
}
END{
    out()
}
' file
chr15 99874874 99875874 chr15 99875173 99876173 aa1
chr1 10834962 10835962 chr3 5674767 5675545 ahc1
Akshay, I understood the requirement to be that columns $1 and $4, $5, $6, $7 must be equal?
At least the comparison string should be field-separated, x[$4 FS $5 FS $6 FS $7], so that e.g. "ab cd ef gh" does not match "a bc de fg h".
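A quick way to see the collision, as a minimal sketch: the two records below are made up for illustration (they are not from the posted file), and for simplicity the key is built from fields 1 to 4 rather than 4 to 7. Glued together without a separator both records give the same key, while the FS-joined keys stay different.

```shell
# Two made-up records whose four fields differ but concatenate to the same string.
keys=$(printf '%s\n' 'chr1 1500 2500 x' 'chr11 500 2500 x' |
awk '{
    plain = $1 $2 $3 $4            # glued together: field boundaries are lost
    sep   = $1 FS $2 FS $3 FS $4   # joined with FS: field boundaries are kept
    print plain " | " sep
}')
printf '%s\n' "$keys"
```

Both lines show the same string to the left of the bar and different strings to the right, so an array indexed by the plain key would treat the two records as one.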
raj_k
January 9, 2014, 11:44am
7
Hi Akshay,
If I use your code on this data set:
chr11 87578121 87579121 chr11 87578115 87579115 ID1
chr11 87578193 87579193 chr11 87578115 87579115 ID1
chr11 87578208 87579208 chr11 87578115 87579115 ID1
chr11 75966214 75967214 chr11 75966112 75967112 ID2
chr11 75966257 75967257 chr11 75966112 75967112 ID2
chr7 122066072 122067072 chr7 122067871 122068871 ID3
chr7 122067133 122068133 chr7 122067871 122068871 ID3
chr7 122067156 122068156 chr7 122067871 122068871 Id3
chr15 66968646 66969646 chr15 67413704 67414704 ID4
chr15 66968646 66969646 chr15 67413872 67414872 ID4
The output is as follows:
chr11 87578193 87579193 chr11 87578115 87579115 ID1
chr11 75966214 75967214 chr11 75966112 75967112 ID2
chr15 66968646 66969646 chr15 67413704 67414704 ID4
chr15 66968646 66969646 chr15 67413872 67414872 ID4
It is supposed to be
chr11 87578193 87579193 chr11 87578115 87579115 ID1
chr11 75966214 75967214 chr11 75966112 75967112 ID2
chr7 122066072 122067072 chr7 122067871 122068871 ID3
chr7 122067133 122068133 chr7 122067871 122068871 ID3
chr15 66968646 66969646 chr15 67413704 67414704 ID4
chr15 66968646 66969646 chr15 67413872 67414872 ID4
I don't know why it is skipping those lines, which also fit the condition.
---------- Post updated at 11:44 AM ---------- Previous update was at 09:21 AM ----------
@madeingermany
I have modified NR>1 to NR>=1, because every time it produces output it does not consider the first 3 lines in my example.
You are right.
(NR>=1) is always true, so you can simplify to:
awk '{x=$1 FS $4 FS $5 FS $6 FS $7} !($2-p2<=1000 && x==px) {print} {px=x; p2=$2}' file
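Note that this compares each line only with the line directly before it, and that a start smaller than the previous one also passes the test because the difference is then negative. If the distance should instead be measured from the entry that was actually kept, one possible variant is sketched below. It is only a sketch, tried on the four sample lines from the first post; the variable names kx, k2 and d are made up here, and treating the distance as an absolute value is an assumption about what is wanted.

```shell
# Sample lines from the first post, piped in instead of read from a file.
result=$(printf '%s\n' \
    'chr15 99874874 99875874 chr15 99875173 99876173 aa1' \
    'chr15 99874923 99875923 chr15 99875173 99876173 aa1' \
    'chr15 99874962 99875962 chr15 99875173 99876173 aa1' \
    'chr1 10834962 10835962 chr3 5674767 5675545 ahc1' |
awk '{
    x = $1 FS $4 FS $5 FS $6 FS $7      # chromosome plus columns 4-7 as one key
    d = $2 - k2; if (d < 0) d = -d      # absolute distance to the last kept start
}
NR == 1 || x != kx || d > 1000 {        # first line, new key, or too far away: keep it
    print
    kx = x; k2 = $2                     # remember the entry that was kept
}')
printf '%s\n' "$result"
```

On this sample it keeps the first chr15 line and the chr1 line. With a real file, replace the printf pipe with the file name as the last argument to awk.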