Remove duplicate entries based on the range

raj_k · January 7, 2014, 5:57am

I have a file like this:

chr	start	end	
chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

What I want to do is: for the same chromosome (column 1), if the start position falls within 1000 bp of the next entry's start position and columns 4, 5, 6 and 7 are the same, I want to remove those entries and keep only the first entry.
For example, here:

chr15   99874874         99875874       chr15   99875173        99876173	aa1		
chr15   99874923         99875923       chr15   99875173        99876173	aa1
chr15   99874962         99875962       chr15   99875173        99876173	aa1

the start position (second column) varies by a few bp and columns 4, 5, 6 and 7 are the same, so I want to retain only:

chr15   99874874         99875874       chr15   99875173        99876173	aa1
chr1   10834962	10835962	chr3	5674767	5675545         	ahc1

Akshay_Hegde · January 7, 2014, 6:51am

Try this (not tested):

$ awk 'p && $2-p<=1000 && !x[$4$5$6$7]++{print last}{p=$2;last=$0}' file

raj_k · January 7, 2014, 12:30pm

Hi,
it's giving output something like this:

chr15   99874874         99875874       chr15   99875173        99876173        aa1
chr15   99874962         99875962       chr15   99875173        99876173        aa1

but this is not the desired output that I mentioned.

MadeInGermany · January 7, 2014, 1:09pm

This one only checks between adjacent lines:

awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}' file

Akshay_Hegde · January 7, 2014, 1:11pm

awk '      NR==1{
                 next
                }
  function out(){
                   if(p && $2-p<=1000 && c==0)
                   print last
                }
                {
                 out()
                }
                {
                 last=$0
                 c=x[$4$5$6$7]++
                 p=$2
                }
             END{
                 out()
                }
    ' file

chr15   99874874         99875874       chr15   99875173        99876173    aa1        
chr1   10834962    10835962    chr3    5674767    5675545             ahc1

MadeInGermany · January 7, 2014, 1:20pm

Akshay, I understood the requirement to be equal columns $1 and $4, $5, $6, $7?
At least the comparison string should be field-separated, x[$4 FS $5 FS $6 FS $7],
so that e.g. ab cd ef gh does not match a bc de fg h
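
The collision described above can be reproduced with a small sketch. The two input records are made up for illustration (they are not from the thread's data): their columns 4 to 7 differ, yet concatenate to the same string when no separator is used.

```shell
# Two hypothetical records: columns 4-7 differ, but both concatenate to "abcdefgh".
input='chr1 100 200 ab cd ef gh
chr1 150 250 a bcd ef gh'

# Key without separators: both records map to the same array index.
joined=$(printf '%s\n' "$input" | awk '!seen[$4 $5 $6 $7]++ {n++} END {print n}')

# Key with FS between the fields: the two records stay distinct.
separated=$(printf '%s\n' "$input" | awk '!seen[$4 FS $5 FS $6 FS $7]++ {n++} END {print n}')

echo "distinct keys without FS: $joined, with FS: $separated"
```

Writing the index as seen[$4,$5,$6,$7] should have the same effect, since awk joins comma-separated subscripts with SUBSEP.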

raj_k · January 9, 2014, 11:44am

Hi Akshay,
if I use your code on this data set:

chr11   87578121         87579121       chr11   87578115        87579115	ID1        
chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   87578208         87579208       chr11   87578115        87579115	ID1        
chr11   75966214         75967214       chr11   75966112        75967112	ID2        
chr11   75966257         75967257       chr11   75966112        75967112	ID2       
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr7    122067156        122068156      chr7    122067871       122068871	Id3     
chr15   66968646         66969646       chr15   67413704        67414704	ID4        
chr15   66968646         66969646       chr15   67413872        67414872	ID4

the output is as follows:

chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2       
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

It is supposed to be

chr11   87578193         87579193       chr11   87578115        87579115	ID1       
chr11   75966214         75967214       chr11   75966112        75967112	ID2 
chr7    122066072        122067072      chr7    122067871       122068871	ID3      
chr7    122067133        122068133      chr7    122067871       122068871	ID3      
chr15   66968646         66969646       chr15   67413704        67414704	ID4       
chr15   66968646         66969646       chr15   67413872        67414872	ID4

I don't know why it is skipping those lines, which also fit the condition.

---------- Post updated at 11:44 AM ---------- Previous update was at 09:21 AM ----------

@MadeInGermany
I have modified

NR>1

to

 NR>=1

because every time it produces output, it does not consider the first 3 lines in my example.
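
A small sketch of the effect described here, using only the first three lines of the data set above (whitespace normalized to single spaces). With NR>1 in the pattern, the first record can never be printed, and the two records after it are dropped as near-duplicates, so nothing is left of the group. The variable names are only for illustration.

```shell
# First three records of the example data set, single-space separated.
input='chr11 87578121 87579121 chr11 87578115 87579115 ID1
chr11 87578193 87579193 chr11 87578115 87579115 ID1
chr11 87578208 87579208 chr11 87578115 87579115 ID1'

# Original pattern: NR>1 is false on line 1, so that line is never printed.
with_guard=$(printf '%s\n' "$input" |
  awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}')

# Modified pattern: NR>=1 is always true, so line 1 is tested like any other line.
without_guard=$(printf '%s\n' "$input" |
  awk '{x=$1 FS $4 FS $5 FS $6 FS $7} (NR>=1 && !($2-p2<=1000 && x==px)) {print} {px=x; p2=$2}')

echo "NR>1 : [$with_guard]"
echo "NR>=1: [$without_guard]"
```

On line 1, px is still empty, so x==px fails and the record is printed once the NR>1 guard no longer blocks it.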

MadeInGermany · January 10, 2014, 12:15pm

You are right.
(NR>=1) is always true, so you can simplify

awk '{x=$1 FS $4 FS $5 FS $6 FS $7} !($2-p2<=1000 && x==px) {print} {px=x; p2=$2}' file
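
The one-liner above compares each line with the line directly before it, and a negative difference (a smaller start after a larger one) also satisfies $2-p2<=1000. If "keep only the first entry" is instead read as "drop everything within 1000 bp of the last line that was kept, in either direction", a variant could look like the sketch below. That reading is an assumption, not something the thread confirms, and the function name dedup_by_range and the variables max, kept_key and kept_start are hypothetical.

```shell
# Sketch: compare each record with the last *kept* record instead of the previous line.
dedup_by_range() {
  awk -v max=1000 '
    {
      key = $1 FS $4 FS $5 FS $6 FS $7   # chromosome plus columns 4-7, field-separated
      d = $2 - kept_start                # distance to the start of the last kept record
      if (d < 0) d = -d                  # assumption: the window applies in both directions
      if (NR == 1 || key != kept_key || d > max) {
        print
        kept_key = key
        kept_start = $2
      }
    }' "$@"
}

# Data from the first post, whitespace normalized to single spaces.
result=$(printf '%s\n' \
  'chr15 99874874 99875874 chr15 99875173 99876173 aa1' \
  'chr15 99874923 99875923 chr15 99875173 99876173 aa1' \
  'chr15 99874962 99875962 chr15 99875173 99876173 aa1' \
  'chr1 10834962 10835962 chr3 5674767 5675545 ahc1' | dedup_by_range)
printf '%s\n' "$result"
```

The two approaches differ on chains of entries: with starts 100, 900 and 1700 and an identical key, the adjacent-line comparison keeps only the first line, while this sketch also keeps the third, because 1700 is more than 1000 bp away from the kept start 100.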