I have an input file (more than 20K records) like the one shown below. The information I want to work with is in columns 10, 11 and 13.
Column 13: the item name. An item name may appear more than once in the table.
Column 10: a string of start positions, separated by commas.
Column 11: a string of end positions, separated by commas.
Start End Item
90098643,90152028,90178267 90098881,90152170,90185093 B1
76540388,76779489,76877692 76540569,76779684,76878102 B2
76540388,76779489,76877692 76540569,76779684,76878102 B2
90098643,90178260 90098890,90185093 B1
I would like to find the overlapping regions for each item and merge them.
Output:
Item Start End
B1 90098643,90152028,90178260 90098890,90152170,90185093
B2 76540388,76779489,76877692 76540569,76779684,76878102
In other words, the merged regions of B1 are (90098643-90098890,90152028-90152170,90178260-90185093).
The following is my script:
{
    a[$13]++
    # Append this record's positions, comma-separated, to the item's lists
    start[$13] = (start[$13] == "" ? "" : start[$13] ",") $10
    end[$13]   = (end[$13]   == "" ? "" : end[$13]   ",") $11
}
END {
    for (i in a) {
        mylen = split(start[i], split_start, ",")
        split(end[i], split_end, ",")
        mystring = ""
        # Collect each distinct start+end pair once
        for (k = 1; k <= mylen; k++) {
            pair = split_start[k] "+" split_end[k]
            if (pair in mypair) continue
            mypair[pair]++
            mystring = (mystring == "" ? "" : mystring ",") pair
        }
        split(mystring, mylist, ",")
        count = asort(mylist)    # sorts the pairs as strings, which is the problem described below
        #---------finding overlapping regions
        ind = 0
        for (z = 1; z <= count; z++) {
            split(mylist[z], item, "+")
            if (z == 1) {
                ind += 1
                unionlist[ind] = mylist[z]
            } else {
                split(unionlist[ind], old, "+")
                aa = old[1]
                bb = old[2]
                cc = item[1]
                dd = item[2]
                if (cc > bb) {
                    # no overlap: start a new region
                    ind += 1
                    unionlist[ind] = cc "+" dd
                } else if (cc >= aa && cc <= bb) {
                    # overlap (<= so a region starting exactly at bb is not dropped)
                    if (dd > bb) unionlist[ind] = aa "+" dd
                    else unionlist[ind] = aa "+" bb
                }
            }
        }
        mystring3 = ""
        for (j = 1; j <= ind; j++) mystring3 = (j == 1 ? "" : mystring3 ",") unionlist[j]
        print i, length(unionlist), mystring3
        delete mypair
        delete unionlist
        delete mylist
    }
}
From the above script, you can see that I store the regions as a string in this format: (90098643+90098890,90152028+90152170,90178260+90185093).
I want to sort them in ascending numeric order to make finding the overlapping regions easier. However, asort() is not appropriate in a case like the following, because it compares these elements as strings:
33+54
11+34
22+33
222+456
The sorted output would be:
11+34
22+33
222+456
33+54
This order is incorrect: 222+456 should come last.
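A possible workaround, sketched here as a shell function wrapping a small awk program, is to compare the two numbers inside each "start+end" string numerically instead of relying on asort(). The names sort_regions, after and sort_numeric are made up for this illustration, and the hand-written insertion sort assumes the per-item lists are short.

```shell
# sort_regions: hypothetical helper. Reads one "start+end" region per line
# and prints the regions ordered numerically by start, then by end.
sort_regions() {
    awk '
    # Return 1 when region x should come after region y (numeric comparison).
    function after(x, y,    p, q) {
        split(x, p, "+"); split(y, q, "+")
        if (p[1] + 0 != q[1] + 0) return (p[1] + 0 > q[1] + 0)
        return (p[2] + 0 > q[2] + 0)
    }
    # Plain insertion sort over arr[1..n]; adequate for short lists.
    function sort_numeric(arr, n,    i, j, tmp) {
        for (i = 2; i <= n; i++) {
            tmp = arr[i]
            for (j = i - 1; j >= 1 && after(arr[j], tmp); j--)
                arr[j + 1] = arr[j]
            arr[j + 1] = tmp
        }
    }
    { list[++n] = $0 }
    END {
        sort_numeric(list, n)
        for (k = 1; k <= n; k++) print list[k]
    }'
}
```

Fed the four regions above, this should print 11+34, 22+33, 33+54, 222+456. With gawk 4.0 or later, asort() also accepts the name of a user-defined comparison function as its third argument, which could replace the hand-written sort.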
I'm sure that the part that finds the overlapping regions is correct; I tested it in another programming language. The only problem I have now is the sorting part.
Could anyone suggest how to sort a 2-dimensional array like this?
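For comparison, another direction that might work is to sidestep sorting inside awk altogether: print one region per line, let sort(1) order the lines numerically, and merge in a second pass. This is only a sketch. It assumes whitespace-separated columns with no empty fields and item names without blanks, and merge_regions and emit are hypothetical names.

```shell
# merge_regions: hypothetical helper. Reads the 13-column records from the
# given files (or stdin) and prints "item starts ends" with regions merged.
merge_regions() {
    awk '{
        # One output line per region: item, start, end
        n = split($10, s, ","); split($11, e, ",")
        for (k = 1; k <= n; k++) print $13, s[k], e[k]
    }' "$@" |
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n |
    awk '
    # Print the regions collected for the current item, if any.
    function emit() {
        if (item == "") return
        starts = starts (starts == "" ? "" : ",") lo
        ends = ends (ends == "" ? "" : ",") hi
        print item, starts, ends
    }
    $1 != item {             # new item: output the previous one and reset
        emit()
        item = $1; lo = $2 + 0; hi = $3 + 0; starts = ends = ""
        next
    }
    $2 + 0 <= hi {           # overlaps the current region: extend it
        if ($3 + 0 > hi) hi = $3 + 0
        next
    }
    {                        # gap: close the current region, open a new one
        starts = starts (starts == "" ? "" : ",") lo
        ends = ends (ends == "" ? "" : ",") hi
        lo = $2 + 0; hi = $3 + 0
    }
    END { emit() }'
}
```

It would be called as merge_regions inputfile, where inputfile stands for the real data file. Because sort(1) compares the start and end columns as numbers, 222 sorts after 33 as intended.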
Thanks,
phoebe