I have a BED file (shown below). I want to split it based on the base length (2999 bp) between the start and end positions. For example, the lines from start position 12109 to end position 14678 should go into one file, as they fall within a 2999 bp range. The line with start position 15573 and end position 15612 should go into another file (the splitting condition is a length of 2999 bp from the first start position to the last end position), and so on.
I tried the bedtools makewindows and bedops chop options, but they didn't work.
The input file (has many lines):
Sp_chr1 12109 12149 DNA Sequences
Sp_chr1 12348 12388 DNA Sequences
Sp_chr1 12493 12533 DNA Sequences
Sp_chr1 12616 12656 DNA Sequences
Sp_chr1 12746 12786 DNA Sequences
Sp_chr1 14486 14521 DNA Sequences
Sp_chr1 14525 14564 DNA Sequences
Sp_chr1 14638 14678 DNA Sequences
Sp_chr1 15573 15612 DNA Sequences
Sp_chr1 20498 20538 DNA Sequences
Sp_chr1 21628 21668 DNA Sequences
Sp_chr1 25346 25386 DNA Sequences
Sp_chr1 26053 26093 DNA Sequences
Sp_chr1 26129 26169 DNA Sequences
Sp_chr1 27874 27913 DNA Sequences
The desired output files are :
The output file 1:
Sp_chr1 12109 12149 DNA Sequences
Sp_chr1 12348 12388 DNA Sequences
Sp_chr1 12493 12533 DNA Sequences
Sp_chr1 12616 12656 DNA Sequences
Sp_chr1 12746 12786 DNA Sequences
Sp_chr1 14486 14521 DNA Sequences
Sp_chr1 14525 14564 DNA Sequences
Sp_chr1 14638 14678 DNA Sequences
The output file2:
Sp_chr1 15573 15612 DNA Sequences
The output file3:
Sp_chr1 20498 20538 DNA Sequences
Sp_chr1 21628 21668 DNA Sequences
I tried using the bedtools makewindows option and the bedops chop option, but neither worked the way I need.
This is why I am asking here. I know that this forum, like other forums, is not a script-writing service. This is the best I can do, because the approaches I know did not work.
set startnum=0
set fileext = 1
loop:
read line from input
awk '{ Name= $1;
startposition = $2; stopposition = $3; for (startposition = stopposition + 2999); print '{Name}'
append line to filename("file" + fileext)
end loop
--- Post updated at 08:58 AM ---
Hi,
Thanks. I have a huge file where there are many lines.
Sorry about my delay in getting back; I think timezone differences are involved here. Looks like we have some nice solutions coming together in this thread now.
This is the pseudo code I had in mind when I first read your requirements:
set startnum = 0
set fileext = 0
loop:
read line from input
if column#3 - startnum > 2999 then
startnum = column#2
fileext = fileext + 1
endif
append line to filename("file" + fileext)
end loop
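A minimal awk sketch of that pseudocode might look like the following. It is not taken from anyone's post in this thread: the function name split_bed is hypothetical, the output names file1, file2, ... simply follow the pseudocode, and it assumes the input is sorted by start position and holds a single chromosome, as in the sample. The counter starts at 0 here so that the first group lands in file1, and each finished file is closed before the next one is opened.

```shell
# split_bed is a hypothetical helper name; it writes file1, file2, ...
# into the current directory.
# Assumption: input is sorted by start position, one chromosome only.
split_bed() {
    awk '
        # Open a new group on the first line, or when the current end
        # position ($3) is more than 2999 bp past the group start.
        NR == 1 || $3 - start > 2999 {
            if (out != "") close(out)   # release the finished file
            start = $2                  # start position of the new group
            n++
            out = "file" n
        }
        # Every line goes to the file of the current group.
        { print > out }
    ' "$1"
}
```

Run as split_bed input.bed (input.bed being a placeholder name) on the 15 sample lines from the first post, this should give file1 with eight lines, file2 with one, file3 with two, and file4 with the remaining four.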
This is very similar to RudiC's proposal; the main difference is the close statement. awk can keep only a limited number of output files open at once, so using close becomes necessary when dealing with larger input files that would otherwise generate too many open output files.
And in the spirit of this site, this reduced solution could be derived from the above: