Split a file based on number sum at the second column and the third column.

demo10 · March 1, 2020, 9:55pm

Dear all,

I have the BED file below. I want to split it so that each output file spans at most 2999 bp, measured from the start position of its first line to the end position of its last line. For example, the lines from start position 12109 to end position 14678 should go into one file, because they fall within a 2999 bp range. The line with start position 15573 and end position 15612 should go into another file, and so on.
I tried the bedtools makewindows and bedops chop options, but they didn't work.

The input file (has many lines):

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences 
Sp_chr1 15573   15612 DNA Sequences 
Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences 
Sp_chr1 25346   25386 DNA Sequences 
Sp_chr1 26053   26093 DNA Sequences 
Sp_chr1 26129   26169 DNA Sequences 
Sp_chr1 27874   27913 DNA Sequences

The desired output files are :
The output file 1:

Sp_chr1 12109   12149 DNA Sequences 
Sp_chr1 12348   12388 DNA Sequences 
Sp_chr1 12493   12533 DNA Sequences 
Sp_chr1 12616   12656 DNA Sequences 
Sp_chr1 12746   12786 DNA Sequences 
Sp_chr1 14486   14521 DNA Sequences 
Sp_chr1 14525   14564 DNA Sequences  
Sp_chr1 14638   14678 DNA Sequences

The output file2:

Sp_chr1 15573   15612 DNA Sequences

The output file3:

Sp_chr1 20498   20538 DNA Sequences 
Sp_chr1 21628   21668 DNA Sequences

and so on.

Chubler_XL · March 1, 2020, 10:32pm

This forum is not a script writing service.

If you have a solution you have worked on that is not complete we can help you, but you must have shown some effort to solve this yourself.

demo10 · March 2, 2020, 12:41am

I tried the bedtools makewindows option and the bedops chop option, but neither of them did what I need.
This is why I asked here. I know that this forum, like other forums, is not a script writing service. This is the best I can do, because the approaches I know did not work.

Chubler_XL · March 2, 2020, 1:59am

I understand that the specifics of a particular tool or language may require knowledge you don't have.

How about giving us some pseudo code for how this file should be processed?

e.g.

set startnum=0
set fileext = 1
loop:
    read line from input
     ...
    append line to filename("file" + fileext)
end loop

Can you fill in the missing logic for "..." above?

nezabudka · March 2, 2020, 2:29am

Hi
Maybe just like that?

awk '
/^\S+\s+12109/,/^(\S+\s+){2}14678\s/ {print > "file1"}
/^\S+\s+15573/,/^(\S+\s+){2}15612\s/ {print > "file2"}
/^\S+\s+20498/,/^(\S+\s+){2}21668\s/ {print > "file3"}
' file

demo10 · March 2, 2020, 2:58am

set startnum=0
set fileext = 1
loop:
    read line from input
     awk '{ Name= $1; 
         startposition = $2; stopposition = $3; for (startposition = stopposition + 2999); print '{Name}'
    append line to filename("file" + fileext)
end loop

--- Post updated at 08:58 AM ---

Hi,
Thanks. I have a huge file where there are many lines.

awk '
/^\S+\s+12109/,/^(\S+\s+){2}14678\s/ {print > "file1"}
/^\S+\s+15573/,/^(\S+\s+){2}15612\s/ {print > "file2"}
/^\S+\s+20498/,/^(\S+\s+){2}21668\s/ {print > "file3"}
' file

nezabudka · March 2, 2020, 4:07am

maybe so?

#!/bin/bash

step=2999
declare -i start=12109 end=start+step count=1
stop=$(awk '{if($3>max) max=$3} END {print max}' file)

while [ $end -le $stop ]; do
        awk -vA=$start -vZ=$end -vf="file$count" '
                $2>=A && $3<=Z {print > f}
        ' file
        start+=step
        end+=step
        count+=1
done

RudiC · March 2, 2020, 4:20am

Try

awk '
#NR == 1          ||
$3-ST>2999      {FN = "file" ++FCNT
                 ST = $2
                }
                {print  >  FN
                }
 ' file

This assumes the values in $3 start at values higher than 2999; if that is not the case, remove the # before the NR == 1 line.
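
To illustrate that caveat, here is a minimal sketch with the NR == 1 guard enabled. The input is a made-up sample with small coordinates (not from the thread), and the names small.bed and small_part are hypothetical. Without the guard, the first line's $3 (140) would not exceed 2999, so FN would still be empty when the first print runs.

```shell
# Hypothetical sample whose first end position is below 2999.
printf '%s\n' \
  'Sp_chr1 100 140 DNA Sequences' \
  'Sp_chr1 500 540 DNA Sequences' \
  'Sp_chr1 3500 3540 DNA Sequences' > small.bed

awk '
NR == 1 || $3 - ST > 2999 {        # the first line always opens a file
    FN = "small_part" (++FCNT)     # next output name: small_part1, small_part2, ...
    ST = $2                        # remember where the current window starts
}
{ print > FN }
' small.bed
```

With this input, the first two lines should end up in small_part1 (540 - 100 is within 2999) and the third line in small_part2.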

demo10 · March 2, 2020, 5:06am

nezabudka:

Hi
Maybe just like that?

awk '
/^\S+\s+12109/,/^(\S+\s+){2}14678\s/ {print > "file1"}
/^\S+\s+15573/,/^(\S+\s+){2}15612\s/ {print > "file2"}
/^\S+\s+20498/,/^(\S+\s+){2}21668\s/ {print > "file3"}
' file

Thank you so much. This is what I need.

nezabudka:

maybe so?

#!/bin/bash

step=2999
declare -i start=12109 end=start+step count=1
stop=$(awk '{if($3>max) max=$3} END {print max}' file)

while [ $end -le $stop ]; do
   awk -vA=$start -vZ=$end -vf="file$count" '
   $2>=A && $3<=Z {print > f}
   ' file
   start+=step
   end+=step
   count+=1
done

Chubler_XL · March 2, 2020, 2:51pm

Sorry about my delay in getting back; I think timezone differences are involved here. It looks like we have some nice solutions coming together in this thread now.

This is the pseudo code I had in mind when I first read your requirements:

set startnum=0
set fileext = 1
loop:
    read line from input
    if startnum == 0 then
        startnum = column#2
    endif
    if column#3 - startnum > 2999 then
        startnum = column#2
        fileext = fileext + 1
    endif
    append line to filename("file" + fileext)
end loop

And the awk coded solution:

awk '
BEGIN { start=0 ; filenum=1 }
!start { start=$2 }
($3 - start) > 2999 {
   close("file" filenum)
   filenum++
   start=$2
}
{ print > ("file" filenum) }' infile

This is very similar to RudiC's proposal; the main difference is the close statement. awk can keep only a limited number of output files open at once, so calling close becomes necessary when a larger input file generates too many output files.
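
As a rough way to see why close matters, the sketch below builds a synthetic input in which every line starts a new window, so 1200 output files are created. The directory close_demo, the file many.bed and the variable names out and n are all made up for this illustration. The exact open-file limit depends on the awk implementation and the system, so the number 1200 is only an assumption chosen to exceed a common default.

```shell
mkdir -p close_demo

# Synthetic input: 1200 intervals spaced 5000 bp apart, so each line
# exceeds the 2999 bp window and starts a new output file.
awk 'BEGIN { for (i = 1; i <= 1200; i++) print "Sp_chr1", i * 5000, i * 5000 + 40 }' \
    > close_demo/many.bed

awk '
!start || ($3 - start) > 2999 {
    if (out != "") close(out)        # release the previous file before opening the next
    out = "close_demo/part" (++n)    # part1, part2, ...
    start = $2
}
{ print > out }
' close_demo/many.bed

# Count the output files; this should report 1200.
ls close_demo/part* | wc -l
```

Because each file is closed before the next one is opened, at most one output file is open at any time, regardless of how many windows the input produces.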

And, in the spirit of this site, this reduced solution can be derived from the one above:

awk '
!start || ($3 - start) > 2999 {
   close("file" filenum++)
   start=$2
}
{ print > ("file" filenum) }' infile
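
As a quick check, the reduced script can be run against the sample from the first post. This is only a sketch: the data is copied from the thread with trailing spaces trimmed, and the output target is parenthesized as ("file" filenum), which is my own portability choice because some awk versions reject an unparenthesized concatenation after >.

```shell
# Sample input from the first post.
cat > infile <<'EOF'
Sp_chr1 12109   12149 DNA Sequences
Sp_chr1 12348   12388 DNA Sequences
Sp_chr1 12493   12533 DNA Sequences
Sp_chr1 12616   12656 DNA Sequences
Sp_chr1 12746   12786 DNA Sequences
Sp_chr1 14486   14521 DNA Sequences
Sp_chr1 14525   14564 DNA Sequences
Sp_chr1 14638   14678 DNA Sequences
Sp_chr1 15573   15612 DNA Sequences
Sp_chr1 20498   20538 DNA Sequences
Sp_chr1 21628   21668 DNA Sequences
Sp_chr1 25346   25386 DNA Sequences
Sp_chr1 26053   26093 DNA Sequences
Sp_chr1 26129   26169 DNA Sequences
Sp_chr1 27874   27913 DNA Sequences
EOF

awk '
!start || ($3 - start) > 2999 {
    close("file" (filenum++))    # close the previous file, then move to the next number
    start = $2                   # the new window starts at this line
}
{ print > ("file" filenum) }' infile

# Show how many lines landed in each output file.
wc -l file1 file2 file3 file4
```

On this sample, file1 through file4 should hold 8, 1, 2 and 4 lines respectively, which matches the three output files listed in the first post plus a fourth file for the remaining lines (25346 to 27913).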