Bin iteratively based on each row

and_y · August 25, 2013, 12:05pm

Hi
I have a file something like the one below. I want to bin the data. My bin size is 100.

items	number
HELIX1	75 
HELIX6	160    
HELIX2	88    
HELIX19	114   
HELIX5	61    
HELIX4	167

It should take each element in the number column as the start of a bin and collect all the lines that fall within it, as shown below, with 100 as the interval. The third column should show the bin range.

bin_75-175.txt
HELIX1	75	75-175	   
HELIX2	88	75-175    
HELIX19	114	75-175
HELIX6  160	75-175 
HELIX4	167	75-175

bin_160-260.txt
HELIX6  	160	160-260	
HELIX4       167	160-260

bin_88-188.txt
HELIX2	88	88-188		   
HELIX19	114	88-188
HELIX6  160	88-188	
HELIX4	167	88-188

and so on, one bin for each row in the input file.

When I searched the forum I found something like this:

awk '{f=sprintf("%d", 1+$2/100); fn[f]="bin"f"_"; print $1,$2>>fn[f]}' file_name

which uses fixed bins starting at 0 (0-100, 100-200 and so on).

RudiC · August 25, 2013, 12:30pm

Not sure where you want to start your bins? At 0? at 75? at 88?

and_y · August 25, 2013, 12:42pm

The first element in the number column is 75, so it should bin from 75-175; the second element is 160, so it should bin the elements that fall between 160-260; the third element is 88, so it has to bin 88-188; the fourth element is 114, so 114-214; and so on for each element in the number column.

Scrutinizer · August 25, 2013, 12:51pm

Perhaps something like this:

awk '
  NR==1{
    next
  }
  NR==FNR{
    B[$2]=$2+100
    next
  }
  {
    for(i in B) {
      r=i "-" B[i]
      if ( i+0<=$2+0 && $2+0 < B[i]+0 ) print $0, r > ( "bin_" r )
    }
  }
' OFS='\t' file file

and_y · August 25, 2013, 1:18pm

For my trial data set the code provided is not producing any result. I am using Cygwin.

Scrutinizer · August 25, 2013, 1:25pm

I corrected it in my post. The input file needs to be specified twice.
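
As a small illustration of why the file is named twice: NR counts records across all input files, while FNR restarts at 1 for each file, so NR==FNR is only true while the first copy is being read. A minimal sketch, assuming a hypothetical two-line file demo.txt created just for this demonstration:

```shell
# demo.txt is a hypothetical two-line file used only for illustration
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/demo.txt"

# For every record, print NR, FNR and which pass awk is in
nr_fnr=$(awk '{ print NR, FNR, (NR==FNR ? "first pass" : "second pass") }' \
  "$tmp/demo.txt" "$tmp/demo.txt")
printf '%s\n' "$nr_fnr"
```

If the file is given only once, NR==FNR is true for every record, so the block meant for the second pass never runs and no bin files are written.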

and_y · August 25, 2013, 1:38pm

Thank you, it works perfectly. Could you please explain how it works?

Scrutinizer · August 25, 2013, 6:13pm

I'll try:

awk '
  NR==1{                                                               # skip the header record
    next
  }
  NR==FNR{                                                             # when reading the file for the first time ( that is when NR equals FNR )
    B[$2]=$2+100                                                       # represent the bins as an array, with index $2 (bin start) and value $2 + 100 (bin end)
    next                                                               # do not process the rest which is meant for the second time the file is read
  }
  {                                                                    # process the file for the second time
    for(i in B) {                                                      # for each index in the bins
      r=i "-" B[i]                                                     # compose the string that represents the bin's range
      if ( i+0<=$2+0 && $2+0 < B[i]+0 ) print $0, r > ( "bin_" r )     # if $2 is within the bin's range, print the record and the range to the corresponding file
    }
  }
' OFS='\t' file file                                                   # use a tab to separate the record and the range. Read the file twice: once to build the bins, once to write the output.

--
note: If there are too many bin files, close() statements will need to be added to close files along the way, otherwise there will be "too many files open" errors.
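
One possible way to add the close() calls mentioned in the note, as a sketch rather than the definitive fix: write with >> (append) and close each file right after printing, so only one output file is open at a time. The temporary directory, the dir variable passed with -v, and the out variable are assumptions made for this example, and the sample data is the one from the first post. Because >> appends, any bin_* files left over from a previous run need to be removed first.

```shell
# Work in a fresh temporary directory so no old bin_* files exist
tmp=$(mktemp -d)
printf 'items\tnumber\nHELIX1\t75\nHELIX6\t160\nHELIX2\t88\nHELIX19\t114\nHELIX5\t61\nHELIX4\t167\n' > "$tmp/file"

awk -v dir="$tmp" '
  FNR==1 { next }                        # skip the header record on both passes
  NR==FNR { B[$2]=$2+100; next }         # first pass: store each bin start and its end
  {
    for (i in B) {
      if (i+0 <= $2+0 && $2+0 < B[i]+0) {
        r = i "-" B[i]                   # the bin range, e.g. 75-175
        out = dir "/bin_" r              # hypothetical output location
        print $0, r >> out               # append, since the file is reopened for every line
        close(out)                       # release the file descriptor immediately
      }
    }
  }
' OFS='\t' "$tmp/file" "$tmp/file"

cat "$tmp/bin_160-260"
```

Closing after every line is slower than keeping files open, so this trades speed for staying under the open-file limit; with only a handful of bins the original version is fine.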