I have a 3 GB text file that I would like to split. How can I do this?
It's a giant comma-separated list of numbers. I would like to make it into about 20 files of ~100 MB each, with a custom header and footer. The file can only be split on commas, but they're plentiful.
Something like this:
file:
1,2,3,4,...,12500000,12500001,...
to
file1:
v=[1,2,3,4,...,12500000];
file2:
v=[12500001,...];
...
file20:
v=[250000001,...];
where the header is "v=[" and the footer is "];". (The comma between, e.g., 12500000 and 12500001 is dropped.)
BEGIN {
    RS = ","                 # one comma-separated field per record
    max_size = 100 * 2^20    # target payload per file: 100 MiB
}

function open_file() {
    len = 0
    fn = "file" ++i          # file1, file2, ...
    printf("v=[") > fn       # header
}

function close_file() {
    printf("];") > fn        # footer
    close(fn)
}

NR == 1 {
    open_file()
}

len >= max_size {
    close_file()
    open_file()
}

{
    sub(/\n$/, "")           # the last field carries the input's trailing newline
    s = (len ? "," : "") $0  # no comma before the first field of a file
    printf("%s", s) > fn
    len += length(s)
}

END {
    if (fn)                  # skip the footer if the input was empty
        close_file()
}
Invocation:
awk -f commasplit.awk datafile
max_size is not a hard limit. A file may end up somewhat larger: up to max_size - 1 bytes of fields can already be written when the last field (plus its leading comma) is appended, and the header (3) and footer (2) come on top. The worst case is thus max_size + one field length + 5 bytes. If the field's value were "125000" (6 digits), we'd be talking about a file size of 100 MiB + 11.
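One way to gain confidence in the script is a round-trip check on a small input: split it, stitch the pieces back together, and compare with the original. A sketch, assuming the script is saved as commasplit.awk, the output files are named file1, file2, ... as above, and max_size is lowered to 100 bytes purely so that a tiny input already produces several files:

```shell
# End-to-end sanity check on a small input. Same script as above,
# but with max_size shrunk to 100 bytes for testing.
cat > commasplit.awk <<'EOF'
BEGIN { RS = ","; max_size = 100 }
function open_file() { len = 0; fn = "file" ++i; printf("v=[") > fn }
function close_file() { printf("];") > fn; close(fn) }
NR == 1 { open_file() }
len >= max_size { close_file(); open_file() }
{ sub(/\n$/, ""); s = (len ? "," : "") $0; printf("%s", s) > fn; len += length(s) }
END { if (fn) close_file() }
EOF

seq 200 | paste -sd, - > datafile        # test input: 1,2,...,200
awk -f commasplit.awk datafile

# Concatenate the pieces in numeric order, turn each "];v=[" seam
# back into a comma, strip the outer header/footer, and compare
# with the original field list.
i=1
while [ -f "file$i" ]; do cat "file$i"; i=$((i+1)); done \
  | sed 's/\];v=\[/,/g; s/^v=\[//; s/\];$//' > rejoined
tr -d '\n' < datafile | cmp - rejoined && echo round-trip OK
```

Note the explicit while loop over file1, file2, ...: a shell glob like file* would sort file10 before file2 and reassemble the pieces out of order.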