I have a 3 GB text file that I would like to split. How can I do this?
It's a giant comma-separated list of numbers. I would like to make it into about 20 files of ~100 MB each, with a custom header and footer. The file can only be split on commas, but they're plentiful.
Something like this:
file:
1,2,3,4,...,12500000,12500001,...
to
file1:
v=[1,2,3,4,...,12500000];
file2:
v=[12500001,...];
...
file20:
v=[250000001,...];
where the header is "v=[" and the footer is "];". (The comma between, e.g., 12500000 and 12500001 is dropped.)
BEGIN {
    RS = ","                 # one comma-separated field per record
    max_size = 100 * 2^20    # target payload per file: 100 MiB
}

function open_file() {
    len = 0
    fn = "file" ++i          # file1, file2, ...
    printf("v=[") > fn       # header
}

function close_file() {
    printf("];") > fn        # footer
    close(fn)
}

NR == 1 {
    open_file()
}

len >= max_size {
    close_file()
    open_file()
}

{
    sub(/\n$/, "")           # the last field carries the input's trailing newline
    s = (len ? "," : "") $0  # no comma before the first field of a file
    printf("%s", s) > fn
    len += length(s)
}

END {
    if (fn)                  # skip the footer if the input was empty
        close_file()
}
Invocation:
awk -f commasplit.awk datafile
max_size is not a hard limit. A file may end up somewhat larger: up to max_size - 1 bytes of fields can already be written when the last field (plus its leading comma) is appended, and the header (3) and footer (2) come on top. The worst case is thus max_size + one field length + 5 bytes. If the field's value were "125000" (6 digits), we'd be talking about a file size of 100 MiB + 11.
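One way to gain confidence in the script is a round-trip check on a small input: split it, stitch the pieces back together, and compare with the original. A sketch, assuming the script is saved as commasplit.awk, the output files are named file1, file2, ... as above, and max_size is lowered to 100 bytes purely so that a tiny input already produces several files:

```shell
# End-to-end sanity check on a small input. Same script as above,
# but with max_size shrunk to 100 bytes for testing.
cat > commasplit.awk <<'EOF'
BEGIN { RS = ","; max_size = 100 }
function open_file() { len = 0; fn = "file" ++i; printf("v=[") > fn }
function close_file() { printf("];") > fn; close(fn) }
NR == 1 { open_file() }
len >= max_size { close_file(); open_file() }
{ sub(/\n$/, ""); s = (len ? "," : "") $0; printf("%s", s) > fn; len += length(s) }
END { if (fn) close_file() }
EOF

seq 200 | paste -sd, - > datafile        # test input: 1,2,...,200
awk -f commasplit.awk datafile

# Concatenate the pieces in numeric order, turn each "];v=[" seam
# back into a comma, strip the outer header/footer, and compare
# with the original field list.
i=1
while [ -f "file$i" ]; do cat "file$i"; i=$((i+1)); done \
  | sed 's/\];v=\[/,/g; s/^v=\[//; s/\];$//' > rejoined
tr -d '\n' < datafile | cmp - rejoined && echo round-trip OK
```

Note the explicit while loop over file1, file2, ...: a shell glob like file* would sort file10 before file2 and reassemble the pieces out of order.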