Ravi.K
January 29, 2017, 4:23am
1
Hi,
I have received a 20 GB file. We would like to split it into 4 roughly equal parts and process them separately to avoid memory issues.
If the record delimiter were a Unix newline, I could use the split command with either the -l or -b option.
The problem is that the record terminator is |##|
How can I use split in this case, or is there another alternative?
Thanks
If you have GNU awk or mawk, you could try first converting it to regular newline termination like so:
gawk 1 RS='[|]##[|]' file > newfile
Perhaps that would solve your memory issues as well.
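To see what that one-liner does, here is a sketch on a toy input (the filenames and sample records are made up for illustration; gawk or mawk is needed because a multi-character regex RS is a non-POSIX extension):

```shell
# Build a tiny file using |##| as the record terminator.
printf '%s' '100|A|B|##|200|C|D|##|300|E|F' > sample.dat

# "1" is the always-true pattern, so awk prints every record;
# with RS='[|]##[|]' each |##|-terminated record is re-emitted
# followed by the default ORS, a newline.
gawk 1 RS='[|]##[|]' sample.dat > sample.nl

cat sample.nl
# 100|A|B
# 200|C|D
# 300|E|F
```

After the conversion, a plain `split -l` would work, but as noted below this is only safe if the data itself cannot contain newlines.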
Ravi.K
January 29, 2017, 4:44am
3
Hi, my data has Unix newlines embedded at the data level, so I should not convert the record terminator. I have to split the file into equal parts without any terminator conversion.
Could you post a sample of your input file?
rdrtx1
January 30, 2017, 6:00pm
5
awk 'NR==FNR { records = NR; next }     # pass 1: count the records
FNR == 1 {                              # pass 2 begins: records per part, rounded up
    Split = (records % Split) ? int(records / Split) + 1 : records / Split;
    part = 1;
    split_file = "split_file" part;
}
{
    printf "%s|##|", $0 > split_file;   # re-emit the record with its |##| terminator
    if (! (FNR % Split)) {              # part is full: close it, start the next file
        close(split_file);
        split_file = "split_file" ++part;
    }
}
' RS="[|]##[|]" datafile Split=4 datafile
(Needs gawk or mawk, since RS here is a multi-character regex. The file is read twice: once to count records, once to split.)
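The two-pass idea above can be sanity-checked on toy data (the 8-record input and the demo_part* file names below are illustrative; gawk or mawk is required for the regex RS):

```shell
# 8 records -> 4 parts of 2 records each.
printf '%s' '1|a|##|2|b|##|3|c|##|4|d|##|5|e|##|6|f|##|7|g|##|8|h' > demo.dat

gawk 'NR==FNR { records = NR; next }   # pass 1: count records
FNR == 1 {                             # pass 2: records per part, rounded up
    per = (records % parts) ? int(records / parts) + 1 : records / parts
    n = 1; out = "demo_part" n
}
{
    printf "%s|##|", $0 > out          # keep the |##| record terminator
    if (!(FNR % per)) { close(out); out = "demo_part" ++n }
}' RS='[|]##[|]' demo.dat parts=4 demo.dat

cat demo_part1
# 1|a|##|2|b|##|
```

Note that this re-appends |##| after the final record even if the original file did not end with one; whether that matters depends on the downstream processing.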
Ravi.K
January 31, 2017, 4:31am
6
Here is the format of the file:
|    - field delimiter
|##| - record delimiter
100|COL1|COL2|COL3|##|200|COL1|COL2|COL3|##|300|COL1|COL2|COL3|##|400|COL1|COL2|COL3|##|500|COL1|COL2|COL3
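With that format, the regex-RS approach parses the sample cleanly even without converting terminators; a quick check (file name is illustrative, and NF in the END block reflects the last record read):

```shell
printf '%s' '100|COL1|COL2|COL3|##|200|COL1|COL2|COL3|##|300|COL1|COL2|COL3|##|400|COL1|COL2|COL3|##|500|COL1|COL2|COL3' > sample.dat

# RS splits on |##| (records), -F'|' splits each record into fields;
# embedded newlines inside a record would be left untouched.
gawk -F'|' 'END { print NR " records, " NF " fields in the last record" }' \
    RS='[|]##[|]' sample.dat
# 5 records, 4 fields in the last record
```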
Thanks