Ravi.K
January 29, 2017, 4:23am
1
Hi,
I have received a 20 GB file. We would like to split it into 4 roughly equal parts and process them separately to avoid memory issues.
If the record delimiter were a Unix newline, I could use the split command with either the -l or -b option.
The problem is that the record terminator is |##|
How can I use split in this case, or is there another alternative?
Thanks
If you have GNU awk or mawk, you could try first converting it to regular newline termination like so:
gawk 1 RS='[|]##[|]' file > newfile
Perhaps that would solve your memory issues as well.
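To see what that one-liner does, here is a sketch on a toy input (the filenames and sample records are made up for illustration; gawk or mawk is needed because a multi-character regex RS is a non-POSIX extension):

```shell
# Build a tiny file using |##| as the record terminator.
printf '%s' '100|A|B|##|200|C|D|##|300|E|F' > sample.dat

# "1" is the always-true pattern, so awk prints every record;
# with RS='[|]##[|]' each |##|-terminated record is re-emitted
# followed by the default ORS, a newline.
gawk 1 RS='[|]##[|]' sample.dat > sample.nl

cat sample.nl
# 100|A|B
# 200|C|D
# 300|E|F
```

After the conversion, a plain `split -l` would work, but as noted below this is only safe if the data itself cannot contain newlines.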
Ravi.K
January 29, 2017, 4:44am
3
Hi, my data has Unix newlines embedded at the data level, so I should not convert the record terminator. I have to split the file into equal parts without any terminator conversion.
Could you post a sample of your input file?
rdrtx1
January 30, 2017, 6:00pm
5
awk 'NR==FNR { records = NR; next }     # pass 1: count the records
FNR == 1 {                              # pass 2 begins: records per part, rounded up
    Split = (records % Split) ? int(records / Split) + 1 : records / Split;
    part = 1;
    split_file = "split_file" part;
}
{
    printf "%s|##|", $0 > split_file;   # re-emit the record with its |##| terminator
    if (! (FNR % Split)) {              # part is full: close it, start the next file
        close(split_file);
        split_file = "split_file" ++part;
    }
}
' RS="[|]##[|]" datafile Split=4 datafile
(Needs gawk or mawk, since RS here is a multi-character regex. The file is read twice: once to count records, once to split.)
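The two-pass idea above can be sanity-checked on toy data (the 8-record input and the demo_part* file names below are illustrative; gawk or mawk is required for the regex RS):

```shell
# 8 records -> 4 parts of 2 records each.
printf '%s' '1|a|##|2|b|##|3|c|##|4|d|##|5|e|##|6|f|##|7|g|##|8|h' > demo.dat

gawk 'NR==FNR { records = NR; next }   # pass 1: count records
FNR == 1 {                             # pass 2: records per part, rounded up
    per = (records % parts) ? int(records / parts) + 1 : records / parts
    n = 1; out = "demo_part" n
}
{
    printf "%s|##|", $0 > out          # keep the |##| record terminator
    if (!(FNR % per)) { close(out); out = "demo_part" ++n }
}' RS='[|]##[|]' demo.dat parts=4 demo.dat

cat demo_part1
# 1|a|##|2|b|##|
```

Note that this re-appends |##| after the final record even if the original file did not end with one; whether that matters depends on the downstream processing.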
Ravi.K
January 31, 2017, 4:31am
6
Here is the format of the file:
|    - field delimiter
|##| - record delimiter
100|COL1|COL2|COL3|##|200|COL1|COL2|COL3|##|300|COL1|COL2|COL3|##|400|COL1|COL2|COL3|##|500|COL1|COL2|COL3
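With that format, the regex-RS approach parses the sample cleanly even without converting terminators; a quick check (file name is illustrative, and NF in the END block reflects the last record read):

```shell
printf '%s' '100|COL1|COL2|COL3|##|200|COL1|COL2|COL3|##|300|COL1|COL2|COL3|##|400|COL1|COL2|COL3|##|500|COL1|COL2|COL3' > sample.dat

# RS splits on |##| (records), -F'|' splits each record into fields;
# embedded newlines inside a record would be left untouched.
gawk -F'|' 'END { print NR " records, " NF " fields in the last record" }' \
    RS='[|]##[|]' sample.dat
# 5 records, 4 fields in the last record
```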
Thanks