I have a Perl script which splits a large file into chunks. The script is given below:
use strict;
use warnings;

open(FH, '<', 'monolingual.txt') or die "Could not open source file. $!";

my $i = 0;
while (1) {
    my $chunk;
    print "process part $i\n";
    open(OUT, '>', "part$i.log") or die "Could not open destination file";
    $i++;
    if (!eof(FH)) {
        read(FH, $chunk, 5000);
        print OUT $chunk;
    }
    if (!eof(FH)) {
        $chunk = <FH>;
        print OUT $chunk;
    }
    close(OUT);
    last if eof(FH);
}
I want the script to create chunks of 5000 characters, or a bit less, but never more than that.
How do I modify the script to ensure that each chunk is at most 5000 characters? When I run it, some chunks come out longer than 5000 characters.
Many thanks for your kind help
As an aside, there is a split command that does exactly what you ask:
split -b [size in bytes] infile [options controlling outfile naming]
Linux man page:
split(1) - Linux manual page
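To make this concrete, here is a sketch using GNU coreutils split. The 12000-byte sample file is my own assumption, standing in for the monolingual.txt mentioned above:

```shell
# Create a 12000-byte sample file, then split it into pieces of at
# most 5000 bytes each, named part.aa, part.ab, part.ac
# ("part." is the output-file prefix).
head -c 12000 /dev/zero | tr '\0' 'a' > monolingual.txt
split -b 5000 monolingual.txt part.
wc -c part.*
```

Note that -b counts bytes, not characters, so a multi-byte UTF-8 character can be cut in half at a chunk boundary.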
Thanks a lot. Excuse my ignorance, but how many bytes do I allocate?
My data is in UTF-8, and if I want to ensure that 5000 characters go into each chunk, what would the byte size be? In ASCII each character is just 1 byte, but in UTF-8 I find that the byte size varies.
That may also be why your Perl script has issues. UTF-8 encodes all 1,112,064 Unicode code points, so a single character may take 1, 2, 3, or 4 bytes (8, 16, 24, or 32 bits).
Fixing the Perl script will require an understanding of wide characters, a locale-based "datatype" of sorts. Help is here:
Perl Programming/Unicode UTF-8 - Wikibooks, open books for an open world
Recent Linux awk (version 4.2 onward) splits UTF-8 encoded records into fields using wide characters; in perl, the -a switch forces the same split, placing the fields in the @F array. Here is a perl sample and an awk sample that do the same thing on UTF-8 files.
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' somefile.txt # $F[0] is the same as awk's $1 variable
awk -F$'\U0001f4a9' '{print $1}' somefile.txt # or $'\u007c' for 4-digit code points
The code point is the delimiter. All of this is explained in the link.
Thanks a lot for your kind help. I now understand why my Perl script goofed up as well.
---------- Post updated 05-09-18 at 01:45 AM ---------- Previous update was 05-08-18 at 10:49 PM ----------
Hello,
I found an easier method, which accommodates whole words. I am posting it in case someone meets a similar problem:
csplit filename '/([\w.,;]+\s+){5000}/'
I set it for 5000 words but it can be set for any number.