I have a Perl script which splits a large file into chunks. The script is given below:
use strict;
use warnings;

open(FH, '<', 'monolingual.txt') or die "Could not open source file. $!";

my $i = 0;
while (1) {
    my $chunk;
    print "process part $i\n";
    open(OUT, '>', "part$i.log") or die "Could not open destination file";
    $i++;
    if (!eof(FH)) {
        read(FH, $chunk, 5000);
        print OUT $chunk;
    }
    if (!eof(FH)) {
        $chunk = <FH>;
        print OUT $chunk;
    }
    close(OUT);
    last if eof(FH);
}
I want the script to create chunks of 5000 characters, or a bit less, but never more than that.
How do I modify the script to ensure that each chunk is at most 5000 characters? When I run it, some chunks come out longer than 5000 characters.
Many thanks for your kind help
As an aside, there is a split command that does exactly what you ask:
split -b [size in bytes] infile [options controlling outfile naming]
Linux man page:
split(1) - Linux manual page
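To make this concrete, here is a sketch using GNU coreutils split. The 12000-byte sample file is my own assumption, standing in for the monolingual.txt mentioned above:

```shell
# Create a 12000-byte sample file, then split it into pieces of at
# most 5000 bytes each, named part.aa, part.ab, part.ac
# ("part." is the output-file prefix).
head -c 12000 /dev/zero | tr '\0' 'a' > monolingual.txt
split -b 5000 monolingual.txt part.
wc -c part.*
```

Note that -b counts bytes, not characters, so a multi-byte UTF-8 character can be cut in half at a chunk boundary.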
Thanks a lot. Excuse my ignorance, but how many bytes do I allocate?
My data is in UTF-8, and if I want to ensure that 5000 characters go into each chunk, what would the byte size be? In ASCII each character is just 1 byte, but in UTF-8 I find that the byte size varies.
That may also be why your Perl script has issues. UTF-8 encodes all 1,112,064 Unicode code points, so a single character may take 1, 2, 3, or 4 bytes (8, 16, 24, or 32 bits).
Fixing the Perl script will require an understanding of wide characters, a locale-based "datatype" of sorts. Help is here:
Perl Programming/Unicode UTF-8 - Wikibooks, open books for an open world
Recent Linux awk (version 4.2 onward) splits UTF-8 encoded records into fields using wide characters; in perl, the -a switch forces the same split, placing the fields in the @F array. Here is a perl sample and an awk sample that do the same thing on UTF-8 files.
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' somefile.txt # $F[0] is the same as awk's $1 variable
awk -F$'\U0001f4a9' '{print $1}' somefile.txt # or $'\u007c' for 4-digit code points
The code point is the delimiter. All of this is explained in the link.
Thanks a lot for your kind help. I now understand why my Perl script goofed up as well.
---------- Post updated 05-09-18 at 01:45 AM ---------- Previous update was 05-08-18 at 10:49 PM ----------
Hello,
I found an easier method, which accommodates whole words. I am posting it in case someone meets a similar problem:
csplit filename '/([\w.,;]+\s+){5000}/'
I set it for 5000 words but it can be set for any number.