Problem running Perl Script with huge data files

Hello Everyone,

I have a Perl script that reads two types of data files (txt and XML). These data files are very large, and there are many of them. I am using something like this:

foreach my $t (@text)
{
    open my $text_fh, '<', $t or die "Cannot open $t for reading: $!\n";

    while (my $line = <$text_fh>) {
       ....My code....
    }
    close($text_fh);   # close each file before moving on to the next
}

foreach my $x (@xml)
{
    open my $xml_fh, '<', $x or die "Cannot open $x for reading: $!\n";

    while (my $line = <$xml_fh>) {
       ....My code....
    }
    close($xml_fh);
}

When I run it directly, like this, it gives me an "Out of memory" error:

perl runXML.pl

Can anyone suggest how I can run this using "qsub" or something similar? I have these files in a directory structure like this:

/Data/2010_aaa/data.txt
/Data/2010_aaa/data.xml

/Data/2010_bbb/data.txt
/Data/2010_bbb/data.xml

/Data/2010_ccc/data.txt
/Data/2010_ccc/data.xml

/Data/2010_ddd/data.txt
/Data/2010_ddd/data.xml

And I need to run this script on all of these files, as my data is scattered across them.
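
For reference, @text and @xml are filled in from that tree with something like this (give or take the exact pattern):

my @text = glob('/Data/2010_*/data.txt');
my @xml  = glob('/Data/2010_*/data.xml');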

Thanks!

What does

ulimit -a

show?

ulimit -a

works on only one of my Solaris servers. It's not working on the second one; it asks whether I meant "unlimit":

CORRECT>unlimit -a (y|n|e|a)? yes
Usage: unlimit [-fh] [limits].

On the other Solaris server, it gives me the following:

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
coredump(blocks)     unlimited
nofiles(descriptors) 256
vmemory(kbytes)      unlimited

Thanks!

You are probably exceeding the virtual memory limit, most likely by keeping an array (or hash) that grows without bound. (As an aside: the second server is probably giving you a csh/tcsh shell, where the command is the limit builtin rather than ulimit; that's why the shell tried to spell-correct it to "unlimit".)

If the files are really big, say larger than 2 GB, consider asking the sysadmin to add more swap space. Personally, though, I believe that showing more of your code would help more than adding swap space; I suspect the code itself is hogging memory.
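
For example, if the inner code does anything like the first pattern below, memory grows with the total input size, while processing each line as it arrives keeps memory flat. This is only an illustration, since we haven't seen the inner code:

# Memory hog: accumulates every line from every file in RAM.
my @all_lines;
while (my $line = <$text_fh>) {
    push @all_lines, $line;
}

# Memory friendly: handle each line, then let it go.
while (my $line = <$text_fh>) {
    chomp $line;
    # ...process $line, keeping only small running totals...
}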

Depending on the task, you could always break the script into 3 scripts:

Script 1 (control script):

foreach my $t (@text)
{
    # Each file is handled in a child process, so its memory is freed on exit.
    system('perl', 'text_script.pl', $t) == 0
        or warn "text_script.pl failed on $t: $?\n";
}

foreach my $x (@xml)
{
    system('perl', 'xml_script.pl', $x) == 0
        or warn "xml_script.pl failed on $x: $?\n";
}
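
And since you asked about qsub: if these machines front a Grid Engine cluster, the control script could submit each file as a separate batch job instead of running the helpers one after another. A rough sketch, assuming SGE's qsub with -cwd (run in the submit directory) and -b y (treat the command line as an executable rather than a job script); your site's required options may differ:

foreach my $t (@text)
{
    system('qsub', '-cwd', '-b', 'y', 'perl', 'text_script.pl', $t) == 0
        or warn "qsub submission failed for $t: $?\n";
}

The same loop, with xml_script.pl substituted in, covers the XML files.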

Script 2 (text file processing script):

my $t = $ARGV[0];
open my $text_fh, '<', $t or die "Cannot open $t for reading: $!\n";

while (my $line = <$text_fh>) {
   ....My code....
}
close($text_fh);

Script 3 (XML file processing script):

my $x = $ARGV[0];
open my $xml_fh, '<', $x or die "Cannot open $x for reading: $!\n";

while (my $line = <$xml_fh>) {
   ....My code....
}
close($xml_fh);
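
Each helper can also be run by hand on a single file while you are testing, e.g.:

perl text_script.pl /Data/2010_aaa/data.txt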

This *may* be easier to debug and maintain as well.

Of course this approach won't work if you're trying to collect everything from all the files before doing any data processing...
...but surely you're not trying to do that...?
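
If you really do need to combine results across all the files, one memory-friendly pattern is to have each per-file script append a compact summary line to a shared results file, then post-process that file in a final step. A sketch; results.txt and the tab-separated format are placeholders for whatever your code actually produces:

# At the end of each per-file script, write results out instead of holding them in RAM:
open my $out, '>>', 'results.txt' or die "Cannot open results.txt for appending: $!\n";
print $out "$ARGV[0]\t$summary\n";   # $summary stands in for this file's computed result
close($out);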

Ed