Split and name files as 15 minute periods

Hi,

I have a number of large files (up to 4GB) that I wish to split in 96 parts e.g. one for each 15 minutes of the day. The split can be random so I am using

split -b 42M filename.csv

I want to name each of the resultant files as a distinct 15 minute period of the day e.g.
A20110619.0000-20110619.0015_transaction.csv
A20110619.0015-20110619.0030_transaction.csv

etc.

I am completly stuck and any help would be greatly appreciated.

Thanks
Kieran

What's your system? What's your shell? If you have GNU date, which usually comes with Linux, this may be possible in the shell, otherwise you'll likely need a language like Perl.

NOW_STR=$(date +'%Y-%m-%d') # Or manually set NOW_STR to 2011-01-11 for Jan 11
# -d needs GNU date.
T=$(date -d $NOW_STR +%s) # Start of the day, in epoch seconds
OFF=0

for FILE in x[a-z][a-z] # Matches xaa, xab, etc. that split makes in alphabetical order.
do
        L1="$(date -d "@$T" "+%Y%m%d.%H%M")"
        L2="$(date -d "@$((T+(60*15)))" "+%Y%m%d.%H%M")"
        echo mv "$FILE" A"${L1}-${L2}_transaction.csv"
        ((T += (60*15) )) # Add 15 minutes
        ((OFF += 15))
done

I am confused by your use of split though. You say it's random, but you split at a fixed size?

1 Like

Thanks for the reply.

I am working on AIX 5.3 with ksh. I will try that code.

With regard to split. There is a timestamp in each row of the file but I don't necessarily need to split the file so that the row with timestamp 00:05 goes into the 00:00-00:15 file. I just want to split the file up into 96 even sized files for each 15 minute period of the day. Hence the fixed split size.

Thanks again
Kieran

It won't work. You don't have GNU date.

I'll try to write something in perl.

---------- Post updated at 10:47 AM ---------- Previous update was at 10:42 AM ----------

But that will split strictly on 42 megabyte boundaries, even if that happens to be in the middle of a line.

I just saw that with the split which means I have to fix up each file and probably loose a row.
Is there a better way to split so I get full rows. How difficult is it to split the file based on the timestamp in the row which is of the form:
05/01/2011 00:02:30

I don't have GNU date as you say.

Thanks again for all your help.

#!/usr/bin/perl
# Use:  script.pl YYYY-MM-DD file1 file2 ...
use Time::Local;
use POSIX qw(mktime strftime);

# arg 1 must be YYYY-MM-DD
my ($y, $mon, $d)=split("-", shift);
$mon--; # perl counts months 0-11

my $start=timelocal(0,0,0,$d,$mon,$y);

print "y=$y, mon=$mon, d=$d\n";

while($arg=shift)
{
        $fname=sprintf("A%s-%s.csv\n",
                strftime("%Y%m%d.%H%M",localtime($start)),
                strftime("%Y%m%d.%H%M", localtime($start+(15*60))));

        system("echo mv ${arg} ${fname}");

        $start += ((15*60));
}

exit 0;

---------- Post updated at 11:50 AM ---------- Previous update was at 11:15 AM ----------

This may do it. It reads the file by itself and prints rows into different files as appropriate.

#!/usr/bin/perl
use Time::Local;
use POSIX qw(mktime strftime);

my $start=0, $end=0, $FMT="%Y%m%d.%H%M";

while($line=<STDIN>)
{
        ($mdy, $hms, $rest)=split(" ", $line);
        ($m, $d, $y)=split("/", $mdy);
        ($hour, $min, $sec)=split(":", $hms);

        $ldate=timelocal($sec, $min, $hour, $d, $m-1,$y);
        $min=$min-($min%15);    # chunks of 15 minutes

        if($ldate >= $end)
        {
                $start=timelocal($sec,$min,$hour,$d,$m-1,$y);
                $end=$start + (15*60);
                $fname=sprintf("%s-%s.csv",
                        strftime("$FMT", localtime($start)),
                        strftime("$FMT", localtime($end)));

                FILE && close(FILE);
                open(FILE, ">$fname");
        }

        print FILE "$line";
}

FILE && close(FILE);
exit 0;

Assumes the row begins with M/D/Y H:M:S ...

It'll take the file name from the row data and nothing else.

Use it like ./script.pl < hugefile

It may not be the fastest thing in the universe.

1 Like

Hi,

This looks great. I do get an error when I run it:
> ./script.pl < test.csv
Month '-1' out of range 0..11 at ./script.pl line 13

If I change m-1 to be just m in line 13 I get an error with the day
> ./script.pl < test.csv
Day '' out of range 1..31 at ./script.pl line 13

---------- Post updated at 08:07 AM ---------- Previous update was at 04:37 AM ----------

Hi,

I am actually ok with the first piece of code that uses the fixed size split. I can delete the first 'broken' row from each file. The effect will be minimal for what I need to do.

Thanks for all your help,

Kieran

Your input data's probably not as described. Please post a sample of it.

If y'say so. Your file names will have nothing to do with their contents unless by sheer coincidence.

Ok...Here is some sample data

405038002621334|918602232979||59.161.255.135|59.161.255.67|INTERNET|402|40128|1|1|05/01/2011 00:02:30
405030003291388|919033259968||59.161.255.142|59.161.255.67|INTERNET|402|40128|1|2|05/01/2011 00:02:30
405030000889166|919033829018|1234567898|59.161.255.139|59.161.255.67|INTERNET|402|40128|1|0|05/01/2011 00:02:30

No wonder, my script was expecting the date at the beginning of the line.

#!/usr/bin/perl
use Time::Local;
use POSIX qw(mktime strftime);

my $start=0, $end=0, $FMT="%Y%m%d.%H%M";

while($line=<STDIN>)
{
        # Split each line apart into the default $_ / @_ variable
        split("\\|",$line);
        # Last token is date.  Split apart date and time into two vars.
        ($mdy, $hms)=split(" ", @_[$#_]);
        #Split apart MM/DD/YYYY on /
        ($m, $d, $y)=split("/", $mdy);
        # Split apart HH:MM:SS on :
        ($hour, $min, $sec)=split(":", $hms);

        # Create a timestamp in epoch seconds from it.  $m-1 because Perl counts months 0-11
        $ldate=timelocal($sec, $min, $hour, $d, $m-1,$y);
        $min=$min-($min%15);    # chunks of 15 minutes

        # Check which date range it's in and change the output file as appropriate
        if($ldate >= $end)
        {
                $start=timelocal($sec,$min,$hour,$d,$m-1,$y);
                $end=$start + (15*60);
                $fname=sprintf("%s-%s.csv",
                        strftime("$FMT", localtime($start)),
                        strftime("$FMT", localtime($end)));

                FILE && close(FILE);
                open(FILE, ">$fname");
        }

        print FILE "$line";
}

FILE && close(FILE);
exit 0;

Excellent.

Thanks