Split and name files as 15 minute periods

ksexton · June 22, 2011, 11:14am

Hi,

I have a number of large files (up to 4GB) that I wish to split in 96 parts e.g. one for each 15 minutes of the day. The split can be random so I am using

split -b 42M filename.csv

I want to name each of the resultant files as a distinct 15 minute period of the day e.g.
A20110619.0000-20110619.0015_transaction.csv
A20110619.0015-20110619.0030_transaction.csv

etc.

I am completly stuck and any help would be greatly appreciated.

Thanks
Kieran

Corona688 · June 22, 2011, 12:11pm

What's your system? What's your shell? If you have GNU date, which usually comes with Linux, this may be possible in the shell, otherwise you'll likely need a language like Perl.

NOW_STR=$(date +'%Y-%m-%d') # Or manually set NOW_STR to 2011-01-11 for Jan 11
# -d needs GNU date.
T=$(date -d $NOW_STR +%s) # Start of the day, in epoch seconds
OFF=0

for FILE in x[a-z][a-z] # Matches xaa, xab, etc. that split makes in alphabetical order.
do
        L1="$(date -d "@$T" "+%Y%m%d.%H%M")"
        L2="$(date -d "@$((T+(60*15)))" "+%Y%m%d.%H%M")"
        echo mv "$FILE" A"${L1}-${L2}_transaction.csv"
        ((T += (60*15) )) # Add 15 minutes
        ((OFF += 15))
done

I am confused by your use of split though. You say it's random, but you split at a fixed size?

ksexton · June 22, 2011, 12:36pm

Thanks for the reply.

I am working on AIX 5.3 with ksh. I will try that code.

With regard to split. There is a timestamp in each row of the file but I don't necessarily need to split the file so that the row with timestamp 00:05 goes into the 00:00-00:15 file. I just want to split the file up into 96 even sized files for each 15 minute period of the day. Hence the fixed split size.

Thanks again
Kieran

Corona688 · June 22, 2011, 12:47pm

It won't work. You don't have GNU date.

I'll try to write something in perl.

---------- Post updated at 10:47 AM ---------- Previous update was at 10:42 AM ----------

But that will split strictly on 42 megabyte boundaries, even if that happens to be in the middle of a line.

ksexton · June 22, 2011, 12:52pm

I just saw that with the split which means I have to fix up each file and probably loose a row.
Is there a better way to split so I get full rows. How difficult is it to split the file based on the timestamp in the row which is of the form:
05/01/2011 00:02:30

I don't have GNU date as you say.

Thanks again for all your help.

Corona688 · June 22, 2011, 1:50pm

#!/usr/bin/perl
# Use:  script.pl YYYY-MM-DD file1 file2 ...
use Time::Local;
use POSIX qw(mktime strftime);

# arg 1 must be YYYY-MM-DD
my ($y, $mon, $d)=split("-", shift);
$mon--; # perl counts months 0-11

my $start=timelocal(0,0,0,$d,$mon,$y);

print "y=$y, mon=$mon, d=$d\n";

while($arg=shift)
{
        $fname=sprintf("A%s-%s.csv\n",
                strftime("%Y%m%d.%H%M",localtime($start)),
                strftime("%Y%m%d.%H%M", localtime($start+(15*60))));

        system("echo mv ${arg} ${fname}");

        $start += ((15*60));
}

exit 0;

---------- Post updated at 11:50 AM ---------- Previous update was at 11:15 AM ----------

This may do it. It reads the file by itself and prints rows into different files as appropriate.

#!/usr/bin/perl
use Time::Local;
use POSIX qw(mktime strftime);

my $start=0, $end=0, $FMT="%Y%m%d.%H%M";

while($line=<STDIN>)
{
        ($mdy, $hms, $rest)=split(" ", $line);
        ($m, $d, $y)=split("/", $mdy);
        ($hour, $min, $sec)=split(":", $hms);

        $ldate=timelocal($sec, $min, $hour, $d, $m-1,$y);
        $min=$min-($min%15);    # chunks of 15 minutes

        if($ldate >= $end)
        {
                $start=timelocal($sec,$min,$hour,$d,$m-1,$y);
                $end=$start + (15*60);
                $fname=sprintf("%s-%s.csv",
                        strftime("$FMT", localtime($start)),
                        strftime("$FMT", localtime($end)));

                FILE && close(FILE);
                open(FILE, ">$fname");
        }

        print FILE "$line";
}

FILE && close(FILE);
exit 0;

Assumes the row begins with M/D/Y H:M:S ...

It'll take the file name from the row data and nothing else.

Use it like ./script.pl < hugefile

It may not be the fastest thing in the universe.

ksexton · June 23, 2011, 9:07am

Hi,

This looks great. I do get an error when I run it:
> ./script.pl < test.csv
Month '-1' out of range 0..11 at ./script.pl line 13

If I change m-1 to be just m in line 13 I get an error with the day
> ./script.pl < test.csv
Day '' out of range 1..31 at ./script.pl line 13

---------- Post updated at 08:07 AM ---------- Previous update was at 04:37 AM ----------

Hi,

I am actually ok with the first piece of code that uses the fixed size split. I can delete the first 'broken' row from each file. The effect will be minimal for what I need to do.

Thanks for all your help,

Kieran

Corona688 · June 23, 2011, 12:10pm

Your input data's probably not as described. Please post a sample of it.

If y'say so. Your file names will have nothing to do with their contents unless by sheer coincidence.

ksexton · June 23, 2011, 12:17pm

Ok...Here is some sample data

405038002621334|918602232979||59.161.255.135|59.161.255.67|INTERNET|402|40128|1|1|05/01/2011 00:02:30
405030003291388|919033259968||59.161.255.142|59.161.255.67|INTERNET|402|40128|1|2|05/01/2011 00:02:30
405030000889166|919033829018|1234567898|59.161.255.139|59.161.255.67|INTERNET|402|40128|1|0|05/01/2011 00:02:30

Corona688 · June 23, 2011, 12:32pm

No wonder, my script was expecting the date at the beginning of the line.

#!/usr/bin/perl
use Time::Local;
use POSIX qw(mktime strftime);

my $start=0, $end=0, $FMT="%Y%m%d.%H%M";

while($line=<STDIN>)
{
        # Split each line apart into the default $_ / @_ variable
        split("\\|",$line);
        # Last token is date.  Split apart date and time into two vars.
        ($mdy, $hms)=split(" ", @_[$#_]);
        #Split apart MM/DD/YYYY on /
        ($m, $d, $y)=split("/", $mdy);
        # Split apart HH:MM:SS on :
        ($hour, $min, $sec)=split(":", $hms);

        # Create a timestamp in epoch seconds from it.  $m-1 because Perl counts months 0-11
        $ldate=timelocal($sec, $min, $hour, $d, $m-1,$y);
        $min=$min-($min%15);    # chunks of 15 minutes

        # Check which date range it's in and change the output file as appropriate
        if($ldate >= $end)
        {
                $start=timelocal($sec,$min,$hour,$d,$m-1,$y);
                $end=$start + (15*60);
                $fname=sprintf("%s-%s.csv",
                        strftime("$FMT", localtime($start)),
                        strftime("$FMT", localtime($end)));

                FILE && close(FILE);
                open(FILE, ">$fname");
        }

        print FILE "$line";
}

FILE && close(FILE);
exit 0;

ksexton · June 23, 2011, 1:21pm

Excellent.

Thanks