Split a large file into 24 smaller files on an hourly basis

I have a large file with 24 hours of log data in the format below. I need to split this large file into 24 smaller files on an hourly basis, e.g. from 09:55 to 10:55, 10:55 to 11:55, and so on.
Can anyone help me with this?

timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,failureMessage,bytes,grpThreads,allThreads,URL,Latency,IdleTime
2019-05-20 09:55:39.945,142,GETHome,200,OK,SequenceLength 1-50,text,true,,9937,1,2,http://bus00bpo.us.oracle.com:8181/,141,0
2019-05-20 09:55:33.820,217,GETHome,200,OK,SequenceLength 1-50,text,true,,9937,1,2,http://bus00bpo.us.oracle.com:8181/,217,0
2019-05-20 09:55:34.550,103,GETregistry,200,OK,SequenceLength 1-50,text,true,,66305,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/registry,102,0
2019-05-20 09:55:35.739,47,GETPublish,200,OK,SequenceLength 1-50,text,true,,1068,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/publish,46,0
2019-05-20 09:55:36.016,1254,GEThomeLayout,200,OK,SequenceLength 1-50,text,true,,170861,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/pages/layout/home?ccvp=lg,1252,0
2019-05-20 09:55:38.069,356,GEThomeCurrentData,200,OK,SequenceLength 1-50,text,true,,3312,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/pages/home?dataOnly=false&currentDataOnly=true,356,0
2019-05-20 09:55:39.413,454,GEThomeCacheableData,200,OK,SequenceLength 1-50,text,true,,19239,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/pages/home?dataOnly=false&cacheableDataOnly=true&productTypesRequired=true,453,0
2019-05-20 09:55:40.564,38,GEThomeCSS,204,No Content,SequenceLength 1-50,,true,,517,1,2,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/pages/css/home?occsite=siteUS,0,0
.....
....
....
2019-05-21 10:10:31.197,11,GETresourceNsCommon,200,OK,SequenceLength 1-1,text,true,,29169,1,1,http://bus00bpo.us.oracle.com:8181/ccstoreui/v1/resources/ns.common?locale=en,11,0

What have you tried?

What scripts have you written?

What are your attempts to process this file yourself?

I tried with a for loop, but it didn't work for me. I am sorry to say I am not good at awk; I am looking for a small awk script that will help me, which is why I posted the question.

Why those odd numbers, why not 9:00h - 10:00h etc?

Why not midnight till midnight?

Actually, it is a log from a test run, so whenever the test run is started this log starts rolling; that is why I can't consider it as midnight till midnight.
If I could pass the test run starting time as 09:55, it should split into 24 files accordingly.
Can you help me in this scenario?

Zeroth approximation - try and see if you can adapt it:

grep -E "2019-05-20 (09:5[5-9]|10:5?[0-4])" file

Might be interesting to see the regex around midnight.
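
For what it's worth, around midnight the pattern also has to switch the date, so a first sketch for the 23:55 to 00:54 window (using the dates from the sample data; adjust as needed) might be:

grep -E "2019-05-20 23:5[5-9]|2019-05-21 00:5?[0-4]" file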


If it were me (I process text files with PHP), I would do something like this:

Use the PHP date function, like so:

date('H', $timestamp);
// read the file.
// foreach line in the file as entry
// get the formatted date in each entry and convert it to a unix timestamp ($timestamp)
$hour = date('H', $timestamp);   // 'H' = 24-hour format, so the 24 files don't collide
// append entry to 'filename_' . $hour . '.txt' or whatever you want to call these 24 files.

This is almost trivial in PHP using date().

Hope this helps.

EDIT: See next post for a more complete PHP solution.

Actually, you do not even need to build a unix timestamp yourself in PHP; you can feed your formatted time string straight into DateTime:

For example, I just tested your example data with your formatted time:

<?php
echo (new DateTime("2019-05-20 09:55:39.945"))->format('H');

09

No need to convert to a unix timestamp.

Untested, but close:

<?php
// file() returns the lines as an array; file_get_contents() would return one big string
$data = file('mydata.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($data as $entry) {
    // the first CSV field is the timestamp, e.g. "2019-05-20 09:55:39.945"
    $formatted_time = explode(',', $entry)[0];
    if (!preg_match('/^\d{4}-\d{2}-\d{2} /', $formatted_time)) continue;  // skip the header line
    $hour = (new DateTime($formatted_time))->format('H');
    $file = 'myfilename' . $hour . '.txt';
    file_put_contents($file, $entry . PHP_EOL, FILE_APPEND | LOCK_EX);
}

Hope this PHP example helps.

If you want it to be "harder to read" and short:

<?php
$data = file('mydata.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($data as $entry) {
    $m = explode(',', $entry)[0];
    if (!preg_match('/^\d{4}-/', $m)) continue;
    file_put_contents('myfilename' . (new DateTime($m))->format('H') . '.txt', $entry . PHP_EOL, FILE_APPEND | LOCK_EX);
}

I prefer the long, easy to read, more-or-less self-documenting (first) version.

@Neo Okay, I will try this.

Here is a shell script that simply splits on the hour (00...23):

#!/bin/sh
# Write the lines read on stdin to one file per clock hour; $1 is the output file suffix.
split_to(){
  prev=""
  read header                          # keep the CSV header; it is repeated in every output file
  while IFS=" :" read date hour rest   # split "YYYY-MM-DD HH:MM:SS.mmm,..." into date, hour, remainder
  do
    case $hour in
    ([0-9][0-9])
      if [ "$hour" != "$prev" ]        # hour changed: switch stdout to the next output file
      then
        exec >"$hour.$1"
        prev=$hour
        echo "$header"
      fi
      echo "${date} ${hour}:${rest}"   # reassemble and write the original line
    ;;
    esac
  done
}

if [ $# -eq 0 ]
then
  split_to out
fi
for arg
do
  split_to "$arg" < "$arg"
done

Run with /path/to/scriptname filename to get 00.filename ... 23.filename
or with /path/to/scriptname < filename to get 00.out ... 23.out

I wrote a shell script that worked for me.
Posting it here for reference:

#!/bin/bash
# Usage: pass the first timestamp of the test run, e.g. "2019-05-20 09:55:39"
timestamp=$1

z=1
max=24
for (( i=1; i <= $max; ++i ))
do
  echo "starting date is $timestamp"
  # line number of the first occurrence of the current start timestamp
  get_starting_line_number=$(grep -n "$timestamp" summaryReport_LoadTest.csv | head -1 | cut -d: -f1)
  echo "starting line number is $get_starting_line_number"
  # start timestamp plus one hour
  ntimestamp=$(date -d "$timestamp 1 hour" +"%Y-%m-%d %H:%M:%S")
  echo "ending date is $ntimestamp"
  # line number of the last occurrence of the end timestamp
  get_ending_line_number=$(grep -n "$ntimestamp" summaryReport_LoadTest.csv | tail -1 | cut -d: -f1)
  echo "ending line number is $get_ending_line_number"
  # copy that line range into the per-hour file ...
  awk -v s="$get_starting_line_number" -v e="$get_ending_line_number" 'NR >= s && NR <= e' summaryReport_LoadTest.csv > summaryReport_LoadTest_"$z".csv
  z=$(expr $z + 1)
  echo $z
  # ... and delete it (except its last line) from the source so the next grep starts at the new hour
  b=$(expr $get_ending_line_number - 1)
  sed -i "${get_starting_line_number},${b}d" summaryReport_LoadTest.csv
  timestamp=$ntimestamp
  echo "**********************************************************************************************"
done
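
For reference, a run would look something like this (split_hourly.sh is just a hypothetical name for the script above; the timestamp must match the format used in the CSV):

./split_hourly.sh "2019-05-20 09:55:39"

It produces summaryReport_LoadTest_1.csv through summaryReport_LoadTest_24.csv. Note that the script deletes each processed range from summaryReport_LoadTest.csv with sed -i as it goes, so it is best run on a copy of the file.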

Try also

paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | awk -F"\t" '
NR == 1         {sub ($1 FS, "")                # strip the column that paste prepended to the header line
                 HD = $0                        # remember the header
                }
NR == 2         {BDT = $1 - $1%60               # begin time: first data line in epoch seconds, rounded down to the minute
                }
NR >  1         {HCNT = int (($1 - BDT)/3600)   # hour index 0..23 relative to the begin time
                 if (HCNT > 23) exit            # stop after 24 hours
                 FN = "summaryReport_LoadTest_" HCNT ".csv"
                 sub ($1 FS, "")                # strip the prepended epoch column
                 if (!X[FN]++) print HD  >  FN  # write the header once per output file
                 print  >  FN
                }
'

It calculates the start from the first timestamp encountered in line 2. If you want to pass the start from outside, set BDT with awk's -v option. Then, to skip the leading lines outside the target window, some additional logic needs to be added.
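
As an untested sketch only (not the posted solution), that variant could look roughly like this; the -v assignment and the next statement are the only changes, and the start time 2019-05-20 09:55:39 is taken from the sample data:

paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | awk -F"\t" -v BDT="$(date -d "2019-05-20 09:55:39" +%s)" '
NR == 1         {sub ($1 FS, "")
                 HD = $0
                }
NR >  1         {if ($1 < BDT) next              # skip lines before the supplied start time
                 HCNT = int (($1 - BDT)/3600)
                 if (HCNT > 23) exit
                 FN = "summaryReport_LoadTest_" HCNT ".csv"
                 sub ($1 FS, "")
                 if (!X[FN]++) print HD  >  FN
                 print  >  FN
                }
'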

I am getting an error like this...

./script3.sh: line 1: syntax error near unexpected token `('
./script3.sh: line 1: `paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | awk -F"\t" ''

What is your shell?

NEVER forget to tell your OS and shell in a thread.

Bash shell, and the OS is Linux.

The proposal has been tested on Linux/bash and worked fine. What's your bash version? What's the result of the paste command (without the pipe to awk)?
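
For context: <( ... ) is process substitution, which plain sh does not understand, so that syntax error usually means the script was started with sh instead of bash. Typed at an interactive bash prompt, the two checks asked for above would be:

bash --version
paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | head -3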

Sorry RudiC, it worked fine!
But the script you gave (quoted below) works for 09:55-10:55.
Can you help me make it work with seconds as well, i.e. 09:55:39 to 10:55:39?

paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | awk -F"\t" '
NR == 1         {sub ($1 FS, "")
                 HD = $0
                }
NR == 2         {BDT = $1 - $1%60
                }
NR >  1         {HCNT = int (($1 - BDT)/3600)
                 if (HCNT > 23) exit
                 FN = "summaryReport_LoadTest_" HCNT ".csv"
                 sub ($1 FS, "")
                 if (!X[FN]++) print HD  >  FN
                 print  >  FN
                }
'

Because when I counted the lines from 09:55 to 10:55 I got 827785,
but my own script, which calculates from 09:55:39 to 10:55:39, gives 838331 lines.
The problem with my script is that it runs for 10 minutes, whereas the awk solution you gave runs in 1 minute.
So could you please modify the script so that it works with seconds, 09:55:39 to 10:55:39?

That should not be a problem as there is one statement that strips the seconds from the start timestamp found in the first data line. Can you identify and correct it to your needs?

If you supply the start time from the outside, the problem should not exist.

Be aware that in the sample data you supplied 5 lines would NOT print if you define the start time as 09:55:39 (that's the reason I stripped the seconds, BTW).
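
For instance, against the sample data posted above, those 5 lines (09:55:33 through 09:55:38) can be listed with a plain string comparison on the first CSV field, file being the name used in the commands above:

awk -F, 'NR > 1 && $1 < "2019-05-20 09:55:39"' file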

Sorry RudiC, I can see that the first line already converts the timestamp to epoch seconds including the seconds; it is something in the body of the script that needs to be modified.
I tried, but I was not successful; I am very sorry for that, as I am very new to awk.
Could you please help me with this, if you don't mind?

$ paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file |head -3
timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,failureMessage,bytes,grpThreads,allThreads,URL,Latency,IdleTime
1558360539      2019-05-20 09:55:39.945,142,GETHome,200,OK,SequenceLength 1-50,text,true,,9937,1,2,http://bus00bpo.us.oracle.com:8181/,141,0
1558360533      2019-05-20 09:55:33.820,217,GETHome,200,OK,SequenceLength 1-50,text,true,,9937,1,2,http://bus00bpo.us.oracle.com:8181/,217,0
$

I got the script to work with seconds: I just removed the - $1%60 so that BDT keeps the full first timestamp, and it works fine for 09:55:39-10:55:38.
Thanks for the help.

Final working script:

paste <(date +%s -f<(cut -d, -f1 file) 2>&1) file | awk -F"\t" '
NR == 1         {sub ($1 FS, "")
                 HD = $0
                }
NR == 2         {BDT = $1                       # keep the full first timestamp, including seconds
                }
NR >  1         {HCNT = int (($1 - BDT)/3600)
                 if (HCNT > 23) exit
                 FN = "summaryReport_LoadTest_" HCNT ".csv"
                 sub ($1 FS, "")
                 if (!X[FN]++) print HD  >  FN
                 print  >  FN
                }
'
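
As an illustrative add-on (not part of the solution above), one way to sanity-check the split is to compare data line counts; each output file repeats the header once, and the totals should only differ by lines falling outside the 24-hour window:

awk 'END {print NR - 1}' file                              # data lines in the source
cat summaryReport_LoadTest_*.csv | grep -vc '^timeStamp'   # data lines written to the split files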