Grepping a timestamp range and assigning it to a constant

Hi,

I couldn't find anything on Google about this and have been trying to figure it out, but I am not getting anywhere. I want to know if it's possible through a script. I have a file with columns start time and end time separated by a comma; there are some other columns I need to work on as well, but first I want to get this time thing sorted. Coming back to the question:

Can the start time and end time be grouped and then assigned to a constant, say CUT1, CUT2, etc., or is it better to do this manually?

For example, a time between 15 and 17 is CUT 1, 18-20 is CUT 2, and 00-03 is CUT 3:

START_TIME,END_TIME
16:17:10,16:24:48
16:17:10,16:17:19
16:17:11,16:17:13
16:17:14,16:23:13
16:17:18,16:26:40
16:17:22,16:22:58
16:18:30,16:25:40
16:18:38,16:23:51
16:18:53,16:25:19
16:19:07,16:19:10
16:19:11,16:26:43
16:19:47,16:19:51
16:19:54,16:29:43
16:20:02,16:20:11
16:20:04,16:20:19
16:20:06,16:20:16
00:47:31,00:47:34
00:47:37,00:48:05
00:47:43,00:57:26
00:47:43,00:48:03
00:47:48,00:55:38
00:47:49,00:47:57
16:20:08,16:20:20
16:20:10,16:21:08
16:20:12,16:20:31
18:30:54,18:32:57
18:31:56,18:32:09
18:32:08,18:32:14
18:32:17,18:32:22
01:55:32,02:04:02

Thank you

huh??

:slight_smile: I knew I was not stating the requirements correctly. Anyway, thanks for at least taking the time to read it.

The original file is delimited by ','. It has columns ORIGINAL_FILE_NAME,START_TIME,END_TIME:

pvis_POB_80byte_mccy_1303170025574.inc,16:17:10,16:24:48
pvis_POU_80byte_mccy_1303172240000.inc,00:47:31,00:47:34
pvis_MAC_80byte_mccy_1303171715074.inc,18:31:56,18:32:09

This is the format that I am expecting. If you notice, I have replaced the 2nd and 3rd columns with CUT 1/CUT 2/CUT 3:

pvis_POB_80byte_mccy_1303170025574.inc,CUT 1
pvis_POU_80byte_mccy_1303172240000.inc,CUT 3
pvis_MAC_80byte_mccy_1303171715074.inc,CUT 2

I was actually going through a sed tutorial on the search-and-replace function and was wondering if I could do a search saying: if the 2nd column (delimited by a comma) falls between 15:00:00 and 17:00:00, replace it with CUT 1; similarly for 18:00:00-21:00:00 (CUT 2); and for 00:00:00-03:00:00, replace it with CUT 3. Once that is done, I will pipe it to an awk substr on the 1st column, with '_' as the delimiter, and take the 1st and 2nd fields of the file name, as sketched below.
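
A minimal awk sketch of that second half, pulling the first two '_'-separated pieces out of the file name (this assumes the substitution step has already produced the filename,CUT lines shown above; file is a placeholder name):

awk -F, '{ split($1, F, "_"); print F[1] "_" F[2], $2 }' file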

What I ultimately want is:

pvis_POB CUT 1
pvis_POU CUT 3
pvis_MAC CUT 2

Hope I am making sense now

Unfortunately, the relation between start hour and the CUT number is not linear; otherwise this might work (note how the POU line comes out as CUT 4 rather than the desired CUT 3):

awk -F, '{sub (/_[^_]*_[^_]*_[^_]*$/, _, $1); split ($2, T, ":"); print $1, "CUT", int ((T[1]+21)%24 /3)-3} ' file
pvis_POB CUT 1
pvis_POU CUT 4
pvis_MAC CUT 2
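
The arithmetic behind it, worked through: (T[1]+21)%24 rotates the clock so that hour 15 becomes 12, then integer division by 3 minus 3 yields bin 1 for hours 15-17 and bin 2 for hours 18-20; but the midnight hours 00-02 rotate to 21-23 and come out as bin 4, not the desired CUT 3 — hence the non-linearity problem.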

Why not just use awk to do all of that in the first place? Unlike sed, awk actually has a proper 'greater than' operator. Even better, it has variables and math, so it can actually calculate which range it's in.

$ cat inputfile

junk,00:00:00,00:00:00
junk,00:30:00,00:00:00
junk,01:00:00,00:00:00
junk,01:30:00,00:00:00
junk,02:00:00,00:00:00
junk,02:30:00,00:00:00
junk,03:00:00,00:00:00
junk,03:30:00,00:00:00
junk,04:00:00,00:00:00
junk,04:30:00,00:00:00
junk,05:00:00,00:00:00
junk,05:30:00,00:00:00
junk,06:00:00,00:00:00
junk,06:30:00,00:00:00
junk,07:00:00,00:00:00
junk,07:30:00,00:00:00
junk,08:00:00,00:00:00
junk,08:30:00,00:00:00
junk,09:00:00,00:00:00
junk,09:30:00,00:00:00
junk,10:00:00,00:00:00
junk,10:30:00,00:00:00
junk,11:00:00,00:00:00
junk,11:30:00,00:00:00
junk,12:00:00,00:00:00
junk,12:30:00,00:00:00
junk,13:00:00,00:00:00
junk,13:30:00,00:00:00
junk,14:00:00,00:00:00
junk,14:30:00,00:00:00
junk,15:00:00,00:00:00
junk,15:30:00,00:00:00
junk,16:00:00,00:00:00
junk,16:30:00,00:00:00
junk,17:00:00,00:00:00
junk,17:30:00,00:00:00
junk,18:00:00,00:00:00
junk,18:30:00,00:00:00
junk,19:00:00,00:00:00
junk,19:30:00,00:00:00
junk,20:00:00,00:00:00
junk,20:30:00,00:00:00
junk,21:00:00,00:00:00
junk,21:30:00,00:00:00
junk,22:00:00,00:00:00
junk,22:30:00,00:00:00
junk,23:00:00,00:00:00
junk,23:30:00,00:00:00

$ awk -F"," -v OFS="," '{
        split($2, A, ":");
        sub(/^0/, "", A[1]); # Remove leading zero
        A[1] += 0; # Convert from string into number
        $2=sprintf("CUT %d", (A[1] / 3) + 1); # Calculate and truncate to integer
        NF=2 # Cut off third field
} 1' inputfile

junk,CUT 1
junk,CUT 1
junk,CUT 1
junk,CUT 1
junk,CUT 1
junk,CUT 1
junk,CUT 2
junk,CUT 2
junk,CUT 2
junk,CUT 2
junk,CUT 2
junk,CUT 2
junk,CUT 3
junk,CUT 3
junk,CUT 3
junk,CUT 3
junk,CUT 3
junk,CUT 3
junk,CUT 4
junk,CUT 4
junk,CUT 4
junk,CUT 4
junk,CUT 4
junk,CUT 4
junk,CUT 5
junk,CUT 5
junk,CUT 5
junk,CUT 5
junk,CUT 5
junk,CUT 5
junk,CUT 6
junk,CUT 6
junk,CUT 6
junk,CUT 6
junk,CUT 6
junk,CUT 6
junk,CUT 7
junk,CUT 7
junk,CUT 7
junk,CUT 7
junk,CUT 7
junk,CUT 7
junk,CUT 8
junk,CUT 8
junk,CUT 8
junk,CUT 8
junk,CUT 8
junk,CUT 8

$

still not sure how exactly you're doing the CUTting, but....

awk -F_ '{print $1 FS $2, "CUT " FNR}' myFile
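
(FNR here is just the record number standing in for the real CUT index, since the rule for assigning CUTs wasn't clear yet.)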

What I meant was: if the start time and end time are between 3 PM and 5 PM, then 'CUT 1' gets printed on the screen; if the time is between 6 PM and 10 PM, then 'CUT 2' gets printed; and finally, if the time is after 00:00:00 hrs until 3 AM, 'CUT 3' gets printed.

Something like below:

cat infile | sed 's/^1[5-7]/CUT1/'
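
For comparison, that exact non-linear mapping takes only a few explicit comparisons in awk. A sketch, assuming the three-column file shown earlier and the same placeholder name infile; the "NO CUT" fallback for hours outside the three ranges is an assumption:

awk -F, '{
        h = $2 + 0                                 # numeric conversion keeps just the leading hour
        if      (h >= 15 && h <= 17) c = "CUT 1"   # 3 PM - 5 PM
        else if (h >= 18 && h <= 22) c = "CUT 2"   # 6 PM - 10 PM
        else if (h <= 3)             c = "CUT 3"   # midnight - 3 AM
        else                         c = "NO CUT"  # assumed fallback
        print $1, c
}' infile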

@Corona688 I have not used awk much, so I don't know a lot about it, just bits here and there. I might give it a look now that you have suggested it; sed was just what came to mind since I needed a search and replace.

@RudiC @vgersh99 My point is that instead of numbers (timestamps) I want constant strings in place of the timestamps so that I can sort and count them. Because of the differing timestamps I am unable to sort them.

Hope I am clear.

Sounds like you want to place your records into 3-hour bins based on the start time.
If so, then Corona688 guessed it right.

Sorry, no.
You say that "CUT 1" is assigned if the stamp falls into the two-hour interval 15:00h-17:00h, "CUT 2" covers the four hours between 18:00h and 22:00h, and "CUT 3" the three hours from 0:00h till 3:00h. So there's no way to apply a simple algorithm; an "artificial" categorisation is needed.
What if the duration (start to end time) spans two intervals? What if either falls into a gap?

How far did you get with your sed approach?

sed doesn't really have comparisons, variables, expressions, or even numbers... It doesn't have anything you're asking it to do. You'd have to tell sed how to comprehend numbers in the form of many big and ugly regular expressions like /0[0-2]:[0-9][0-9]:[0-9][0-9]/, one per range.
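
For instance, covering just the midnight range would already take something like this (a sketch, untested against the real data), with two more expressions of the same shape for the other ranges:

sed 's/,0[0-2]:[0-9][0-9]:[0-9][0-9]/,CUT 3/' file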

Or you can just use awk, which actually has numbers, variables, expressions, and comparisons, and use these numbers, variables, expressions, and comparisons to solve it in one operation instead of 8 plus further processing.

If the number of lines is intimidating, I bet I can simplify it further given certain assumptions, like column 1 never containing the : character.

@vgersh99, yes, I think Corona688 has understood what I want.

@RudiC: there can never be an overlap in the times, as the CUTs are basically a script loading files. Only when the earlier load finishes does the next load start. But I think that is not the point. Yes, it requires an artificial categorization, but that is only because I know the script will finish running within that particular period of time, i.e. within my artificial cuts.

Here is sample code of what I am trying to do; it's not complete, but it shows what I meant:

$ cat int_log_file_details |awk -F, '{print $2}'|tr -d ' '|sed 's/^16*/CUT1/'
CUT1:17:10
CUT1:17:10
CUT1:17:11
CUT1:17:14
CUT1:17:18
CUT1:17:22
CUT1:18:30
CUT1:18:38
CUT1:18:53
CUT1:19:07
CUT1:19:11
CUT1:19:47
CUT1:19:54
CUT1:20:02
CUT1:20:04
CUT1:20:06
00:47:31
00:47:37
00:47:43
00:47:43
00:47:48
00:47:49
CUT1:20:08
CUT1:20:10
CUT1:20:12
CUT18:30:54
CUT18:31:56
CUT18:32:08
CUT18:32:17
01:55:32

Hope I'm clear.

Let me try to do some work with it.

Lots of code here is less than ideal in terms of brevity and of using the tools available, but doing it in sed is just so pointlessly difficult, versus being so laughably easy in awk, that I really don't understand keeping to the path of maximum resistance. One of these tools is literally meant for the job, having the facilities you need to express the problem; the other isn't. You already use awk for more than '{print $1}', so this is just one step beyond what you know.

Going through your awk code now; let me see what I can come up with. Just to let you know, in the meantime I have come up with the somewhat ugly code below:

[dsiddiqui@lxserv01 scripts]$ cat int_log_file_details
pvis_POB_80byte_mccy_1303170025574.inc,16:17:10,16:24:48
PVIS_PPL_80byte_1303170010001.inc,     16:17:10,16:17:19
pvis_POU_80byte_mccy_1203172240000.inc,16:17:11,16:17:13
pvis_WRT_80byte_mccy_1303171335182.inc,16:17:14,16:23:13
pvis_WRT_80byte_mccy_1303170035595.inc,16:17:18,16:26:40
pvis_POB_80byte_mccy_1203170725442.inc,16:17:22,16:22:58
pvis_POB_80byte_mccy_1303171325241.inc,16:18:30,16:25:40
pvis_POB_80byte_mccy_1303170725547.inc,16:18:38,16:23:51
pvis_WRT_80byte_mccy_1203171335476.inc,16:18:53,16:25:19
pvis_WLB_80byte_mccy_1203171725359.inc,16:19:07,16:19:10
pvis_POB_80byte_mccy_1203171325525.inc,16:19:11,16:26:43
pvis_DSB_80byte_mccy_1203171735351.inc,16:19:47,16:19:51
pvis_POB_80byte_mccy_1203171926002.inc,16:19:54,16:29:43
pvis_MAC_80byte_mccy_1203171715358.inc,16:20:02,16:20:11
pvis_CAD_80byte_mccy_1303170745408.inc,16:20:04,16:20:19
pvis_CAD_80byte_mccy_1203171945363.inc,16:20:06,16:20:16
pvis_POU_80byte_mccy_1303172240000.inc,00:47:31,00:47:34
pvis_CAD_80byte_mccy_1303171945098.inc,00:47:37,00:48:05
pvis_POB_80byte_mccy_1303171925327.inc,00:47:43,00:57:26
PVIS_PPL_80byte_1403170030002.inc,     00:47:43,00:48:03
pvis_POB_80byte_mccy_1403170025289.inc,00:47:48,00:55:38
PVIS_PPL_80byte_1403170010001.inc,     00:47:49,00:47:57
pvis_CAD_80byte_mccy_1203170745323.inc,16:20:08,16:20:20
pvis_80byte_mccy_1203171705377.inc,    16:20:10,16:21:08
PVIS_PPL_80byte_1303170030002.inc,     16:20:12,16:20:31
pvis_80byte_mccy_1303171705113.inc,    18:30:54,18:32:57
pvis_MAC_80byte_mccy_1303171715074.inc,18:31:56,18:32:09
pvis_WLB_80byte_mccy_1303171725075.inc,18:32:08,18:32:14
pvis_DSB_80byte_mccy_1303171735086.inc,18:32:17,18:32:22
pvis_WRT_80byte_mccy_1403170035301.inc,01:55:32,02:04:02
[dsiddiqui@lxserv01 scripts]$ cat int_log_file_details |awk -F, '{print $2}'|tr -d ' '
16:17:10
16:17:10
16:17:11
16:17:14
16:17:18
16:17:22
16:18:30
16:18:38
16:18:53
16:19:07
16:19:11
16:19:47
16:19:54
16:20:02
16:20:04
16:20:06
00:47:31
00:47:37
00:47:43
00:47:43
00:47:48
00:47:49
16:20:08
16:20:10
16:20:12
18:30:54
18:31:56
18:32:08
18:32:17
01:55:32
[dsiddiqui@lxserv01 scripts]$ cat int_log_file_details |awk -F, '{print $2}'|tr -d ' '|sed 's/^\(16.*\)/CUT1/'|sed 's/^\(18.*\)/CUT2/'|sed 's/^\(00.*\)/CUT3/'|sed 's/^\(01.*\)/CUT3/'
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT1
CUT3
CUT3
CUT3
CUT3
CUT3
CUT3
CUT1
CUT1
CUT1
CUT2
CUT2
CUT2
CUT2
CUT3
[dsiddiqui@lxserv01 scripts]$

Simpler version which doesn't need any arrays, just substitution and math:

awk -F"[,:]" -v OFS="," '{
        sub(/^0/, "", $2);      # Remove leading zero
        $2=sprintf("CUT %d", ($2/3)+1); # Set column 2 to "CUT N"
        NF=2                    # Set number of columns to 2
        print                   # print
}' inputfile

In your code it prints CUT 6 for a timestamp of 4 PM, which should actually be CUT 1. CUT 1 is 3 PM-5 PM, CUT 2 is 6 PM-10 PM, and CUT 3 is 00:00-3 AM.

Does it really matter which "cut" corresponds to which time bucket?
You said previously:

Wouldn't bucketing the records into 3-hour bins be sufficient for the above purpose?

Once I figure out which CUT a particular time period belongs to, I'm basically assigning the same CUT name to everything in that time range. Once that is finished, I will group all the incoming files in that CUT and count the total number of files belonging to that CUT, as sketched below.
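
A minimal sketch of that last step, tallying files per CUT in a single awk pass (the range tests repeat the mapping described earlier; the END-block report format is an assumption):

awk -F, '{
        h = $2 + 0                                 # + 0 skips any leading blanks and keeps the hour
        if      (h >= 15 && h <= 17) cut = "CUT 1"
        else if (h >= 18 && h <= 22) cut = "CUT 2"
        else if (h <= 3)             cut = "CUT 3"
        else                         cut = "NO CUT"
        count[cut]++                               # one more file in this CUT
}
END { for (c in count) print c, count[c] }' int_log_file_details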

Try

sed -r 's/,1[567](:[0-9]{2}){2}/ CUT 1/; s/,(1[89]|2[012])(:[0-9]{2}){2}/ CUT 2/; s/,0[012](:[0-9]{2}){2}/ CUT 3/' file
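
If the leftover end-time column should go as well (as in the "what I ultimately want" sample earlier), each pattern can be extended to swallow the rest of the line; a variant sketch:

sed -r 's/,1[567](:[0-9]{2}){2},.*/ CUT 1/; s/,(1[89]|2[012])(:[0-9]{2}){2},.*/ CUT 2/; s/,0[012](:[0-9]{2}){2},.*/ CUT 3/' file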

Sure. I am travelling now; once I reach home I will get back with an update.

Thank you for clarifying this; we've been asking for a while whether that was a typo or not.