Grouping and Subgrouping using awk

I have data that looks like:

1440993600|L|ABCDEF
1440993600|L|ABCD
1440993601|L|ABCDEF
1440993602|L|ABC
1440993603|L|ABCDE
.
.
.

1441015200|L|AB
1441015200|L|ABC
1441015200|L|ABCDEF

So basically, $1 is an epoch date, and $2 and $3 are application data.

From one of the threads I was following, I am using this code to group the 1st column based on the interval:

awk -F "|" '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print "    "strftime("%F %T",(i-1)*w)" to "strftime("%F %T",i*w-1)"|"A[i]}' w=3600 ${work_file} |tail
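
Spelled out with comments, that one-liner does the following (strftime() is a gawk extension, so gawk is assumed; ${work_file} comes from the surrounding script):

awk -F "|" '
        {while ($1 >= t*w) t++                  # bump t until t*w passes the epoch in $1
         A[t]++                                 # count this record in bucket t
        }
END     {for (i=1; i<=t; i++)                   # walk every bucket from epoch 0 up to the last
             print "    " strftime("%F %T",(i-1)*w) " to " strftime("%F %T",i*w-1) "|" A[i]
        }
' w=3600 ${work_file} | tail                    # tail keeps only the last 10 buckets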

What I need is to sub-group based on the 3rd column as well.
Meaning, if the interval is, say, 3600 (1 hour), the output should be:

1440993600 to 1440997200|ABCDEF|2
1440993600 to 1440997200|ABCD|1
1440993600 to 1440997200|ABC|1
1440993600 to 1440997200|ABCDE|1
1440997200 to 1441000800|ABCDE|12
1440997200 to 1441000800|ABC|3
1441000800 to 1441004400|ABCD|5
1441000800 to 1441004400|ABCDE|3
1441000800 to 1441004400|ABCDEF|7
.
.
.

Any attempts from your side?

Anyhow, try:

awk -F\| '
                {INTV=int($1/3600)              # interval index: epoch divided by width
                 I[INTV]++                      # remember each interval seen
                 AD[$3]++                       # remember each $3 value seen
                 T[INTV,$3]++                   # count per (interval, $3) combination
                }
END             {for (i in I) for (a in AD) if (T[i,a]) print i*3600 " to " (i+1)*3600, a, T[i,a]
                }
' OFS="|" file
1441015200 to 1441018800|AB|1
1441015200 to 1441018800|ABCDEF|1
1441015200 to 1441018800|ABC|1
1440993600 to 1440997200|ABCDEF|2
1440993600 to 1440997200|ABCDE|1
1440993600 to 1440997200|ABCD|1
1440993600 to 1440997200|ABC|1
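
Note that for (i in I) visits the intervals in no particular order, which is why 1441015200 prints above 1440993600 here. Since every output line starts with the interval's epoch, appending a numeric sort restores chronological order:

awk -F\| ' ... ' OFS="|" file | sort -n

(where ' ... ' stands for the script above).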

Another awk:

awk '$1>i{i=$1+3600} {A[i FS $3]++} END{for(i in A) {$0=i; print $1-3600 " to " $0, A[i]}}' FS=\| OFS=\| file | sort
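
The same logic unpacked, with comments (behavior unchanged):

awk '
$1 > i  {i = $1 + 3600}                 # new interval: its upper bound is this epoch + 3600
        {A[i FS $3]++}                  # count per composite key "upperbound|$3"
END     {for (i in A) {
             $0 = i                     # re-split the key so $1 holds the upper bound
             print $1-3600 " to " $0, A[i]
         }
        }
' FS=\| OFS=\| file | sort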

Thanks All

In relation to the same data file, I am trying to group using the command I mentioned above:

awk -F "|" '{while ($1>=t*w) t++; A[t]++} END {for (i=1;i<=t;i++) print "    "strftime("%F %T",(i-1)*w)" to "strftime("%F %T",i*w-1)"|"A[i]}' w=3600 ${work_file} |tail

This code behaves correctly for a small data set in a file.
The moment the data grows to some 2M records, it no longer produces correct results.

For example, with the same data layout as above, when I run it over 2M records I get output like this:

    2015-09-01 00:00:00 to 2015-09-01 00:59:59|377387
    2015-09-01 01:00:00 to 2015-09-01 01:59:59|372157
    2015-09-01 02:00:00 to 2015-09-01 02:59:59|386135
    2015-09-01 03:00:00 to 2015-09-01 03:59:59|335708
    2015-09-01 04:00:00 to 2015-09-01 04:59:59|382802
    2015-09-01 05:00:00 to 2015-09-01 05:59:59|6449915

The last count, 6449915, is not correct.
It works out as

total records in the file minus (377387 + 372157 + 386135 + 335708 + 382802); in other words, every remaining record is lumped into the last interval, which is obviously not correct.

@Scrutinizer: brilliant and terse; it works well if the first epoch time in a group is an integer multiple of 3600, but it shifts the entire range output if that time has some additional seconds in it (e.g. a first record at 1440993601 yields the range 1440993601 to 1440997201).

@hemanty4u: With the data as given, t is calculated by successive additions of 1, i.e. 400277 loop iterations, while a single division yields the same result. And in the END section you print your output line 400283 times. No surprise the run time is that long! Try for (a in A) print a, A[a] in the END section instead.
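
Put together, both suggestions applied to that command would look something like this (a sketch, untested against the 2M-record file; gawk is again assumed for strftime()):

awk -F "|" '
        {A[int($1/w)]++}                # one division per record replaces the while loop
END     {for (a in A)                   # visit only the buckets that actually occur
             print "    " strftime("%F %T",a*w) " to " strftime("%F %T",(a+1)*w-1) "|" A[a]
        }
' w=3600 ${work_file} | sort | tail

A side benefit: each record's bucket is computed from its own timestamp, so an out-of-order record can no longer be counted in whatever bucket t happened to point at, which is one plausible explanation for the inflated 6449915 above.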


@RudiC: Good point, thanks. So this adaptation should improve things:

awk '$1>i{i=int($1/t)*t+t} {A[i FS $3]++} END{for(i in A) {$0=i; print $1-t " to " $0, A[i]}}' t=3600 FS=\| OFS=\| file | sort
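
To see the effect of the alignment: for a file whose first record carries epoch 1440993601, the new first action gives

i = int(1440993601/3600)*3600 + 3600 = 400276*3600 + 3600 = 1440997200

so the printed range is 1440993600 to 1440997200, snapped to the hour boundary, rather than the shifted 1440993601 to 1440997201 that the earlier version produced.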