Count lines with awk if statements

Elly · December 9, 2016, 2:17pm

Hi Everybody,

I want to count lines in many files, but only the lines that meet a condition. I have something like this:

cat /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016* | awk 'BEGIN{FS=",";}{ if (substr($5,1,8)=='$DATE'){a[FILENAME]++} END{for(i in a)print a[i]}}'
DATE=$(date +%Y%m%d -d "1 day ago")

But it has a bug. Can anybody help me? Thank you.

greet_sed · December 9, 2016, 2:35pm

Hi,

Can you try like this?

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) print i, a[i] }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*

If you need a stricter match, compare substr($5,1,8) with y instead of matching y against the whole line.
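A minimal sketch of that stricter variant, assuming the date sits in the first 8 characters of field 5, as in Elly's original attempt. The sample files in the scratch directory and the fixed DATE value are made up for illustration; in real use DATE would come from the date command above.

```shell
# Hypothetical sample data in a scratch directory (not the real GPRS files).
tmp=$(mktemp -d)
printf '%s\n' 'A,X,f,c,20161208150007,20161209' 'A,X,f,c,20161209150007,20161209' > "$tmp/f1"
printf '%s\n' 'A,X,f,c,20161208150007,20161209' 'A,X,f,c,20161208160000,20161209' > "$tmp/f2"

DATE=20161208   # fixed here so the example is reproducible

# Count, per input file, the lines whose field 5 starts with DATE.
result=$(awk -F, -v y="$DATE" '
    substr($5, 1, 8) == y { a[FILENAME]++ }
    END { for (f in a) print f "," a[f] }
' "$tmp"/f1 "$tmp"/f2 | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Passing the shell variable with -v keeps the awk program in single quotes, so no quoting tricks are needed inside it.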

Elly · December 9, 2016, 4:31pm

Thank you, it works!!!

Is there a way to save the results in an array and then sum them to get a single value?

RavinderSingh13 · December 9, 2016, 6:39pm

Hello Elly,

I am not completely sure what you mean.
i- If you want only the number of matches per file for the given date, the following may help.

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) print a[i] }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*

ii- If you want a collective sum over all the files processed, the following may help (not tested, though).

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) SUM += a[i]; print SUM }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*

Thanks,
R. Singh
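As a hedged illustration of option ii, the sketch below sums the per-file counts into a single number. The scratch files and the fixed DATE are invented for the example, and it uses the substr() comparison on field 5 rather than a whole-line match, which is an assumption about where the date lives.

```shell
# Hypothetical sample data; replace with the real file list.
tmp=$(mktemp -d)
printf '%s\n' 'A,X,f,c,20161208150007' 'A,X,f,c,20161209150007' > "$tmp/f1"
printf '%s\n' 'A,X,f,c,20161208150007' 'A,X,f,c,20161208160000' > "$tmp/f2"

DATE=20161208   # fixed here so the example is reproducible

# Keep per-file counts in a[], then add them up in END.
total=$(awk -F, -v y="$DATE" '
    substr($5, 1, 8) == y { a[FILENAME]++ }
    END {
        for (f in a) sum += a[f]
        print sum + 0    # "+ 0" prints 0 rather than an empty line when nothing matched
    }
' "$tmp"/f1 "$tmp"/f2)

echo "$total"
rm -rf "$tmp"
```

If only the grand total is needed, a plain counter incremented on each matching line would give the same number without the array.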

Elly · December 9, 2016, 7:10pm

Hi RavinderSingh13, thank you very much,

I ran some tests with your help, and for my case this way is much more convenient:

awk 'BEGIN{FS=",";}{ if (substr($5,1,8)=="20161208") a[$2]++ } END { for (i in a) { print i "," a[i] }}'

For a file with lines like this:

COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007,20161209,225,51535

the result is:

ALK_01P,2540

But the value 2540 is not correct; it should be 2498. If I replace

a[$2]++

with

a[$4]++

I get one output line for each $4 value, which contains strings like processed_cdr_20161209144744_00101038.cdr. If I sum the counts on all these lines, I get the correct number, 2498. So I guess the problem is the increment (++): I need the sum of the values over all these lines ($4).

Thank you very much

greet_sed · December 10, 2016, 6:01am

Hi,

It depends on what you have in $5 and whether the if condition succeeds.
The count is incremented only when the if condition is true.

I tried the following and it looks fine:

cat f1
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161208144744_00101038.cdr,20161208150007 ,20161209,225,51535

cat f2
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161208144744_00101038.cdr,20161208150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20151209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20151208144744_00101038.cdr,20161208150007 ,20161209,225,51535

cat f3
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161208150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535

awk -F, '{ if (substr($5,1,8)=="20161208") a[$2]++ } END { for (i in a) { print i "," a[i] }}' *

This gives the output below, because $5 matches in only 4 lines across those 3 files and $2 is the same in all of them.

ALK_01P,4

awk -F, '{ if (substr($5,1,8)=="20161208") a[$4]++ } END { for (i in a) { print i "," a[i] }}' *

This gives the output below, because $5 matches in the same 4 lines as above, BUT $4 differs between those lines.

processed_cdr_20151208144744_00101038.cdr,1
processed_cdr_20161209144744_00101038.cdr,1
processed_cdr_20161208144744_00101038.cdr,2

If it does not help, please share sample input & expected output.

RudiC · December 10, 2016, 11:49am

You lost me. I couldn't imagine WHAT you really need.

In post #1, you cat all matching files into a pipe to awk and then count into array a indexed by FILENAME. As there's only ONE single stream (from cat), there will be just one element with index "-".
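A small sketch of that effect, using two invented scratch files: it only counts how many distinct FILENAME keys end up in the array, since the exact value awk reports for a piped stream can vary between implementations.

```shell
# Two hypothetical input files in a scratch directory.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1"
printf 'c\n' > "$tmp/f2"

# Through cat: awk sees one anonymous stream, so the array gets a single key.
piped=$(cat "$tmp"/f1 "$tmp"/f2 |
    awk '{ a[FILENAME]++ } END { n = 0; for (k in a) n++; print n }')

# As arguments: awk sets FILENAME per file, so there is one key per file.
direct=$(awk '{ a[FILENAME]++ } END { n = 0; for (k in a) n++; print n }' "$tmp"/f1 "$tmp"/f2)

echo "keys via cat: $piped, keys via arguments: $direct"
rm -rf "$tmp"
```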

This has been cured in the proposals by greet_sed and RavinderSingh13.

Still your problem is not clear.

The count of lines with substr ($5,1,8) matching $DATE CANNOT depend on the index ($2 / $4 ?) of the a array. WHY should there be different counts (2540 <-> 2498)?
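To make that point concrete, here is a sketch on made-up sample lines (not the poster's data): the grand total of matching lines comes out the same whether the array is keyed by $2 or by $4, because the key only decides how the matches are grouped.

```shell
# Hypothetical sample: three lines dated 20161208 in field 5, one dated 20161209.
tmp=$(mktemp -d)
printf '%s\n' \
    'COMGPRS,ALK_01P,x.dat,cdr_1.cdr,20161208150007' \
    'COMGPRS,ALK_01P,x.dat,cdr_2.cdr,20161208150007' \
    'COMGPRS,ALK_01P,x.dat,cdr_2.cdr,20161208160000' \
    'COMGPRS,ALK_01P,x.dat,cdr_3.cdr,20161209150007' > "$tmp/sample"

# Same condition, different array key; sum the groups in END.
by_2=$(awk -F, 'substr($5,1,8)=="20161208" { a[$2]++ } END { for (k in a) s += a[k]; print s + 0 }' "$tmp/sample")
by_4=$(awk -F, 'substr($5,1,8)=="20161208" { a[$4]++ } END { for (k in a) s += a[k]; print s + 0 }' "$tmp/sample")

echo "total keyed by \$2: $by_2, total keyed by \$4: $by_4"
rm -rf "$tmp"
```

If two such totals differ on real data, the difference has to come from the input or the condition, not from the choice of key.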

And, 20161208 doesn't match $5 in your sample, so count must be zero.

Why don't you take a step back and start over, carefully (re)formulating your specification, supplying a reasonable set of input data and a desired output format, and the logic connecting the two?