Count lines with awk if statements

Hi Everybody,

I want to count lines in many files, but only the lines that meet a condition. I have something like this:

cat /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016* | awk 'BEGIN{FS=",";}{ if (substr($5,1,8)=='$DATE'){a[FILENAME]++} END{for(i in a)print a[i]}}'
DATE=$(date +%Y%m%d -d "1 day ago")

But it has a bug. Can anybody help me? Thank you.

Hi,

Can you try like this?

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) { print i, a[i] } }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*

If needed, compare substr($5,1,8) with y instead of matching the whole line, so that the date is only looked for in field 5.
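
A minimal sketch of that substr() variant, assuming the date sits in the first 8 characters of field 5 as in the first post. The scratch directory, the file names f1 and f2, and the sample rows are made up for illustration, and the date is hard-coded so the result is reproducible.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not the real ESTCOL files).
tmp=$(mktemp -d)
printf '%s\n' 'A,B,C,D,20161208150007,x' 'A,B,C,D,20161209150007,x' > "$tmp/f1"
printf '%s\n' 'A,B,C,D,20161208150007,x' 'A,B,C,20161208,20161207150007,x' > "$tmp/f2"

DATE=20161208   # fixed for the example; the thread uses GNU date to get yesterday

# Strict: compare only the first 8 characters of field 5 with the date.
strict=$(awk -F, -v y="$DATE" 'substr($5,1,8) == y { a[FILENAME]++ }
    END { for (i in a) print i, a[i] }' "$tmp"/f* | sort)

# Loose: whole-line regex match, which also hits the date sitting in field 4.
loose=$(awk -F, -v y="$DATE" '$0 ~ y { n++ } END { print n + 0 }' "$tmp"/f*)

echo "$strict"
echo "whole-line matches: $loose"
rm -rf "$tmp"
```

On this sample the strict version counts one line per file, while the whole-line match counts three lines in total, because the last row of f2 carries the date in field 4 only.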


Thank you, it works!!!

Is there some way to save the result in an array and then sum it to get a single value?

Hello Elly,

I am not completely sure what you mean.
i- If you want only the number of matches per file for the given date, the following may help:

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) { print a[i] } }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*

ii- If you want a collective sum over all the files processed, the following may help (not tested, though):

DATE=$(date +%Y%m%d -d "1 day ago")
awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ } END { for (i in a) { SUM += a[i] }; print SUM }' /path1/usr/STAT/GPRS/ESTCOL_GPRS_2016*
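
If only the grand total is needed, the array is optional and a plain counter should give the same number. A small sketch on made-up data (the scratch directory, the file names and the rows are assumptions for illustration, not the real files):

```shell
#!/bin/sh
# Hypothetical sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'A,B,C,D,20161208150007' 'A,B,C,D,20161209150007' > "$tmp/f1"
printf '%s\n' 'A,B,C,D,20161208150007' 'A,B,C,D,20161208160000' > "$tmp/f2"
DATE=20161208   # fixed for the example

# Variant 1: per-file counts in an array, summed up in END.
total=$(awk -F, -v y="$DATE" '$0 ~ y { a[FILENAME]++ }
    END { for (i in a) sum += a[i]; print sum + 0 }' "$tmp"/f*)

# Variant 2: a single running counter, no array at all.
counter=$(awk -F, -v y="$DATE" '$0 ~ y { n++ } END { print n + 0 }' "$tmp"/f*)

echo "array sum: $total, counter: $counter"
rm -rf "$tmp"
```

The `+ 0` forces a numeric 0 to be printed when nothing matches, instead of an empty line.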

Thanks,
R. Singh

Hi RavinderSingh13, thank you very much,

I have made some tests with your help, and for my case this way is much more comfortable:

awk 'BEGIN{FS=",";}{ if (substr($5,1,8)=="20161208") a[$2]++ } END { for (i in a) { print i "," a[i]}}'

For a file with lines like this:

COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007,20161209,225,51535

the result is:

ALK_01P,2540

But the value 2540 is not correct; it should be 2498. If I replace

a[$2]++

with

a[$4]++

then the output lists every $4 value, that is, strings like processed_cdr_20161209144744_00101038.cdr, each with its own count. If I sum all these lines, I get the correct number, 2498. So I guess the problem is the increment (++): what I need is the sum over all these $4 lines.

Thank you very much

Hi,

It depends on what you have in $5 and on whether the if condition succeeds.
The counter is incremented only when the if condition is true.

I tried the following and it looks fine:

cat f1
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161208144744_00101038.cdr,20161208150007 ,20161209,225,51535
cat f2
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161208144744_00101038.cdr,20161208150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20151209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20151208144744_00101038.cdr,20161208150007 ,20161209,225,51535
cat f3
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161208150007 ,20161209,225,51535
COMGPRS,ALK_01P,COMGPRS_ALK_01P_095398.dat,processed_cdr_20161209144744_00101038.cdr,20161209150007 ,20161209,225,51535
awk -F, '{ if (substr($5,1,8)=="20161208") a[$2]++ } END { for (i in a) { print i "," a[i]}}' *

This prints ALK_01P,4, because $5 matches in only 4 lines across those 3 files and $2 is the same in all of those matches.

awk -F, '{ if (substr($5,1,8)=="20161208") a[$4]++ } END { for (i in a) { print i "," a[i]}}' *

This gives the output below, because $5 matches (same as above) in only 4 lines from those files, BUT $4 differs among those matches.

processed_cdr_20151208144744_00101038.cdr,1
processed_cdr_20161209144744_00101038.cdr,1
processed_cdr_20161208144744_00101038.cdr,2

If it does not help, please share sample input & expected output.


You lost me. I couldn't imagine WHAT you really need.

In post#1, you cat all matching files into a pipe to awk and then count into array a indexed by FILENAME. As there's only ONE single stream (from cat), there will be just one element with index "-".

  • This has been cured in the proposals by greet_sed and RavinderSingh13.
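
A rough way to see the difference, sketched on three made-up files in a scratch directory (names and contents are assumptions). It only counts how many distinct FILENAME values awk sees, so it does not rely on what the name of standard input looks like in a particular awk:

```shell
#!/bin/sh
# Three hypothetical input files.
tmp=$(mktemp -d)
printf 'x\n'    > "$tmp/f1"
printf 'x\nx\n' > "$tmp/f2"
printf 'x\n'    > "$tmp/f3"

# Count the distinct indices that end up in the array keyed by FILENAME.
prog='{ a[FILENAME]++ } END { for (i in a) n++; print n + 0 }'

piped=$(cat "$tmp"/f* | awk "$prog")   # one anonymous stream on stdin
direct=$(awk "$prog" "$tmp"/f*)        # three named files

echo "piped: $piped, direct: $direct"
rm -rf "$tmp"
```

With the pipe there is a single array element for all input; with the files passed as arguments there is one element per file.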

Still your problem is not clear.

The count of lines with substr ($5,1,8) matching $DATE CANNOT depend on the index ($2 / $4 ?) of the a array. WHY should there be different counts (2540 <-> 2498)?

And, 20161208 doesn't match $5 in your sample, so count must be zero.
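
To illustrate the point about the array index, here is a small sketch on made-up rows (the data, the scratch directory and the helper name sum_by are hypothetical). For one and the same input, summing the per-key counts gives the same total whichever field is used as the key:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Made-up rows: $2 is the same everywhere, $4 varies, $5 starts with a date.
cat > "$tmp/in" <<'EOF'
T,ALK_01P,x,cdr_a,20161208150007
T,ALK_01P,x,cdr_b,20161208150007
T,ALK_01P,x,cdr_a,20161208150007
T,ALK_01P,x,cdr_c,20161209150007
EOF

# Sum of all per-key counts, with the array keyed by field number $1 of the helper.
sum_by() {
    awk -F, -v k="$1" 'substr($5,1,8) == "20161208" { a[$k]++ }
        END { for (i in a) s += a[i]; print s + 0 }' "$tmp/in"
}

by2=$(sum_by 2)
by4=$(sum_by 4)
echo "keyed by \$2: $by2, keyed by \$4: $by4"
rm -rf "$tmp"
```

If the two totals differ on real data, the two runs most likely did not read the same input or did not apply the same condition.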

Why don't you take a step back and start over, carefully (re)formulating your specification, supplying a reasonable set of input data and a desired output format, and the logic connecting the two?