Shell script to count lines and sum numbers from multiple files

I want to count the number of lines (the result must be a number) and sum the last numeric column. I have been doing this one at a time, but I need it to run from crontab, so it has to be a script. Here are my lines:

It counts the number of lines:

egrep -i String file_name_201611* | egrep -i "cdr,20161115" | awk -F"," '{print $4}' | sort | uniq | wc -l

This sums the values of the last column:

egrep -i String file_name_201611* | egrep -i ".cdr,20161115"| awk -F"," '{print $8}' | paste -s -d"+" |bc

Lines looks like:

COMGPRS,CGSCO05,COMGPRS_CGSCO05_400594.dat,processed_cdr_20161117100941_00627727.cdr,20161117095940,20161117,18,46521

The expected output:

CGSCO05,sum_#_lines, Sum_$8
CGSCO05, 225, 1500

Any idea?

That request is not too clear, and the missing sample input data doesn't help either. Just a guess based on some assumptions (untested!):

awk -F, '/String/ && /\.cdr,20161115/ {CNT4++; SUM8 += $8} END {print $2, CNT4, SUM8} ' OFS=, file_name_201611*

Here, the uniq effect is not accounted for, nor is the case-insensitive matching of the search strings. The field 2 printed is the one from the last line in the input stream.
If you can't live with any of these shortcomings, be way clearer in your description.
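If the uniq effect and the case-insensitive matching do matter, an untested sketch along these lines handles both inside awk itself. The demo rows below are fabricated for illustration; a real run would glob file_name_201611* instead, and /cgsco05/ stands in for whatever "String" really is:

```shell
# Fabricated rows in the shape of the thread's CSV lines; a real run
# would glob file_name_201611* instead of this temp file.
cat > /tmp/file_name_20161115.csv <<'EOF'
COMGPRS,CGSCO05,a.dat,one.cdr,20161115060108,20161115,18,100
COMGPRS,CGSCO05,b.dat,two.cdr,20161115060109,20161115,18,200
COMGPRS,CGSCO05,c.dat,two.cdr,20161115060110,20161115,18,300
EOF

# tolower() replaces egrep -i; the seen[] array replaces sort|uniq|wc -l.
awk -F, -v OFS=, 'tolower($0) ~ /cgsco05/ && /\.cdr,20161115/ {
        if (!seen[$4]++) cnt++      # count unique field-4 values
        sum += $8                   # running total of the last column
    }
    END { print $2, cnt + 0, sum + 0 }' /tmp/file_name_20161115.csv
```

As before, the $2 printed in the END block is simply whatever was on the last line read, so it is only meaningful if every matched line carries the same field 2.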

Hi RudiC, I am sorry, my mistake; I was not clear. Let me explain:

I have a directory with .CSV files named like this: ESTCOL_GPRS_201611*. There are a lot of these files, and the * part is the hour, minute and seconds. The contents of these files are lines like this:

COMGPRS,CGHW12,COMGPRS_CGHW12_610617.dat,processed_cdr_20161117061743_01680861.cdr,20161117060116,20161117,225,42832

I want to count the unique occurrences of column 2, sum all of them, and also sum the values of column 8, to get output like this:
Files Records
2,433 , 119,930,636

I have been trying something like this, but I have not achieved it yet:

#!/bin/awk -f
BEGIN {
        FS=",";
}
{
        if (($1 == "COMGPRS") && ($2 == "ALK_01P")) {
                if (substr($5,1,8) == "20161115") {
                    sum+=$8;
                }
        }
}
END {
                print "Registros," $2 "," $sum;
}

I have not even managed to count the number of files (the unique lines).
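For reference, a repaired sketch of the script above (untested against the real files): `$sum` in the END block is a bug in awk, since it means "the field whose number is sum" rather than the variable itself, and a seen[] array can supply the unique-file count that is still missing. The two demo rows here are fabricated to match the script's ALK_01P / 20161115 filters:

```shell
# Two fabricated rows matching the script's ALK_01P / 20161115 filters.
cat > /tmp/demo.csv <<'EOF'
COMGPRS,ALK_01P,a.dat,one.cdr,20161115060108,20161115,18,100
COMGPRS,ALK_01P,b.dat,two.cdr,20161115060109,20161115,18,250
EOF

awk -F, -v OFS=, '
    $1 == "COMGPRS" && $2 == "ALK_01P" && substr($5, 1, 8) == "20161115" {
        if (!seen[$4]++) files++    # unique .cdr names (field 4)
        sum += $8                   # the variable is sum, never $sum
    }
    END { print "Registros", files + 0, sum + 0 }' /tmp/demo.csv
```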

Given that the only line that you have shown us from your input file(s) is not matched by either of the egreps in either of your pipelines, it is hard to guess how to create test data that might be used to see if we are correctly interpreting your requirements.

Your 1st pipeline seems to be attempting to count a number of unique field #4 values. But your expected output shows sum_#_lines ... What is being summed?

Your 2nd pipeline seems straightforward, but one wonders why the patterns being matched by the 2nd egrep in those two pipelines are different?

And, of course, the search patterns used in the awk script shown in post #3 do not seem to have any relationship to what you showed us in post #1.

Please show us a small set of sample input lines and then show us the exact output that should be produced from that sample along with a clear explanation of the logic used to produce that output from your sample input.

Hi Don Cragun, I am sorry for the confusion.

  1. The number of unique occurrences, i.e. how many unique files:
egrep -i ALK_01P ESTCOL_GPRS_201611* | egrep -i "cdr,20161116" | awk -F"," '{print $4}' | sort | uniq | wc -l

The result:

2433

This result is the total number of files.

  2. The second is to sum the values of column 8. Here are some sample lines:

COMGPRS,ALK_01S,COMGPRS_ALK_01S_018555.dat,processed_cdr_20161117055325_00018556.cdr,20161117060108,20161117,18,45533
COMGPRS,MEG_03P,COMGPRS_MEG_03P_030770.dat,processed_cdr_20161117055016_00033056.cdr,20161117060109,20161117,225,49187
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400108.dat,processed_cdr_20161117060701_00627241.cdr,20161117060109,20161117,18,46050
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400109.dat,processed_cdr_20161117060757_00627242.cdr,20161117060110,20161117,18,45848
COMGPRS,ALK_01S,COMGPRS_ALK_01S_018556.dat,processed_cdr_20161117055449_00018557.cdr,20161117060111,20161117,18,45089
COMGPRS,MEG_03P,COMGPRS_MEG_03P_030771.dat,processed_cdr_20161117055108_00033057.cdr,20161117060112,20161117,225,48409
COMGPRS,CGHW12,COMGPRS_CGHW12_610616.dat,processed_cdr_20161117061631_01680860.cdr,20161117060112,20161117,225,43037
COMGPRS,MEG_03P,COMGPRS_MEG_03P_030772.dat,processed_cdr_20161117055201_00033058.cdr,20161117060112,20161117,225,49096
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400110.dat,processed_cdr_20161117060852_00627243.cdr,20161117060113,20161117,18,45474
COMGPRS,MEG_03P,COMGPRS_MEG_03P_030773.dat,processed_cdr_20161117055253_00033059.cdr,20161117060113,20161117,225,48855
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400111.dat,processed_cdr_20161117060947_00627244.cdr,20161117060114,20161117,18,45229
COMGPRS,CGHW12,COMGPRS_CGHW12_610617.dat,processed_cdr_20161117061743_01680861.cdr,20161117060116,20161117,225,42832
COMGPRS,CGHW12,COMGPRS_CGHW12_610618.dat,processed_cdr_20161117061852_01680862.cdr,20161117060120,20161117,225,43142
COMGPRS,ALK_02P,COMGPRS_ALK_02P_030792.dat,processed_cdr_20161117054847_00032422.cdr,20161117060206,20161117,225,48781
COMGPRS,ALK_02P,COMGPRS_ALK_02P_030793.dat,processed_cdr_20161117054941_00032423.cdr,20161117060206,20161117,225,47695
COMGPRS,CGVEN08,COMGPRS_CGVEN08_770418.dat,processed_cdr_20161117061228_02136487.cdr,20161117060207,20161117,225,42512
COMGPRS,ALK_02P,COMGPRS_ALK_02P_030794.dat,processed_cdr_20161117055035_00032424.cdr,20161117060207,20161117,225,48761
COMGPRS,ALK_02P,COMGPRS_ALK_02P_030795.dat,processed_cdr_20161117055129_00032425.cdr,20161117060208,20161117,225,48990
COMGPRS,ZCGHW4,COMGPRS_ZCGHW4_493748.dat,processed_cdr_20161117060216_03231049.cdr,20161117060208,20161117,225,42921
COMGPRS,ALK_02P,COMGPRS_ALK_02P_030796.dat,processed_cdr_20161117055221_00032426.cdr,20161117060209,20161117,225,48149
COMGPRS,CGVEN16,COMGPRS_CGVEN16_500074.dat,processed_cdr_20161117061325_01657026.cdr,20161117060209,20161117,225,42554
COMGPRS,CGVEN08,COMGPRS_CGVEN08_770419.dat,processed_cdr_20161117061315_02136488.cdr,20161117060211,20161117,225,42232
COMGPRS,ZCGHW4,COMGPRS_ZCGHW4_493749.dat,processed_cdr_20161117060359_03231050.cdr,20161117060213,20161117,225,42849
COMGPRS,CGVEN16,COMGPRS_CGVEN16_500075.dat,processed_cdr_20161117061452_01657027.cdr,20161117060213,20161117,225,42561

I am trying to make this in shell; that's the reason for my second post.
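As a guess at tying the two requests together, one awk pass can produce the Files/Records layout from post #3. Below it runs over a cut-down here-document of the sample above; a real run would read ESTCOL_GPRS_201611* instead, and the date filter on field 5 is an assumption on my part:

```shell
# Four rows copied from the sample in this thread; a real run would
# read ESTCOL_GPRS_201611* instead of the here-document.
awk -F, -v OFS=, '
    $5 ~ /^20161117/ {              # date filter, analogous to "cdr,20161116"
        if (!seen[$4]++) files++    # unique .cdr names, as in sort|uniq|wc -l
        records += $8               # sum of the last column
    }
    END {
        print "Files", "Records"
        print files + 0, records + 0
    }' <<'EOF'
COMGPRS,ALK_01S,COMGPRS_ALK_01S_018555.dat,processed_cdr_20161117055325_00018556.cdr,20161117060108,20161117,18,45533
COMGPRS,MEG_03P,COMGPRS_MEG_03P_030770.dat,processed_cdr_20161117055016_00033056.cdr,20161117060109,20161117,225,49187
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400108.dat,processed_cdr_20161117060701_00627241.cdr,20161117060109,20161117,18,46050
COMGPRS,CGSCO05,COMGPRS_CGSCO05_400109.dat,processed_cdr_20161117060757_00627242.cdr,20161117060110,20161117,18,45848
EOF
```

On these four rows it prints Files,Records on one line and 4,186618 on the next; the thousands separators shown in post #3 would need locale-dependent printf formatting and are left out here.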

You are quite economical with the facts. I'm afraid I can't be of any further help unless way more details are revealed.

There seems to be ONE SINGLE output line for the entire stream. Why, then, the uniq function?
And, please be more precise and unambiguous: which unique field is to be counted: $4 as in the pipe in post #1, $2 as commented ("for the column 2") in post #3, or the combination of $1 and $2 as in the code snippet in post #3?

What would be the result for your sample data lines in post #5? Applying the pipe from that post yields 0. Please show the expected output and the logic to be applied to achieve it, in plain English.