Calculate percentage of a value accross m

I have 100 csv files like:

file_city_1 file_city_2 file_city_3 file_city_4

City name is variable, there is 25 cities, each city has 4 region. Each of the 4 region contain some statistics like:

parameter1 : number1
    parameter1 : number2
    .....
    parameter50 : number50

So I should calculate each region's each parameter percentage in city total for this parameter. So what I want to do is something something like:

    file_city_parameter1_total = parameter1_region1 + parameter1_region2 + parameter1_region3 + parameter1_region4

Then calculate percentage for this specific parameter for each region:

 file_city_region_parameter1_percentage = parameter1_region1 / file_city_parameter1_total * 100 

then echo all parameters percentage over total (sum of 4 regions) for a specific region and should do it for regions (all 100 files)

I actually tried multiple codes but I don't want to paste them here as they are way long to read. I defined all parameters separately for each region, parameters, tried different command substitution and nested loop. But stuck at some point. So I'm wondering what is the easiest way to accomplish this work done with awk, grep, sed or anything else.

Thanks.

Whether it is possible for you to attach 1 sample file ?

Something like this might be you wanted, I will not guarantee you that it will work for your data, since you didn't show file structure, it's just a guess

for i in file_city_* ; do
      awk -F":" '{ sum+=$2; var[$1]=$2 }END{ for( i in var ) print "file_city_region_"i"_percentage = ", var/sum*100 }' $i
done

this is the parameters I copied from the original file:

 cat network_stats_miami_1.csv 
Total Subs                                : 0
VLAN Count                                         : 28129501
Subs Segmentation                                     : 28142
ARPU Segmentation                                         : 0
RTT Delay                             : 3096610
IPv4 Fragmented Count                              : 2809853525
IPv4 Non Fragmented Count                          : 2809853525
Call Drop Rate                                : 1
Connection with Good voice quality                  : 45545345
Standalone Dedicated Control Channel (SDCCH) Congestion : 0   

.
.
.

The format is like that.

---------- Post updated at 07:09 AM ---------- Previous update was at 06:49 AM ----------

This will calculate the summary of all KPI's right? But I want to calculate each KPI's percentage separately. Each KPI's are located in same line for each file.

l also thought something like:

for city in ${CITY} ; do
for region in {$REGION} ; do
for kpi in ${KPI} ; do

#calculate sum of kpi 

eval ${kpi}_${city}_total=$( awk - F ":" '/$kpi/{sum=sum+$2} END {print sum}' ${csv_dir}/${file_base}_${city}* )

# parse a single region kpi value:

eval ${kpi}_${city}_${region}=$( awk -F ":"  '/$kpi/{print $2}' ${csv_dir}/${file}_${city}_${region}.csv )

#then calculate percentage

eval ${kpi}_${city}_${region}_percentage=$( awk -v  var1=$(eval echo \${kpi}_${city}_${region}) -v var2=$(eval echo \${kpi}_${city}_total) 'BEGIN{print var1 / var2 * 100}' )

And I'm stuck at this point. My script calculates kpi total and region value correctly but it is creating some error during the calculation of percentage.

It is generating an output like nun. I know this is not easiest way to resolve this, so I wanted to take the opinions of the others.

What's your expected output for above input ?

Well for example I want to first sum all Total Subs values in all of the 4 region for the city. Then calculate the percentage of this region's Total Subs value over the city total ( 4 regions) Do same thing for each kpi, each region. And finally I want to see an output like:

Miami 

KPI Name                           Region1 Region2 Region3 Region4
Total Subs                              20%       30%      40%       10%
VLAN Count                            15%       40%      40%         5%
Subs Segmentation                80%       5%        5%          10%
.
.
.
.
Chicago

KPI Name                           Region1 Region2 Region3 Region4
Total Subs                              20%       30%      40%       10%
VLAN Count                            15%       40%      40%         5%
Subs Segmentation                80%       5%        5%          10%


And so on.

if your real file contains region like this try , I just pasted same contents 4 times in 2 file and tested

$ cat file_test1.tmp
Total Subs : 0
VLAN Count : 28129501
Subs Segmentation : 28142
ARPU Segmentation : 0
RTT Delay : 3096610
IPv4 Fragmented Count : 2809853525
IPv4 Non Fragmented Count : 2809853525
Call Drop Rate : 1
Connection with Good voice quality : 45545345
Standalone Dedicated Control Channel (SDCCH) Congestion : 0

Total Subs : 0
VLAN Count : 28129501
Subs Segmentation : 28142
ARPU Segmentation : 0
RTT Delay : 3096610
IPv4 Fragmented Count : 2809853525
IPv4 Non Fragmented Count : 2809853525
Call Drop Rate : 1
Connection with Good voice quality : 45545345
Standalone Dedicated Control Channel (SDCCH) Congestion : 0

Total Subs : 0
VLAN Count : 28129501
Subs Segmentation : 28142
ARPU Segmentation : 0
RTT Delay : 3096610
IPv4 Fragmented Count : 2809853525
IPv4 Non Fragmented Count : 2809853525
Call Drop Rate : 1
Connection with Good voice quality : 45545345
Standalone Dedicated Control Channel (SDCCH) Congestion : 0



Total Subs : 0
VLAN Count : 28129501
Subs Segmentation : 28142
ARPU Segmentation : 0
RTT Delay : 3096610
IPv4 Fragmented Count : 2809853525
IPv4 Non Fragmented Count : 2809853525
Call Drop Rate : 1
Connection with Good voice quality : 45545345
Standalone Dedicated Control Channel (SDCCH) Congestion : 0
$ cp file_test1.tmp file_test2.tmp 
$ ls *.tmp -1
file_test1.tmp
file_test2.tmp
awk -F":" ' 

function stat(){
        # print file being used
        print p RS 

        for(i=1;i<=j;i++){
                if(C[i,1]){

                            # If divisible by zero then zero
                            per = C[i,2] == 0 ? "0" : C[C[i,1],3]/(C[i,2]*100)

                            X[C[i,1]] = X[C[i,1]] ? X[C[i,1]] OFS per"%" : per"%"
                    
                            # Variable for future use to print parameter in same order
                             stop = split(X[C[i,1]],Z,OFS) == 1 ?  i : stop 
                           }
                        }

        # Count how many regions are there in file 
        region = split(X[C[i-1,1]],Z,OFS)

        for(i=1;i<=region;i++){
                               printf i==1 ? "Parameter" OFS "Region"i OFS : \
                               i < region ? "Region"i OFS : "Region"i RS
                              }
        
           for(i=1;i<=stop;i++)
                    print C[i,1],X[C[i,1]];print RS
        
                   }

      FNR==1{
               if(NR != 1){
                
                          # Call for each file 
                          stat()

                          # Reset for each file
                          j = region = stop = 0
                         delete X ;  delete C; delete Z
                           }
              p = FILENAME
            } 

        {
             # Stores Parameter
             C[++j,1] = $1

             # Stores Parameter Value
             C[j,2]   = $2

             # Sum of each Parameter
             C[C[j,1],3] += C[j,2]
        }

     END{
          stat()
        }
    ' OFS=\, *.tmp

file_test1.tmp

Parameter,Region1,Region2,Region3,Region4
Total Subs ,0%,0%,0%,0%
VLAN Count ,0.04%,0.04%,0.04%,0.04%
Subs Segmentation ,0.04%,0.04%,0.04%,0.04%
ARPU Segmentation ,0%,0%,0%,0%
RTT Delay ,0.04%,0.04%,0.04%,0.04%
IPv4 Fragmented Count ,0.04%,0.04%,0.04%,0.04%
IPv4 Non Fragmented Count ,0.04%,0.04%,0.04%,0.04%
Call Drop Rate ,0.04%,0.04%,0.04%,0.04%
Connection with Good voice quality ,0.04%,0.04%,0.04%,0.04%
Standalone Dedicated Control Channel (SDCCH) Congestion ,0%,0%,0%,0%


file_test2.tmp

Parameter,Region1,Region2,Region3,Region4
Total Subs ,0%,0%,0%,0%
VLAN Count ,0.04%,0.04%,0.04%,0.04%
Subs Segmentation ,0.04%,0.04%,0.04%,0.04%
ARPU Segmentation ,0%,0%,0%,0%
RTT Delay ,0.04%,0.04%,0.04%,0.04%
IPv4 Fragmented Count ,0.04%,0.04%,0.04%,0.04%
IPv4 Non Fragmented Count ,0.04%,0.04%,0.04%,0.04%
Call Drop Rate ,0.04%,0.04%,0.04%,0.04%
Connection with Good voice quality ,0.04%,0.04%,0.04%,0.04%
Standalone Dedicated Control Channel (SDCCH) Congestion ,0%,0%,0%,0%


if you need tab separated file then at the end change OFS=\, to OFS=\\t , if you are trying on Solaris/Sun OS use nawk

Hi Akshay;

I first appreciate for long answer and code.

My real file doesn't contain any region field. It's just in the name of file.

For example: file_miami_1 is region 1, file_miami_2 is region 2, file_miami_3 region 3 and file_miami_4 is region 4. For each city there is 4 region, this number is constant. And the kpi percentage for each region should be something like kpi1_city1_region1 / ( kpi1_city1_region1 + kpi1_city1_region2 + kpi1_city1_region3 + kpi1_city1_region4 ) * 100

and this number is let say kpi1_city1_region1_percentage

For each kpi I want each kpi is displayed on the same line like:

City1 Region1 Percentages:

KPI Name Reg1% Reg2% Reg3% Reg4%
Total Subs 0% 10% 20% 70%
Vlan Count 10% 20% 20% 60%

..... (for all kpi's)

And then same thing for city1 region2, region3, region4 (a separate table for each city_region)

Then the next city city2_region1, city2_region2, city2_region3, city2_region4

until

city25_region1, city25_region2, city_25_region3, city25_region4

So the program should display in total of 100 percentage table for 25 cities, 4 regions.

This is what your code does?

Sorry I can't spend more time on guessing, your input.Others might answer you.

Thank you so much :b: