Help to optimize script running time

Dear Forum experts

I have the script below, which I run under a bash shell. It works fine for a small number of records, say 100,000, but when I run it over the full data set (3,000,000 records) it takes hours.

Can you please suggest anything to optimize it, or a different way to run it?

{OFS="|"; FS=";"; n=split("ZainEazy;Ezlink;EzlinkDuo;ZainThawani;Day;Night;Zain_Super;Zony;BlineGovernment;Army;Bline;ZainS7abak;ZainCallAsia;FlateRate;ZainFurat;Ziyarah;SubDealers;Zain5;Zain5NOSuperNO;Zain5Disc;ZainElKul;GelnaZBonus;NonOfficialBLGvD;Visitors;StaffLine;OfficialBlineGov;Mo7afazat;Zain5Xtra;Zain5XtraNoSupNo;Zain5XtraDisc;Zain500;Zain500NOSuperNO;Zain500Disc;ZainNile;Jaishana;Aqaba;Ayla;ZainQuattro", arr,";")}

{

if (substr($2,7, 1)%2==0){
Subs_IN4[$1]+=1;
SubStr_IN4[$2]=2;
for ( i=57 ; i<= NF; i++ ) {SubStr_IN4[$2]=SubStr_IN4[$2]";"$i};

if (SubStr_IN4[$2] ~ /FnF_1/){ split($34,a,"|");
for ( i=0 ; i<= 9; i++ )
{if (a ~ /00/){ FnFGroup1_IN4[$1]=FnFGroup1[$1]+1}}}else {FnFGroup1[$1]="NA"};

if (SubStr_IN4[$2] ~ /FnF_2/){ split($37,b,"|");
for ( i=0 ; i<= 9; i++ )
{if (b ~ /00/){ FnFGroup2_IN4[$1]=FnFGroup2[$1]+1}}}else {FnFGroup2[$1]="NA"};

if (SubStr_IN4[$2] ~ /FnF_3/){ split($39,c,"|");
for ( i=0 ; i<= 9; i++ )
{if (c ~ /00/){ FnFGroup3_IN4[$1]=FnFGroup3[$1]+1}}}else {FnFGroup3[$1]="NA"};


if($3 > 0 && $18=="TRUE"){POS_IN4[$1]+=1; SOLD_IN4[$1]+=1}
   else{if($3 == 0 && $18=="TRUE"){ZERO_IN4[$1]+=1; SOLD_IN4[$1]+=1}
            else{if($18=="FALSE"){NOTSOLD_IN4[$1]+=1}}}

        if ($18=="TRUE" && $7=="Active"){ACTIVE_IN4[$1]+=1}
        else{if ($18=="TRUE" && $7=="IncomingCallsOnly"){GRACE_IN4[$1]+=1}
        else{if ($18=="TRUE" && $7=="RechargeOnly"){RECHARGE_IN4[$1]+=1}
                else{if ($18=="TRUE" && $7=="Transient"){TRANSIET_IN4[$1]+=1}}}}

        if ($7=="Active"){A_M_IN4[$1]+=$3; A_2ND_IN4[$1]+=$9; A_3RD_IN4[$1]+=$11;A_4TH_IN4[$1]+=$13}
        if ($7=="IncomingCallsOnly"){E_M_IN4[$1]+=$3; E_2ND_IN4[$1]+=$9; E_3RD_IN4[$1]+=$11;E_4TH_IN4[$1]+=$13}

        if ($9 > 0  ){Second_IN4[$1]+=1}
                                if ($11 > 0  ){Third_IN4[$1]+=1}
                                if ($13 > 0 ){Fourth_IN4[$1]+=1}
}

print arr"_IN4", Subs_IN4[arr], POS_IN4[arr], ZERO_IN4[arr], SOLD_IN4[arr], NOTSOLD_IN4[arr], TRANSIET_IN4[arr], ACTIVE_IN4[arr], GRACE_IN4[arr], RECHARGE_IN4[arr], Second_IN4[arr],Third_IN4[arr],Fourth_IN4[arr], A_M_IN4[arr], A_2ND_IN4[arr], A_3RD_IN4[arr], A_4TH_IN4[arr], E_M_IN4[arr], E_2ND_IN4[arr], E_3RD_IN4[arr], E_4TH_IN4[arr], FnFGroup1_IN4[arr], FnFGroup2_IN4[arr], FnFGroup3_IN4[arr]

print arr"_IN5", Subs_IN5[arr], POS_IN5[arr], ZERO_IN5[arr], SOLD_IN5[arr], NOTSOLD_IN5[arr], TRANSIET_IN5[arr], ACTIVE_IN5[arr], GRACE_IN5[arr], RECHARGE_IN5[arr], Second_IN5[arr],Third_IN5[arr],Fourth_IN5[arr], A_M_IN5[arr], A_2ND_IN5[arr], A_3RD_IN5[arr], A_4TH_IN5[arr], E_M_IN5[arr], E_2ND_IN5[arr], E_3RD_IN5[arr], E_4TH_IN5[arr], FnFGroup1_IN5[arr], FnFGroup2_IN5[arr], FnFGroup3_IN5[arr]

print arr"_TOTAL", Subs[arr_IN4]+Subs[arr_IN5], POS_IN4[arr]+POS_IN5[arr], ZERO_IN4[arr]+ZERO_IN5[arr], SOLD_IN4[arr]+SOLD_IN5[arr], NOTSOLD_IN4[arr]+NOTSOLD_IN5[arr], TRANSIET_IN4[arr]+TRANSIET_IN5[arr], ACTIVE_IN4[arr]+ACTIVE_IN5[arr], GRACE_IN4[arr]+GRACE_IN5[arr], RECHARGE_IN4[arr]+RECHARGE_IN5[arr], Second_IN4[arr]+Second_IN5[arr],Third_IN4[arr]+Third_IN5[arr],Fourth_IN4[arr]+Fourth_IN5[arr[i]], A_M_IN4[arr]+A_M_IN5[arr], A_2ND_IN4[arr]+A_2ND_IN5[arr], A_3RD_IN4[arr]+A_3RD_IN5[arr], A_4TH_IN4[arr]+A_4TH_IN5[arr], E_M_IN4[arr]+E_M_IN5[arr], E_2ND_IN4[arr]+E_2ND_IN5[arr], E_3RD_IN4[arr]+E_3RD_IN5[arr], E_4TH_IN4[arr]+E_4TH_IN5[arr], FnFGroup1_IN4[arr]+FnFGroup1_IN5[arr], FnFGroup2_IN4[arr]+FnFGroup2_IN5[arr], FnFGroup3_IN4[arr]+FnFGroup3_IN5[arr]
}

This is very hard to read. I assume it is an awk script. There is a lot of what seems to be repeated logic.

If you give us a few lines of sample:
input
expected output

We can probably help more effectively

Thank you, Jim.

Input data sample:

Zain500Disc;46464564;560;;0;0;Active;2011-02-04 22:59:00;0;1970-01-01 00:00:00;0;1970-01-01 00:00:00;0;1970-01-01 00:00:00;1970-01-01 00:00:00;2011-03-06 22:59:00;2011-06-05 22:59:00;TRUE;FALSE;0;0;0000;FALSE;TRUE;false;FALSE;0;true;0;true;true;TRUE;false;{00962795901649|00962796371949|00962796859686|00962795293754|00962796859676};0;TRUE;{00963966107669};0;TRUE;{};0;TRUE;;false;TRUE;FALSE;FALSE;2;FALSE;FALSE;;0;0;0;0;0;2;(FnF_1,-108,01.01.2025 23:59:59:999);(FnF_2,-108,01.01.2025 23:59:59:999);

ZainEazy;4646464;1;2000;0;0;Active;2016-09-10 22:59:00;0;2006-12-11 22:59:00;0;2009-05-26 22:59:00;0;1970-01-01 00:00:00;1970-01-01 00:00:00;2016-10-10 22:59:00;2017-01-09 22:59:00;TRUE;FALSE;0;0;0000;FALSE;TRUE;false;FALSE;0;true;0;true;true;TRUE;false;{};2;TRUE;{};0;TRUE;{};0;TRUE;;false;TRUE;FALSE;FALSE;2;FALSE;FALSE;;0;0;0;0;0;2;(FnF_1,-108,01.01.2025 23:59:59:999);

Jaishana;34535353;2776;;0;0;Active;2011-04-23 23:59:59;0;2006-03-07 23:59:00;0;2010-05-31 23:59:00;0;1970-01-01 00:00:00;1970-01-01 00:00:00;2011-05-23 23:59:59;2011-08-22 23:59:59;TRUE;FALSE;0;0;0000;FALSE;TRUE;false;FALSE;0;true;0;true;true;TRUE;false;{};2;TRUE;{};0;TRUE;{};0;TRUE;102;TRUE;TRUE;FALSE;FALSE;2;FALSE;FALSE;;0;0;0;0;0;2;(FnF_1,-108,01.01.2025 23:59:59:999);(CUG,-128,01.01.2025 00:00:00:000);

ZainQuattro;43534535;6406;4000;0;0;Active;2011-01-14 22:59:00;0;1970-01-01 00:00:00;0;2010-04-26 23:59:00;0;1970-01-01 00:00:00;1970-01-01 00:00:00;2011-02-13 22:59:00;2011-05-15 22:59:00;TRUE;FALSE;0;0;0000;FALSE;TRUE;false;FALSE;0;true;0;true;true;TRUE;false;{00962795500047|00962795600207|00962799106309|00962795782960};0;TRUE;{00963941950270|00963947278825|00963966531175};0;TRUE;{};0;TRUE;;false;TRUE;FALSE;FALSE;2;FALSE;FALSE;;0;0;0;0;0;2;(FnF_1,-108,01.01.2025 23:59:59:999);(FnF_2,-108,01.01.2025 23:59:59:999);

This is a sample of the input data. The program simply groups lines and counts them based on field 1, which is used as the index for all the arrays.
It's in awk... but I don't know why it's so slow, even though it's reasonably fast for a low number of records!

With an awk script that long it's hard to tell what you're doing.

If all you really want to do is count uses of the first column:

BEGIN { FS=";" }

{        if(length($0) > 0)
                count[$1]++;
}

END {
        for(keys in count)
                print keys ":" count[keys];
}

For your input data, this prints:

Zain500Disc:1
ZainEazy:1
ZainQuattro:1
Jaishana:1

---------- Post updated at 08:09 PM ---------- Previous update was at 07:48 PM ----------

It's hard to "optimize" huge amounts of logic since the slowdown may not be in one important place but in the logic itself. "optimizing" it means pretty much replacing it. Here I would use sort:

BEGIN { FS=";" ; count=1 ; cur=""; }

{
        if(length($0) > 0)
        {
                if(cur == $1)   count++;
                else
                {
                        if(length(cur) > 0)
                                print cur " had " count "\n";

                        cur=$1;         count=1;
                }

                print $0;
        }
}

END {   if(length(cur) > 0)     print cur " had " count "\n";   }
sort < input | awk -f count.awk

That way, you get your records already grouped and just have to count when things change.
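The sort-then-count idea above can be sketched compactly like this (the file name `/tmp/sample.txt` and the trimmed three-field records are made up for the demo, not taken from the thread):

```shell
# Compact sketch of the sort-then-count approach: sort groups identical
# keys together, so the awk pass only compares each key to the previous one.
printf 'ZainEazy;111;5\nJaishana;222;7\nZainEazy;333;9\n' > /tmp/sample.txt

result=$(sort -t ';' -k1,1 /tmp/sample.txt | awk -F';' '
    $1 != cur { if (cur != "") print cur " had " count; cur = $1; count = 0 }
    { count++ }
    END { if (cur != "") print cur " had " count }')
printf '%s\n' "$result"
```

This avoids holding per-key arrays in memory at all; the cost moves into the external sort, which handles large files well.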

I would go with Corona's approach. Even for 3,000,000 records the array example (the first code bit) is fine.
We do that every day with 1M-record files. It takes 30 seconds on a Solaris V445.

I don't know exactly what you want to achieve, but commands like the following could help a bit:

awk -F";" '{print$1}' input | sort | uniq -c
sort -t ";" -k 1 input
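A hypothetical run of the first one-liner on a small stand-in file (the name `/tmp/plans.txt` and the short records are assumptions for the demo):

```shell
# Count occurrences of field 1 with the suggested awk | sort | uniq -c pipeline.
printf 'ZainEazy;a;b\nJaishana;c;d\nZainEazy;e;f\n' > /tmp/plans.txt

# uniq -c pads its counts with spaces, so a final awk pass normalizes the output
counts=$(awk -F';' '{print $1}' /tmp/plans.txt | sort | uniq -c | awk '{print $1, $2}')
printf '%s\n' "$counts"
```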

This script does not make a whole lot of sense. The array arr is re-initialized on every input line, but that could be done once in a BEGIN block. It is then only ever referenced as arr[i], yet i never gets set explicitly; sometimes, by accident, it holds the value 10 because it was used as the counter in a for loop that may or may not have run, depending on the if conditions. So arr[10] gets referenced, which produces "Army".
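To illustrate that last point, here is a minimal sketch of the suggested restructuring (not the full original logic; the plan list is shortened and the counters reduced to a single Subs count): do the split once in BEGIN, accumulate per-plan counters per record, and iterate the plan list with an explicit index in END.

```shell
# Sketch only: demonstrates BEGIN-time split and an explicit END loop index.
cat > /tmp/data.txt <<'EOF'
ZainEazy;1;Active
Jaishana;2;Active
ZainEazy;3;Active
EOF

result=$(awk -F';' '
BEGIN {
    OFS = "|"
    # split once here instead of re-splitting on every input line
    n = split("ZainEazy;Ezlink;Jaishana", plan, ";")
}
{ subs[$1]++ }                     # field 1 is the grouping key
END {
    for (i = 1; i <= n; i++)       # the loop index is set explicitly
        print plan[i] "_TOTAL", subs[plan[i]] + 0
}' /tmp/data.txt)
printf '%s\n' "$result"
```

The `+ 0` forces plans that never appeared in the input to print as 0 instead of an empty string.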