Improve script

jiam912 · November 18, 2016, 5:55am

Gents,

Is there the possibility to improve this script to be able to have same output information.

I did this script, but I believe there is a very short code to get same output

here my script

awk -F, '{if($10>0 && $10<=15) print $6}' tmp1 | sort -k1n | awk '{a[$1]++} END { for (n in a ) print n, a[n]}' tmp1 > tmp2
awk -F, '{if($10>15.01 && $10<=25) print $6}' tmp1 | sort -k1n | awk '{a[$1]++} END { for (n in a ) print n, a[n]}' tmp1 > tmp3
awk -F, '{if($10>25.01 && $10<=35) print $6}' tmp1 | sort -k1n | awk '{a[$1]++} END { for (n in a ) print n, a[n]}'tmp1 > tmp4
awk -F, '{if($10>35.01 && $10<=50) print $6}' tmp1 | sort -k1n | awk '{a[$1]++} END { for (n in a ) print n, a[n]}' tmp1 > tmp5
awk -F, '{if($10>50.01 print $6}' tmp1 | sort -k1n | awk '{a[$1]++} END { for (n in a ) print n, a[n]}' tmp1 > tmp6

clear

awk 'NR==FNR{a[$1]=$2;next}
    {if($1 in a){print $0,a[$1];delete a[$1]}
        else print $0,"0"}
    END{for(x in a)print x,"0",a[x]}' tmp2 tmp3 | sort -k1n > tmp1a

awk 'NR==FNR{a[$1]=$2;next}
    {if($1 in a){print $0,a[$1];delete a[$1]}
        else print $0,"0"}
    END{for(x in a)print x,"0",a[x]}' tmp2 tmp4 | sort -k1n > tmp2a


awk 'NR==FNR{a[$1]=$2;next}
    {if($1 in a){print $0,a[$1];delete a[$1]}
        else print $0,"0"}
    END{for(x in a)print x,"0",a[x]}' tmp2 tmp5 | sort -k1n > tmp3a

awk 'NR==FNR{a[$1]=$2;next}
    {if($1 in a){print $0,a[$1];delete a[$1]}
        else print $0,"0"}
    END{for(x in a)print x,"0",a[x]}' tmp2 tmp6 | sort -k1n > tmp4a


paste tmp1a tmp2a tmp3a tmp4a > tmp1b

awk '{
	printf("%5s%10s%10s%10s%10s%10s\n",
		$1,$3,$2,$5,$8,$11)}' tmp1b > tmp10

the output i got:

    2      1277       123        10         1         0
    3       678        16         0         0         0
    6      1344        14         2         0         0
    8      1320         5         1         0         0
    9      1349         3         0         0         0
   10       880        34         3         0         0
   11      1380        28         0         0         0
   12       967        31         0         0         0
   13      1094         6         1         0         0
   14       832        27         1         0         0
   15      1103        33         1         0         0
   16      1485        14         1         0         0
   17       716       651        10         0         0
   18       959        16         2         0         0
   19      1493         7         1         0         0
   20      1234        28         0         0         0
   21       350         9         0         0         0
   22      1213       177        13         0         0
   24      1324        25         0         0         0
   25       276      1081         5         0         0
   26      1389        39         4         0         0
   27       282      1030         3         0         0
   28      1251        73         7         1         0

input file attached. ( I have removed some lines from the file to short the size )

Thanks for your help.

Corona688 · November 18, 2016, 10:23am

Would you mind explaining what you're actually doing? I can see that the first 5 awk/sort/awk/kitchen/sinks are separating and counting ranges but aren't sure about the rest.

jiam912 · November 18, 2016, 10:45am

Hi.
Based in the column 6 i will like to count how many times are repeated values in the column 10 for different ranges.. There 5 ranges :

$10>0 && $10<=15
$10>15.01 && $10<=25
$10>25.01 && $10<=35
$10>35.01 && $10<=50
$10>50.01

Corona688 · November 18, 2016, 11:20am

But your script doesn't check for repeats in column 10? I don't understand.

I also don't understand what the second half of your script is for.

jiam912 · November 18, 2016, 11:34am

Dear Corona688.
There is to conditions...
Count the values in column10 for each value in column 6 and split according to the 5 conditions.

The second part of the script is to merge the conditions in single output.

RudiC · November 18, 2016, 2:16pm

Although what I see is way beyond my grasp, I'd almost bet that the entire script can be done in a single awk command. Please explain step by step, seconded by (interim) data, what the processing should look like.

jiam912 · November 18, 2016, 2:55pm

Hi RudiC,

Yes, I believe the output generated with all my script can be done with only single awk command,. but i cant :).

Exactly what i wan is using the file attached.

In column 10 have some values which I want to separate as columns using the 5 conditions.

$10>0 && $10<=15
$10>15.01 && $10<=25
$10>25.01 && $10<=35
$10>35.01 && $10<=50
$10>50.01

That means i need to have 5 columns.

The other step is to count how many times for each value in column 6 has each range .. ( FYI , to be able to upload the file i reduce the lines in the input file.. then, the output results can change )

example
value 2 in column 6 has five columns which correspond to each condition

      >0-<=15   >15.01-<=25    >25.01-<=35   >35.01-<=50   >50.01
2       1277         123              10            1          0

Then, i should get something like this:

      >0-<=15   >15.01-<=25    >25.01-<=35   >35.01-<=50   >50.01
 2      1277        123             10            1         0
 3       678         16              0            0         0
 6      1344         14              2            0         0
 8      1320          5              1            0         0
 9      1349          3              0            0         0
10       880         34              3            0         0
11      1380         28              0            0         0
12       967         31              0            0         0
13      1094          6              1            0         0
14       832         27              1            0         0
15      1103         33              1            0         0
16      1485         14              1            0         0
17       716        651             10            0         0
18       959         16              2            0         0
19      1493          7              1            0         0
20      1234         28              0            0         0
21       350          9              0            0         0
22      1213        177             13            0         0
24      1324         25              0            0         0
25       276       1081              5            0         0
26      1389         39              4            0         0
27       282       1030              3            0         0
28      1251         73              7            1         0

appreciate your help.

RudiC · November 18, 2016, 3:16pm

OK, ... try

awk -F, '
BEGIN   {CNTLM = split ("15,25,35,50", LMT)
#        LMT[CNTLM+1] = "50+"
        }

        {LNC[$6]++
#        TOT++
         for (IX=1; IX<=CNTLM; IX++) if ($10 <= LMT[IX]) break
         CNT[$6,IX]++ 
        }


END     {for (l in LNC)
                {printf "%d", l
                 for (IX=1; IX<=CNTLM+1; IX++) printf "\t%d", CNT[l,IX]+0
                 printf RS
#                TOT2 += LNC[l]
                }
#        print TOT, TOT2
        }
'  tmp1
10      752     24      3       0       0
12      969     31      0       0       0
13      970     4       0       0       0
14      369     7       1       0       0
22      1036    169     12      0       0
15      920     19      0       0       0
24      1157    24      0       0       0
18      771     6       1       0       0
28      1100    48      7       1       0
2       1093    118     9       1       0
8       1197    3       1       0       0
9       1182    3       0       0       0

This is based on the assumption that

there are no negative $10 values as the lower limit 0 is not tested
no $10 value falls into the gap between, say, 50 and 50.01, as all these would not be counted in your script

For a reasonable header, you may want to add

	 printf "FLD6"
	 for (IX=1; IX<=CNTLM; IX++) printf "\t<=%s", LMT[IX]
	 printf "\t>%d" RS, LMT[--IX]

to the front of the END section.

jiam912 · November 18, 2016, 3:28pm

Amazing.. really appreciate.
Kindly, It is possible to explain the code..

RudiC · November 18, 2016, 3:43pm

awk -F, '
BEGIN   {CNTLM = split (CATS, LMT)                                      # prepare the limits array for the categories given as a script parameter below
        }

        {LNC[$6]++                                                      # line count of $6 occurrences, also used for later printing the $6 values 
         for (IX=1; IX<=CNTLM; IX++) if ($10 <= LMT[IX]) break          # calculate the column index based on a comparison between $10 and the limits array
         CNT[$6,IX]++                                                   # increment the counter for $6 value and category index (if $10 greater than 50, IX == 5)
        }


END     {printf "FLD6"                                                  # create header's first element
         for (IX=1; IX<=CNTLM; IX++) printf "\t<=%s", LMT[IX]           # and further elements i.e. less or equal category limits
         printf "\t>%d" RS, LMT[--IX]                                   # last: greater than largest limit

         for (l in LNC)                                                 # now: create lines for each $6 value (taken from LNC array)
                {printf "%d", l                                         # print the $6 value as the leftmost column
                 for (IX=1; IX<=CNTLM+1; IX++) printf "\t%d", CNT[l,IX] # print the respective category's count for this $6 value
                 printf RS                                              # new line
                }
        }
' CATS="15,25,35,50" tmp1                                               # supply the categories and the input file

EDIT:
Small error, sorry: Defining a variable as a parameter AFTER the script makes it unavailable in the BEGIN section. Either define it using the -v option before the script, or replace the BEGIN with the NR==1 pattern.

jiam912 · November 19, 2016, 5:15am

Dear RudiC.

Thanks a lot for the explanation.. really appreciate that.

Is there the option adapt the script to count even negative values in column $10?.

RudiC · November 19, 2016, 5:47am

How about trying to introduce negative limits in the CATS variable?

awk -F, -v CATS="-10,-5,0,15,25,35,50" '
.
.
.
FLD6    <=-10   <=-5    <=0     <=15    <=25    <=35    <=50    >50
2       0       1       2       1090    118     9       1       0
8       2       0       4       1191    3       1       0       0
.
.
.
22      1       0       2       1033    169     12      0       0

jiam912 · November 22, 2016, 6:12am

Thanks a lot RudiC

it is possible to add the sum for all columns ans also the sum for all rows.

---------- Post updated at 06:12 AM ---------- Previous update was at 06:10 AM ----------

awk '{ 
  TR=0
  for( I = 1; I <= NF; I++ ) {
    TR += $I
    TC += $I
    printf( "%6d", $I )
  }
  print "  = " TR
  TF = NF
}

END {
  for( I = 1; I <= TF; I++ )
    printf "%6d", TC
  print ""
}

' file

RudiC · November 22, 2016, 8:42am

In post#8, the total for the entire file was already in, albeit commented out. Why don't you - just for the exercise - try to implement your new request and post it here so we can discuss if need be?

jiam912 · November 23, 2016, 12:50pm

RudiC
Finally I did something to get the desired output it is long but works ...
here the script.

#!/bin/bash
dd=56
awk -F, -v CATS="-10,-5,0,15,25,35,50" '
NR==1   {CNTLM = split (CATS, LMT)                                      
        }
        {LNC[$6]++                                                      
         for (IX=1; IX<=CNTLM; IX++) if ($10 <= LMT[IX]) break          
         CNT[$6,IX]++                                                   
        }  
END     {printf "FLD6"                                                  
         for (IX=1; IX<=CNTLM; IX++) printf "%6d", LMT[IX]
         printf "" RS                          
           for (l in LNC)                                                 
                {printf "%4d", l                                         
                 for (IX=1; IX<=CNTLM; IX++) printf "%6d", CNT[l,IX] +0
                 printf RS                                         
                }
        }
' test1 > tmp2

awk 'NR>1{print}' tmp2 | sort -k1n > tmp3

#header
awk '
 NR==1{print}' tmp2 > header 
awk     'BEGIN  {DASH=sprintf("%'$dd's",_); gsub(/ /,"-",DASH); print DASH}' >> header 

#sum columns
awk '
 {for (i=2;i<=NF;i++) a+=$i} 
  END{for (i=2;i<=NF;i++) printf"%6s", a}' tmp3 |
  awk '{print "TbyH"$0}' > tmp4 

cat tmp3 tmp4 > tmp5 
awk 'NR>1{for(i=1;i<=NF;i++) t+=$i; print $0 "  =  " t; t=0 }' tmp5 > tmp6

awk     'BEGIN  {DASH=sprintf("%'$dd's",_); gsub(/ /,"-",DASH); print DASH}' >> tmp6

#count columns 

awk '
      {for(i=2;i<=NF;i++) a+=$i>0}
      END{for(i=2;i<=NF;i++)printf"%6s", a}' tmp3 |
  awk '{print "TViU"$0}' | awk 'END{print}' > tmp7 
awk     'BEGIN  {DASH=sprintf("%'$dd's",_); gsub(/ /,"-",DASH); print DASH}' >> tmp7
   
cat  header tmp6 tmp7 > out.txt
sed -i '/TbyH/ i --------------------------------------------------------' out.txt
rm -f *tmp*

the output here:

FLD6   -10    -5     0    15    25    35    50     TbyV
--------------------------------------------------------
   2     0     0     2  1091   118     9     1  =  1223
   8     0     0     4  1193     3     1     0  =  1209
   9     0     0     2  1180     3     0     0  =  1194
  10     0     0     2   750    24     3     0  =  789
  12     0     0     2   967    31     0     0  =  1012
  13     0     0     3   967     4     0     0  =  987
  14     0     0     0   369     7     1     0  =  391
  15     0     0     4   916    19     0     0  =  954
  18     0     0     2   769     6     1     0  =  796
  22     0     0     2  1034   169    12     0  =  1239
  24     0     0     3  1154    24     0     0  =  1205
  28     0     0     0  1100    48     7     1  =  1184
--------------------------------------------------------
TbyH     0     0    26 11490   456    34     2  =  12008
--------------------------------------------------------
TViU     0     0    10    12    12     7     2
--------------------------------------------------------

please try to short the script to get the desired output.

Thanks a lot.

RudiC · November 23, 2016, 2:26pm

How about

awk -F, '
BEGIN   {CNTLM = split ("-10,-5,0,15,25,35,50", LMT)
#        LMT[CNTLM+1] = "50+"
         cmdsort = "sort -n"
         DASHES = sprintf ("%0*d", 8*(CNTLM+3), _)
         gsub (/0/, "-", DASHES)
        }

        {for (IX=1; IX<=CNTLM; IX++) if ($10 <= LMT[IX]) break
         CNT[$6,IX]++ 
         COLTOT[IX]++
         LNC[$6]++
         TOT++
         }

END     {printf "FLD6"
         for (IX=1; IX<=CNTLM; IX++) printf "\t<=%s", LMT[IX]
         printf "\t>%d\t  TbyV" RS, LMT[--IX]
         print DASHES

         for (l in LNC)
                {printf "%d", l | cmdsort
                 for (IX=1; IX<=CNTLM+1; IX++)  {printf "\t%d", CNT[l,IX]+0 | cmdsort
                                                 NNUL[IX] += CNT[l,IX]>0
                                                }
                 printf "\t= %d" RS, LNC[l] | cmdsort
                }
         close (cmdsort)

         print DASHES
         printf "TbyH"
         for (IX=1; IX<=CNTLM+1; IX++) printf "\t%d", COLTOT[IX]+0
         printf "\t= %d" RS, TOT

         print DASHES
         printf "TViU"
         for (IX=1; IX<=CNTLM+1; IX++) printf "\t%d", NNUL[IX]
         printf RS
        }
'  tmp1
FLD6	<=-10	<=-5	<=0	<=15	<=25	<=35	<=50	>50	  TbyV
--------------------------------------------------------------------------------
2	0	0	2	1091	118	9	1	0	= 1221
8	0	0	4	1193	3	1	0	0	= 1201
9	0	0	2	1180	3	0	0	0	= 1185
10	0	0	2	750	24	3	0	0	= 779
12	0	0	2	967	31	0	0	0	= 1000
13	0	0	3	967	4	0	0	0	= 974
14	0	0	0	369	7	1	0	0	= 377
15	0	0	4	916	19	0	0	0	= 939
18	0	0	2	769	6	1	0	0	= 778
22	0	0	2	1034	169	12	0	0	= 1217
24	0	0	3	1154	24	0	0	0	= 1181
28	0	0	0	1100	48	7	1	0	= 1156
--------------------------------------------------------------------------------
TbyH	0	0	26	11490	456	34	2	0	= 12008
--------------------------------------------------------------------------------
TViU	0	0	10	12	12	7	2	0

jiam912 · November 24, 2016, 5:59am

Thanks many thanks....

It is awesome... Appreciate your help..

---------- Post updated 11-24-16 at 05:59 AM ---------- Previous update was 11-23-16 at 03:15 PM ----------

Hi RudiC,

it is possible to replace only in the selection (values in blue). Values which are = 0 using this code.

awk '{gsub(" 0"," -")}'

.
I was trying to adapt it to the script but i cant.

FLD6	<=-10	<=-5	<=0	<=15	<=25	<=35	<=50	>50	  TbyV
--------------------------------------------------------------------------------
2	0	0	2	1091	118	9	1	0	= 1221
8	0	0	4	1193	3	1	0	0	= 1201
9	0	0	2	1180	3	0	0	0	= 1185
10	0	0	2	750	24	3	0	0	= 779
12	0	0	2	967	31	0	0	0	= 1000
13	0	0	3	967	4	0	0	0	= 974
14	0	0	0	369	7	1	0	0	= 377
15	0	0	4	916	19	0	0	0	= 939
18	0	0	2	769	6	1	0	0	= 778
22	0	0	2	1034	169	12	0	0	= 1217
24	0	0	3	1154	24	0	0	0	= 1181
28	0	0	0	1100	48	7	1	0	= 1156
--------------------------------------------------------------------------------
TbyH	0	0	26	11490	456	34	2	0	= 12008
--------------------------------------------------------------------------------
TViU	0	0	10	12	12	7	2	0

RudiC · November 24, 2016, 6:10am

Show your adaption so we can discuss. Please note there's no such string " 0" in the output so your gsub must fail.

And - DON'T pipe through a new awk script - it can and should be done in the original script.

jiam912 · November 27, 2016, 11:41am

RudiC,
I add the average at the script for all columns.

          print DASHES
         printf "TbyH"
         for (IX=1; IX<=CNTLM+1; IX++) printf "\t%2.1f", COLTOT[IX]/TOT*100

Please can you add min and max for all columns... Please

---------- Post updated at 11:41 AM ---------- Previous update was at 11:37 AM ----------

I found this code... But will be nice if you add this output in your code..

 awk 'NR==1{for (i=2; i<=NF; i++) {min=$i}; next}
     {for (i=2; i<=NF; i++) { min=(min>$i?$i:min); max=(max<$i?$i:max) }}
     END {printf "min: "; for (i=1;i<=NF;i++) printf "%g %s", min, (i==NF?"\n":" "); 
          printf "max: "; for (i=1;i<=NF;i++) printf "%g %s", max, (i==NF?"\n":" ")}'

RudiC · November 27, 2016, 2:36pm

I had the impression that you might have understood the little program, seconded by the proposal you made for the average (except for the too small float format). What keeps you from applying the same structure to the min/max problem? And, what about the dash replacement for the zeroes?