Problem with getting awk to multiply a field by a value set based on condition of another field

cotilloe · February 2, 2020, 1:21pm

Hi,

So awk is driving me crazy on this one. I have searched everywhere and read man, docs and every related post Google can find and still no luck. The actual files I need to run this on are sensitive in nature, but it is the same thing as if I needed to calculate weighted grades for multiple students using all assignments as input. (Sample below)

The input file is a CSV. I took all of the assignment names from $3 and calculated the highest grade, lowest grade and average grade for each one. The code I used to do this is:

awk -F , 'NR>1 { if(!($3 in course)) { low[$3] = high[$3] = $4 }
        if ($4 < low[$3]) low[$3] = $4;
        if ($4 > high[$3]) high[$3] = $4;
        sum[$3] += $4;
        ++course[$3] }
    END { OFS="\t"; print "Name", "Low", "High", "Avg";
        for (k in course)
          printf "%s\t%d\t%d\t%.2f\n",  k, low[k], high[k], sum[k]/course[k] }' data.csv

That gave me the desired output. Using the same input file, I now want to group by student name and then give a weight to each assignment grade.

Homework = 0.10
Lab = 0.30
Quiz = 0.40
Final = 0.15
Survey = 0.05

So the logic is this: if $2 = "Homework" then $4 = $4*0.10, and so on for each assignment category. Then all I need to do is sum all of the $4 values for each student individually. On top of this, I also need to assign a letter grade based on each student's total class grade, and I am not sure how to proceed there at all.

However, I cannot make this happen and get all kinds of goofy results, including somehow printing the output headers several times with nothing else, even though they are set before anything else and are nowhere near a loop. So, I am totally confused. Here is the most recent failure:

awk -F, '{print "Name\tPercent\tGrade\n"}
NR>1{for(i=1;i<=NR;i++)
         { if ( $2 == "Quiz" ) w=0.4 ;
           if ( $2 == "Lab" ) w=0.3 ;
           if ( $2 == "Homework" ) w=0.1 ;
           if ( $2 == "Final" ) w=0.15 ;
           if ( $2 == "Survey" ) w=0.15 ;
      } }
END {
      a[$1]=$2;
      b[$4]=$4*$w
      for (k in a) printf "%s\t%d\n", k, a[k] ;
       }' data.csv

So, if anyone can help me out here, I would appreciate it.

The Desired output is:

Name Percent Letter Grade

INPUT FILE:
data.csv

Student	Category	Assignment	Score	Possible
Chelsey	Final	FINAL	82	100
Sam	Final	FINAL	58	100
Andrew	Final	FINAL	99	100
Ava	Final	FINAL	99	100
Shane	Final	FINAL	90	100
Chelsey	Homework	H01	90	100
Chelsey	Homework	H02	89	100
Chelsey	Homework	H03	77	100
Chelsey	Homework	H04	80	100
Chelsey	Homework	H05	82	100
Chelsey	Homework	H06	84	100
Chelsey	Homework	H07	86	100
Sam	Homework	H01	19	100
Sam	Homework	H02	82	100
Sam	Homework	H03	95	100
Sam	Homework	H04	46	100
Sam	Homework	H05	82	100
Sam	Homework	H06	97	100
Sam	Homework	H07	52	100
Andrew	Homework	H01	25	100
Andrew	Homework	H02	47	100
Andrew	Homework	H03	85	100
Andrew	Homework	H04	65	100
Andrew	Homework	H05	54	100
Andrew	Homework	H06	58	100
Andrew	Homework	H07	52	100
Ava	Homework	H01	55	100
Ava	Homework	H02	95	100
Ava	Homework	H03	84	100
Ava	Homework	H04	74	100
Ava	Homework	H05	95	100
Ava	Homework	H06	84	100
Ava	Homework	H07	55	100
Shane	Homework	H01	50	100
Shane	Homework	H02	60	100
Shane	Homework	H03	70	100
Shane	Homework	H04	60	100
Shane	Homework	H05	70	100
Shane	Homework	H06	80	100
Shane	Homework	H07	90	100
Chelsey	Lab	L01	91	100
Chelsey	Lab	L02	100	100
Chelsey	Lab	L03	100	100
Chelsey	Lab	L04	100	100
Chelsey	Lab	L05	96	100
Chelsey	Lab	L06	80	100
Chelsey	Lab	L07	81	100
Sam	Lab	L01	41	100
Sam	Lab	L02	85	100
Sam	Lab	L03	99	100
Sam	Lab	L04	99	100
Sam	Lab	L05	0	100
Sam	Lab	L06	0	100
Sam	Lab	L07	0	100
Andrew	Lab	L01	87	100
Andrew	Lab	L02	45	100
Andrew	Lab	L03	92	100
Andrew	Lab	L04	48	100
Andrew	Lab	L05	42	100
Andrew	Lab	L06	99	100
Andrew	Lab	L07	86	100
Ava	Lab	L01	66	100
Ava	Lab	L02	77	100
Ava	Lab	L03	88	100
Ava	Lab	L04	99	100
Ava	Lab	L05	55	100
Ava	Lab	L06	66	100
Ava	Lab	L07	77	100
Shane	Lab	L01	90	100
Shane	Lab	L02	0	100
Shane	Lab	L03	100	100
Shane	Lab	L04	50	100
Shane	Lab	L05	40	100
Shane	Lab	L06	60	100
Shane	Lab	L07	80	100
Chelsey	Quiz	Q01	100	100
Chelsey	Quiz	Q02	100	100
Chelsey	Quiz	Q03	98	100
Chelsey	Quiz	Q04	93	100
Chelsey	Quiz	Q05	99	100
Chelsey	Quiz	Q06	88	100
Chelsey	Quiz	Q07	100	100
Sam	Quiz	Q01	91	100
Sam	Quiz	Q02	85	100
Sam	Quiz	Q03	33	100
Sam	Quiz	Q04	64	100
Sam	Quiz	Q05	54	100
Sam	Quiz	Q06	95	100
Sam	Quiz	Q07	68	100
Andrew	Quiz	Q01	25	100
Andrew	Quiz	Q02	84	100
Andrew	Quiz	Q03	59	100
Andrew	Quiz	Q04	93	100
Andrew	Quiz	Q05	85	100
Andrew	Quiz	Q06	94	100
Andrew	Quiz	Q07	58	100
Ava	Quiz	Q01	88	100
Ava	Quiz	Q02	99	100
Ava	Quiz	Q03	44	100
Ava	Quiz	Q04	55	100
Ava	Quiz	Q05	66	100
Ava	Quiz	Q06	77	100
Ava	Quiz	Q07	88	100
Shane	Quiz	Q01	70	100
Shane	Quiz	Q02	90	100
Shane	Quiz	Q03	100	100
Shane	Quiz	Q04	100	100
Shane	Quiz	Q05	80	100
Shane	Quiz	Q06	80	100
Shane	Quiz	Q07	80	100
Chelsey	Survey	WS	5	5
Sam	Survey	WS	5	5
Andrew	Survey	WS	5	5
Ava	Survey	WS	5	5
Shane	Survey	WS	5	5

Scrutinizer · February 2, 2020, 2:23pm

Hi, some quick thoughts on the last code snippet:
In awk the middle section is processed once per line, so you should leave out:
for(i=1;i<=NR;i++)
The results should be stored in arrays so they can be used in the END section.
The END section contains code that runs after all lines have been read by the middle section,
so the following has no business there:

      a[$1]=$2;
      b[$4]=$4*$w
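
A minimal sketch of that layout, assuming whitespace-separated input like the sample in the first post. The three inline data rows and the array name total are made up for illustration. The header is printed once in BEGIN, the per-line work sits in the middle section, and the report is produced in END:

```shell
# Inline sample rows (hypothetical), same column layout as data.csv.
result=$(printf '%s\n' \
  'Student Category Assignment Score Possible' \
  'Sam Final FINAL 58 100' \
  'Sam Survey WS 5 5' \
  'Ava Final FINAL 99 100' |
awk '
BEGIN  { print "Name\tTotal" }        # runs once, before any input is read
NR > 1 { total[$1] += $4 }            # runs once per line; NR > 1 skips the header row
END    { for (n in total)             # runs once, after the last line
           printf "%s\t%d\n", n, total[n] }
')
printf '%s\n' "$result"
```

The order of the student lines is not fixed, because awk does not define an order for "for (n in total)".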

cotilloe · February 2, 2020, 2:38pm

Ahh!! Thanks for that info. It explains why I get the repeated output headers then. Also, I did not realize that about the END statement. I understood it as you do not perform anything in BEGIN, but never knew that calcs and stuff should be done before END.

My biggest issue is that I am not sure how to take an associative array and have another array stored within it. Basically I need it to be Student_Name[Assignment_Category[Assignment Scores]], where it would look like this:

Steve
Lab --- Homework ----- Quiz ---- Final ----Survey
44 ---------- 98 ---------- 78 -------- 88 ------- 5
66 --------- 100 ---------- 85
77 ---------- 88 ---------- 92
86 ---------- 77 ---------- 77

So then I have all of Steve's assignments and their grades, and I can multiply each of the grades by the appropriate weight.
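
For reference, POSIX awk has no arrays of arrays, but a comma-separated subscript such as sum[$1, $2] gives a similar effect: the parts are joined with the built-in SUBSEP string into one key. A minimal sketch, assuming whitespace-separated input; the three sample rows and the array names sum and part are made up for illustration:

```shell
# Inline sample rows (hypothetical), same column layout as data.csv.
result=$(printf '%s\n' \
  'Student Category Assignment Score Possible' \
  'Steve Lab L01 44 100' \
  'Steve Lab L02 66 100' \
  'Steve Quiz Q01 78 100' |
awk '
NR > 1 { sum[$1, $2] += $4 }            # key is $1 SUBSEP $2, e.g. "Steve" SUBSEP "Lab"
END    { for (key in sum) {
           split(key, part, SUBSEP)     # part[1] = student, part[2] = category
           print part[1], part[2], sum[key]
         } }
' | sort)
printf '%s\n' "$result"
```

The sort at the end only makes the output order predictable.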

nezabudka · February 2, 2020, 3:02pm

Hi
I honestly didn't understand anything except the first part of the program.
So I'll just tweak the style to make it better to read

awk -F, '
NR == 1         { next }
!course[$3]     { low[$3] = $4 }
$4 < low[$3]    { low[$3] = $4 }
$4 > high[$3]   { high[$3] = $4 }
                { sum[$3] += $4; ++course[$3] }
END             { print "Name", "Low", "High", "Avg"
                  for (k in course)
                    printf "%s\t%d\t%d\t%.2f\n",  k, low[k], high[k], sum[k]/course[k]
                }
' OFS='\t' data.csv

RudiC · February 2, 2020, 3:06pm

How far would this get you? It prints the weighted total for each student. The weight per category is delivered in file1 in the form you posted:

awk '
FNR == NR       {WEIGHT[$1] = $3
                 next
                }
FNR == 1        {next
                }
                {SUM[$1] += $4 * WEIGHT[$2]
                }
END             {for (s in SUM) print s, SUM[s]
                }
' OFS="\t" file1 file2
Sam     349.45
Chelsey 536.95
Andrew  402.6
Shane   427.75
Ava     434.5

Be aware that nothing is known on the grade calculation algorithm nor the "possible total" that might be needed to calculate it.

cotilloe · February 2, 2020, 3:20pm

That actually helps a lot, as I had tried something like that. What I tried to do was:

SUM[$1] += $4*0.1

That did not work, but by storing the weights in an array, I see how it could be workable. My only question is how to store/use multiple values.
If $2 = Homework then $4 needs to be multiplied by 0.1
if $2 = Quiz then $4 needs to be multiplied by 0.4
if $2 = Lab then $4 needs to be multiplied by 0.3
if $2 = Final then $4 needs to be multiplied by 0.15
if $2 = Survey then $4 needs to be multiplied by 0.05

Then, all I would need to do is sum $4 with the new values and have the overall percent for each student.

RudiC · February 2, 2020, 3:25pm

The weight per category is delivered in file1, in the shape you showed in post #1:

cat file1
Homework = 0.10
Lab = 0.30
Quiz = 0.40
Final = 0.15
Survey = 0.05
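
If a separate weights file is not wanted, the same lookup table could also be filled in a BEGIN block. This is only a sketch of that alternative, assuming whitespace-separated input; the three inline sample rows are made up, and the weights are the ones listed in post #1:

```shell
# Inline sample rows (hypothetical), same column layout as data.csv.
result=$(printf '%s\n' \
  'Student Category Assignment Score Possible' \
  'Sam Final FINAL 58 100' \
  'Sam Homework H01 19 100' \
  'Ava Final FINAL 99 100' |
awk '
BEGIN  { WEIGHT["Homework"] = 0.10; WEIGHT["Lab"]   = 0.30   # one entry per category
         WEIGHT["Quiz"]     = 0.40; WEIGHT["Final"] = 0.15
         WEIGHT["Survey"]   = 0.05 }
NR > 1 { SUM[$1] += $4 * WEIGHT[$2] }   # the category in $2 selects the weight
END    { for (s in SUM) print s, SUM[s] }
' | sort)
printf '%s\n' "$result"
```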

cotilloe · February 2, 2020, 3:46pm

Wow, thanks. I did not know how that worked. So basically, it works like CSS sort of, as long as I mention the file name at the end, it will look in those files for the input?

Just want to be sure I understand fully.

--- Post updated at 09:46 PM ---

I kind of figured it out from looking again but am still confused on a couple of things:

The first line is basically telling the script to get input from the first file, creating and storing the array weight, which holds the values like this: weight[Homework:0.10, Lab:0.30, Quiz:0.40, Final:0.15, Survey:0.05]

How does it/why does it become weight[$2] though?

RudiC · February 2, 2020, 4:02pm

No, it stores it like

awk 'FNR==NR {WEIGHT[$1] = $3; next} END {for (w in WEIGHT) print "WEIGHT[\"" w "\"] =", WEIGHT[w]}' file1
WEIGHT["Lab"] = 0.30
WEIGHT["Homework"] = 0.10
WEIGHT["Survey"] = 0.05
WEIGHT["Quiz"] = 0.40
WEIGHT["Final"] = 0.15

You use each line's $2 immediately as index in the WEIGHT array.
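
A small sketch that makes the lookup visible, assuming the same two-file layout. The temporary directory and the shortened sample files are made up for illustration. FNR == NR is only true while the first file is being read, because NR keeps counting across files while FNR restarts at 1 for each file:

```shell
tmp=$(mktemp -d)                                   # scratch directory for the two sample files
printf '%s\n' 'Homework = 0.10' 'Quiz = 0.40' > "$tmp/file1"
printf '%s\n' \
  'Student Category Assignment Score Possible' \
  'Sam Homework H01 19 100' \
  'Sam Quiz Q01 91 100' > "$tmp/file2"

result=$(awk '
FNR == NR { WEIGHT[$1] = $3; next }   # first file only: store the weight under its category name
FNR == 1  { next }                    # header line of the second file
          { print $2, $4, "*", WEIGHT[$2], "=", $4 * WEIGHT[$2] }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -r "$tmp"
```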

cotilloe · February 2, 2020, 4:47pm

OK, I think I have the basics of how to use the two files then. So, using the snippet of code you posted gives the total points for each student, but how do I divide those by 575.25 to get the final percentage score for each student?

I have tried a few things and none seem to work. I tried SUM[$1]/575.25 and got an illegal array reference error.
I tried x=SUM[s]/575.25 at the end, but just got 0 all the way down the list, and saying x=SUM[s]/575.25 returned a single name and a 0...

RudiC · February 3, 2020, 3:03am

Show what you did. Include context. And how you came to 575.25.

cotilloe · February 3, 2020, 9:39am

I got 575.25 outside of the script. There are 7 Homework assignments, 7 Labs, 7 Quizzes and 1 Final with a total of 100 points possible on each. Then there is the survey worth a possible 5 points.

Knowing the weighted values:
Homework 10%
Labs 30%
Quizzes 40%
Final 15%
Survey 5%

I applied that to the possible total scores, as well
Homework --> 700*0.1 = 70
Labs --> 700 * 0.3 = 210
Quizzes --> 700 * 0.4 = 280
Final --> 100 * 0.15 = 15
Survey --> 5 * 0.05 = 0.25
All the weights add up to 100%, so it is good there. I then added the totals: 70 + 210 + 280 + 15 + 0.25 = 575.25
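
The arithmetic above can be checked with a short awk sketch. It uses only the figures from this post plus Sam's weighted sum of 349.45 from the earlier output; the variable name total is made up:

```shell
result=$(awk 'BEGIN {
  # possible points per category times the category weight
  total = 700 * 0.10 + 700 * 0.30 + 700 * 0.40 + 100 * 0.15 + 5 * 0.05
  printf "%.2f\n", total                   # weighted possible total
  printf "%.2f\n", 349.45 / total * 100    # weighted sum for Sam as a percentage
}')
printf '%s\n' "$result"
```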

I was able to, using the code snippet you provided, divide the totals by the 575.25 and get the final weighted percentage score for each student. I then spent the next several hours trying to get a letter grade assigned to each one based on the percentage score, but had no luck. It currently gives everyone an 'A' no matter what their percent score was, which I am sure a student would like, but not going to work for me...lol.

Very frustrating, trying to learn awk on the fly like this. ... Here is the most current version of the code I have with various comments on why/what is going on:


awk -F, '
FNR == NR       {WEIGHT[$1] = $3
                 next
                }
FNR == 1        {next
                }
     {SUM[$1] += $4 * WEIGHT[$2]
     per[$1]=SUM[$1]/575.25*100       # gives me the correct weighted final scores for each student
     TOTAL=SUM[$1]/575.25*100       # Assign value to variable cuz I have tried and failed to access it directly from per
     grd[$1]                                            # initializing a new array.. i do not know why. I am just trying things at this point
for (g in grd)                                      # tried without a loop, so now trying in a loop
    if(TOTAL > 97)
    {
       gr="A+"        
    }
    else if (94 < TOTAL <= 97)
    {
       gr="A"                                          # Everyone gets an A which is strange since it is not the first possibility and the student that 
    }	                                                     # actually has an A is not the first record being processed
    else if (90 < TOTAL <= 94)
    {
       gr="A-"        
    }
    else if(87 < TOTAL <= 90)
    {
       gr="B+"        
    }
    else if (84 < TOTAL <= 87)
    {
       gr="B"        
    }
    else if (80 < TOTAL <= 84)
    {
       gr="B-"        
    }
    else if (76 < TOTAL <= 80)
    {
       gr="C+"        
    }
    else if (70 < TOTAL <= 76)
    {
       gr="C"        
    }
    else if (60 < TOTAL <= 70)
    {
       gr="D"        
    }
    else
    {
       gr="E"        
    }
grd[$1]=gr                                          # trying to assign the gr variable to the value of new array... did not work too well
}

END {print "Name\tPercent\tGrade\n" 
           for (p in per)
           printf "%s\t%.2f\t%s\n", p, per[p], grd[$1] }' OFS="\t" file1 data.csv     # have tried grd[p], just the variable gr and doing the calculations inline.

Adding output and noticed that the student I said deserves an A actually deserves an A-....
Output:
Name Percent Grade
Sam 60.75 A
Chelsey 93.34 A
Andrew 69.99 A
Shane 74.36 A
Ava 75.53 A

RudiC · February 3, 2020, 10:02am

Try instead

awk '
function GRD(AVG, R)    {if             (AVG >= 97)     return "A+"
                           else if      (AVG >= 94)     return "A"
                           else if      (AVG >= 90)     return "A-"
                           else if      (AVG >= 87)     return "B+"
                           else if      (AVG >= 84)     return "B"
                           else if      (AVG >= 80)     return "B-"
                           else if      (AVG >= 76)     return "C+"
                           else if      (AVG >= 70)     return "C"
                           else if      (AVG >= 60)     return "D"
                           else                         return "E"
                        }


FNR==NR         {WEIGHT[$1] = $3
                 next
                }
FNR == 1        {next
                }
                {SUM[$1] += $4 * WEIGHT[$2]
                }
END             {for (s in SUM) {AVG = SUM[s]/5.7525
                                 print s, SUM[s], AVG, GRD(AVG)
                                }
                }
' OFS="\t" file1 file2
Sam     349.45  60.7475 D
Chelsey 536.95  93.342  A-
Andrew  402.6   69.987  D
Shane   427.75  74.359  C
Ava     434.5   75.5324 C

There are more elegant grade determination approaches, that is...

cotilloe · February 3, 2020, 5:58pm

Wow, very simple. But why wouldn't it work the way I was trying? (Using an if/else block)
What does the 'R' represent in the second arg for the function?

RudiC · February 3, 2020, 7:06pm

Difficult to say without in depth analysis. Guessing from a first peek: You got the blocks mixed up. The grade can only be determined when all results are summed up, i.e. in the END section.
And, you're testing too many conditions. Once the average IS NOT greater than or equal, it is automatically less, and the test can go.
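
One more possible cause of the all-A result, offered as an assumption rather than a full diagnosis: awk has no chained comparisons. In awks that accept 94 < TOTAL <= 97 at all, it is read as (94 < TOTAL) <= 97, and the left side is only ever 0 or 1, so the test is always true. A sketch with each range written as two comparisons joined by &&; the function name letter and the sample values are made up, and only three bands are shown:

```shell
result=$(awk '
function letter(t) {                        # hypothetical helper, three bands only
  if (t > 97)                  return "A+"
  else if (t > 94 && t <= 97)  return "A"   # two explicit comparisons joined with &&
  else if (t > 90 && t <= 94)  return "A-"
  return "lower"                            # everything at or below 90
}
BEGIN { print letter(98), letter(95.5), letter(93.34), letter(60.75) }
')
printf '%s\n' "$result"
```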

The R is a residual of a former test; it can go away.

How about

awk '
BEGIN           {split ("97 94 90 87 84 80 76 70 60 0", THRSH)
                 split ("A+ A  A- B+ B  B- C+ C  D  E", TMPGR)
                }
function GRD(AVG,   i)  {while ((AVG < THRSH[++i]) && (i < 11)) ;
                         return TMPGR[i]
                        }


FNR==NR         {WEIGHT[$1] = $3
                 next
                }
FNR == 1        {next
                }
                {SUM[$1]    += $4 * WEIGHT[$2]
                 TOT[$2,$3] += $5 * WEIGHT[$2]
                 CNT[$2,$3]++
                }
END             {for (t in TOT) TOTAL += TOT[t] / CNT[t]
                 for (s in SUM) {AVG = SUM[s] / TOTAL * 100
                                 print s, SUM[s], AVG, GRD(AVG)
                                }
                }
' OFS="\t" OFMT="%.2f" file1 file2
Sam        349.45    60.75    D
Chelsey    536.95    93.34    A-
Andrew     402.60    69.99    D
Shane      427.75    74.36    C
Ava        434.50    75.53    C

Not sure if the algorithm to determine the TOTAL possible is reliable in other contexts - it takes the average per assignment across the participants, thus ruling out that students have missed an exam.
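
A possible alternative, sketched here under the assumption that every row of a given assignment carries the same Possible value: count each category/assignment pair once, the first time it is seen, no matter how many students have a row for it. The array name SEEN, the temporary directory and the shortened sample files are made up for illustration:

```shell
tmp=$(mktemp -d)                                   # scratch directory for the two sample files
printf '%s\n' 'Lab = 0.30' 'Survey = 0.05' > "$tmp/file1"
printf '%s\n' \
  'Student Category Assignment Score Possible' \
  'Sam Lab L01 41 100' \
  'Ava Lab L01 66 100' \
  'Ava Lab L02 77 100' \
  'Sam Survey WS 5 5' > "$tmp/file2"

result=$(awk '
FNR == NR { WEIGHT[$1] = $3; next }     # weights from the first file
FNR == 1  { next }                      # header line of the second file
!(($2, $3) in SEEN) {                   # first row for this category/assignment pair
            SEEN[$2, $3] = 1
            TOTAL += $5 * WEIGHT[$2]    # add its weighted possible points exactly once
          }
END       { print TOTAL }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -r "$tmp"
```

In this sample Sam has no row for L02, yet that lab still contributes its 30 weighted points to the total.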

cotilloe · February 8, 2020, 12:18am

I did not see your reply with the alternate code. Thanks for that. As far as the "algorithm", lol, there really is nothing more to it. The actual file does not contain student grade data but it does have names, categories and title and then a numerical column which needs to be averaged for all titles and then a string value given based on avg per name, as you have helped me with here.

I actually have another issue. Posting it here for reference but will start a new thread and give a bit of back story on it, but I need to do the same EXACT thing, only now in Perl.... Like I said, I will explain more in a new thread, so this one can be closed properly.