How to calculate a sum of certain records?

sickboy · June 10, 2005, 9:23am

Hi,
I have a file where the records look like this:

vt100 2048 D402 MG0010 0
586 262144 D403 MG0011 1000
486 8192 D404 MG0012 270
386 8192 A423 CC0177 40
586 65536 A424 CC0182 670
486 16384 A423 CC0183 100
486 16384 A425 CC0184 80
65000 4096 B407 EE1027 80

First, I want to count how many times each combination of the first two letters of the 4th field occurs, and print it like this:
3 MG or MG 3
5 CC

I tried this :

awk -F"\t" '{print $4}' pcs.txt | uniq -c

but I didn't know how to count only the first two letters of the field.

Secondly,
I want to sum the second field for each different entry of the 3rd field and print the totals, like this:
A423 24576
A425 16384

I tried this :

awk -F"\t" '{a[$3]=$3; 
for (i = 1; i <= NR; i++){
for (i = 1; i <= NR; i++){
if(a[++i] = $3){
size+=$2}
}
print a" "size
}' pcs.txt

but there is a mistake somewhere that I can't find, and I am also not sure whether this is going to give me what I want.

Please help me

vino · June 10, 2005, 9:50am

This should do it for your first counting part...

sed -n -e 's/.*\([A-Z][A-Z]\)[0-9]*.*/\1/p' pcs.txt | uniq -c

For your input, it gave,

  3 MG
  4 CC
  1 EE

Basically, the script strips out every character except the two upper-case characters. It also assumes that your input always obeys the format you provided.

Vino

vino · June 10, 2005, 9:56am

Oops...

Got the wrong message in the wrong thread.

Vino

sickboy · June 10, 2005, 10:07am

Maybe I didn't explain it right. I need to find the sum for every different value of the 3rd field. So, for A423, I need to sum its two numbers and print the total.

PS Sorry for my English

sickboy · June 10, 2005, 10:18am

It works for the part of the file I gave at the beginning, but it does not work correctly on the whole file. Probably uniq -c only counts identical lines that are next to each other, not ones spread throughout the file. To get the result I wanted, I used sort.

sed -n 's/.*\([A-Z][A-Z]\)[0-9]*.*/\1/p' pcs.txt | sort | uniq -c

Do you know if instead of sed I can use awk to get the same result?

vino · June 10, 2005, 10:32am

I think uniq needs a sorted list to work properly.

Not sure, tho'.

Vino
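
A small sketch of the behaviour discussed here: uniq only collapses identical lines that are adjacent, so the input usually has to be sorted first. The four-line input and the function name prefixes are made up for illustration and are not taken from pcs.txt.

```shell
# Hypothetical, deliberately unsorted list of two-letter prefixes.
prefixes() {
  printf '%s\n' MG CC MG CC
}

# Without sort: no two equal lines are adjacent, so uniq reports
# four separate groups, each with a count of 1.
prefixes | uniq -c

# With sort: equal lines become adjacent, so uniq reports
# two groups, each with a count of 2.
prefixes | sort | uniq -c
```

The exact padding in front of the counts differs between uniq implementations, so only the number of groups should be relied on here.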

r2007 · June 10, 2005, 10:39am

For the 2nd question:

awk -F"\t" '{a[$3]+=$2}END{for (i in a) print i,a[i]}' pcs.txt

[not tested]
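
A self-contained sketch of the same idea, run against the sample rows from the question. It assumes the real file is tab-separated (as the -F"\t" in the thread suggests); the names sample_data and sum_by_third are only illustrative. The final sort is there because awk does not guarantee any order for "for (k in array)".

```shell
# The eight sample records from the question, written out tab-separated.
# printf reuses the format for every group of five arguments.
sample_data() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    vt100 2048 D402 MG0010 0 \
    586 262144 D403 MG0011 1000 \
    486 8192 D404 MG0012 270 \
    386 8192 A423 CC0177 40 \
    586 65536 A424 CC0182 670 \
    486 16384 A423 CC0183 100 \
    486 16384 A425 CC0184 80 \
    65000 4096 B407 EE1027 80
}

# Add field 2 into an array slot keyed by field 3, then print every
# key with its total once all lines have been read.
sum_by_third() {
  awk -F'\t' '{ total[$3] += $2 }
              END { for (k in total) print k, total[k] }' | sort
}

sample_data | sum_by_third
```

For the real file, something like "sum_by_third < pcs.txt" should behave the same way, provided the fields really are separated by tabs.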

vino · June 10, 2005, 10:50am

You could use awk, but you still need to find the frequency based on the first 2 characters of the 4th field.

I am not sure how you can extract characters using awk.

Maybe someone can post a solution for that.

Vino

r2007 · June 10, 2005, 11:01am

awk -F"\t" '{a[substr($4,1,2)]++}END{for (i in a) print i,a[i]}' pcs.txt
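
As a rough illustration of the substr approach on the sample rows from the question (assuming tab-separated fields; sample_data and count_prefixes are illustrative names), substr($4, 1, 2) takes two characters of the 4th field starting at position 1, and that two-letter string is used as the array key:

```shell
# The eight sample records from the question, written out tab-separated.
sample_data() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    vt100 2048 D402 MG0010 0 \
    586 262144 D403 MG0011 1000 \
    486 8192 D404 MG0012 270 \
    386 8192 A423 CC0177 40 \
    586 65536 A424 CC0182 670 \
    486 16384 A423 CC0183 100 \
    486 16384 A425 CC0184 80 \
    65000 4096 B407 EE1027 80
}

# Count one occurrence per line under the two-letter prefix of field 4.
# sort is applied to the finished totals only to get a stable order.
count_prefixes() {
  awk -F'\t' '{ count[substr($4, 1, 2)]++ }
              END { for (p in count) print p, count[p] }' | sort
}

sample_data | count_prefixes
```

Because awk keeps the totals in an array, the input itself does not need to be sorted first, unlike the sed | sort | uniq -c pipeline.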

sickboy · June 10, 2005, 11:04am

Thanks a lot. It is working properly but I cannot understand how it is working.

When you write this
a[$3]+=$2
does it mean that you put the 3rd field in a table and set its value to the sum of its second field?
And why don't we get the same 3rd field many times when printing? (I don't want it printed many times, I just want to understand how it works.)

Cheers

r2007 · June 10, 2005, 11:43am

a[$3]+=$2 <===> a[$3]=a[$3]+$2
with the following data
386 8192 A423 CC0177 40
586 65536 A424 CC0182 670
486 16384 A423 CC0183 100
486 16384 A425 CC0184 80
65000 4096 B407 EE1027 80

AWK processes data line by line (an array element that has not been set yet counts as 0):
line #1: a["A423"]=a["A423"]+8192=8192
line #2: a["A424"]=a["A424"]+65536=65536
line #3: a["A423"]=a["A423"]+16384=8192+16384=24576
...
...
...
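
The walk-through above can be reproduced with a small sketch that prints the running total after every line. It uses the five rows listed above, assumes tab-separated fields, and the names rows and trace are only illustrative. Each distinct 3rd-field value is a single key in the array, which is why it shows up only once when the array is printed at the end.

```shell
# The five rows used in the explanation, written out tab-separated.
rows() {
  printf '%s\t%s\t%s\t%s\t%s\n' \
    386 8192 A423 CC0177 40 \
    586 65536 A424 CC0182 670 \
    486 16384 A423 CC0183 100 \
    486 16384 A425 CC0184 80 \
    65000 4096 B407 EE1027 80
}

# After each input line, show the key that was updated and its new total.
# NR is the number of the line awk is currently processing.
trace() {
  awk -F'\t' '{ a[$3] += $2; print "line #" NR ":", $3, "=", a[$3] }'
}

rows | trace
```

On line #3 the key A423 already holds 8192, so adding 16384 gives 24576 instead of creating a second A423 entry.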

Sorry for my poor English. That's all I can explain to you.