Grouping and counting

Hello,

I would like to group/sort a file of records by a particular field and then count how many records belong in that grouping.

For example say I have the following data:

1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
 

i.e. I'd like to group by the last field and count so effectively the output would look as so:

AA 2
BB 3

etc...

Try awk:

awk -F'|' '{arr[$NF]++} END {for (i in arr) print i, arr[i]}' inputfile

The data will not come out sorted; you can pipe the result into sort if you need it that way.
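A quick end-to-end sketch with the sample data from the question (the inputfile name is just a placeholder; sort is added only to order the output):

```shell
# Sample data from the question, saved under a placeholder file name.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# Count occurrences of the last pipe-delimited field; sort for stable output.
awk -F'|' '{arr[$NF]++} END {for (i in arr) print i, arr[i]}' inputfile | sort
```

which prints "AA 2" and "BB 3".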


Hello Nik44,

Welcome to the forums; I hope you enjoy learning here. Could you please try the following and let me know if it helps (if you are not bothered about the order of the output).

awk -F"|" '{A[$NF]++} END{for(i in A){print i, A[i]}}'  Input_file

Output will be as follows.

BB 3
AA 2

If you need the output in the same sequence as the Input_file, then try the following (it reads the Input_file twice).

awk -F"|" 'FNR==NR{A[$NF]++;next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'   Input_file  Input_file

Output will be as follows.

AA 2
BB 3

Hope this helps you.

Thanks,
R. Singh

Hi.

Also with standard utilities:

cut -f8 -d'|' <your-input-file> | sort | uniq -c 

producing

      2 AA
      3 BB

Best wishes ... cheers, drl
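If you want the key before the count (as in the desired "AA 2" output) rather than uniq -c's count-first layout, one small extension is to swap the columns with a trailing awk (a sketch, again using a placeholder file name):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# uniq -c prints "count key"; the trailing awk swaps this to "key count".
cut -f8 -d'|' inputfile | sort | uniq -c | awk '{print $2, $1}'
```

which prints "AA 2" and "BB 3".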


Thanks for the responses! I've gone for the more straightforward cut option, as it provides the desired output.

However I forgot to add to this question... Is there a way of refining this output further so that it will only count a record of that group where field four is "1" and where field five is "Y"? I realise in the example supplied it would still show the exact same count.

Hello Nik44,

Could you please try following and let me know if this helps you.

awk -F"|" 'FNR==NR{gsub(/\"/,X,$0);if($4==1 && $5=="Y"){A[$NF]++};next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'  Input_file  Input_file

On a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

Thanks,
R. Singh

Thanks Ravinder.

The command executes without any issues; however, it doesn't seem to have altered the file (it is identical to how it was before the command). Is this perhaps because my data includes literal double quotes?

I tried escaping them, but it doesn't seem to make any difference :confused:

Hello Nik44,

You could say thanks to a person by hitting the THANKS button at the left of every post.
The above awk command will not put its output back into the same Input_file; you should try the following to do that.

awk -F"|" 'FNR==NR{gsub(/\"/,X,$0);if($4==1 && $5=="Y"){A[$NF]++};next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'  Input_file  Input_file > tmp_file
mv tmp_file  Input_file

Thanks,
R. Singh


Since you like the solution using 'cut': You could use grep before doing the cut, to return only the lines with the desired properties.

Hi.

You're welcome.

My rule of thumb is that when specific content of a specific field needs to be examined, manipulated, etc., I reach for awk first (perl second), because the field-separating facilities are very good.

If you can certify that the contents of fields 4 and 5 are unique within the content of all fields in a line, then you may be able to use the suggestion from rovf to use grep, because grep considers the content of the entire line without regard to fields. Otherwise, an awk solution seems like the best approach.

Best wishes ... cheers, drl

To retain order you may try this

[akshay@localhost tmp]$ cat file
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|CC
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|CC
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
[akshay@localhost tmp]$ awk -F\| '!($NF in c){d[++q]=$NF}{c[$NF]++}END{for(i=1; i in d; i++)print d[i],c[d[i]]}' file
AA 2
CC 2
BB 3

Given that the field *separators* are unique, we can use grep even if the same field content already occurs earlier in the line. We just have to use extended regular expressions and anchor the search at the start of the line.

Something like

grep -E '^([^|]+[|]){3}.1...Y'  .....
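Dropped into the earlier cut pipeline, that anchored pattern might look like this (a sketch; the pattern relies on the quoting style of the sample data, where field 4 is "1" and field 5 is "Y"):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# Skip the first three pipe-delimited fields, then require "1"|"Y"
# (each . matches a literal quote or pipe), before cutting and counting.
grep -E '^([^|]+[|]){3}.1...Y' inputfile | cut -f8 -d'|' | sort | uniq -c
```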

You should do a forum search before opening a thread. Exactly this problem (and some of the suggested solutions) was discussed at length in this thread. It is called a "control break".

I hope this helps.

bakunin


Hi.

I have posted a comparison of the grep and awk approaches that rovf and I briefly discussed here. Because it is not strictly on-point here, I have created a new thread at http://www.unix.com/shell-programming-and-scripting/267875-one-instance-comparing-grep-awk.html#post302979659

cheers, drl

One awk command might be faster than a grep | cut | sort | uniq command chain.
The FS (or -F) value is an ERE, so one can put two optional " characters in it.

file="inputfile"
awk -F '"?[|]"?' '($4==1 && $5=="Y") { A[$NF]++ } END { for (i in A) print i, A[i] }' $file > $file.new &&
mv $file.new $file

awk produces a new file. If successful, the mv command replaces the input file with it.
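A quick run against the sample data from the question (a sketch; sort is only added at the end because the order of a "for (i in A)" loop is unspecified):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

file="inputfile"
# The ERE field separator strips the surrounding quotes, so $4 and $5
# can be compared directly; count last-field values where $4==1 and $5=="Y".
awk -F '"?[|]"?' '($4==1 && $5=="Y") { A[$NF]++ } END { for (i in A) print i, A[i] }' "$file" > "$file.new" &&
mv "$file.new" "$file"

sort "$file"
```

which prints "AA 2" and "BB 3".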