Grouping and counting

Hello,

I would like to group/sort a file of records by a particular field and then count how many records belong in that grouping.

For example say I have the following data:

1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
 

i.e. I'd like to group by the last field and count so effectively the output would look as so:

AA 2
BB 3

etc...

Try awk:

awk -F'|' '{arr[$NF]++} END {for (i in arr) print i, arr[i]}' inputfile

The data will not come out sorted; you can pipe the result into sort if you need it that way.
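A quick end-to-end sketch with the sample data from the question (the inputfile name is just a placeholder; sort is added only to order the output):

```shell
# Sample data from the question, saved under a placeholder file name.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# Count occurrences of the last pipe-delimited field; sort for stable output.
awk -F'|' '{arr[$NF]++} END {for (i in arr) print i, arr[i]}' inputfile | sort
```

which prints "AA 2" and "BB 3".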


Hello Nik44,

Welcome to the forums; I hope you enjoy learning here. Could you please try the following and let me know if it helps (if you are not bothered about the order of the output).

awk -F"|" '{A[$NF]++} END{for(i in A){print i, A[i]}}'  Input_file

Output will be as follows.

BB 3
AA 2

If you need the output in the same sequence as the Input_file, then try the following (it reads the Input_file twice).

awk -F"|" 'FNR==NR{A[$NF]++;next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'   Input_file  Input_file

Output will be as follows.

AA 2
BB 3

Hope this helps you.

Thanks,
R. Singh

Hi.

Also with standard utilities:

cut -f8 -d'|' <your-input-file> | sort | uniq -c 

producing

      2 AA
      3 BB

Best wishes ... cheers, drl
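If you want the key before the count (as in the desired "AA 2" output) rather than uniq -c's count-first layout, one small extension is to swap the columns with a trailing awk (a sketch, again using a placeholder file name):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# uniq -c prints "count key"; the trailing awk swaps this to "key count".
cut -f8 -d'|' inputfile | sort | uniq -c | awk '{print $2, $1}'
```

which prints "AA 2" and "BB 3".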


Thanks for the responses! I've gone for the more straightforward cut option, as it provides the desired output.

However I forgot to add to this question... Is there a way of refining this output further so that it will only count a record of that group where field four is "1" and where field five is "Y"? I realise in the example supplied it would still show the exact same count.

Hello Nik44,

Could you please try following and let me know if this helps you.

awk -F"|" 'FNR==NR{gsub(/\"/,X,$0);if($4==1 && $5=="Y"){A[$NF]++};next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'  Input_file  Input_file

On a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk .

Thanks,
R. Singh

Thanks Ravinder.

The command executes without any issues; however, it doesn't seem to have altered the file (it is identical to how it was before the command). Is this perhaps because my data includes literal double quotes?

I tried escaping them, but it doesn't seem to make any difference :confused:

Hello Nik44,

You could say thanks to a person by hitting the THANKS button at the left of every post.
The above awk command will not put its output back into the same Input_file; you should try the following to do that.

awk -F"|" 'FNR==NR{gsub(/\"/,X,$0);if($4==1 && $5=="Y"){A[$NF]++};next} ($NF in A){print $NF, A[$NF];delete A[$NF]}'  Input_file  Input_file > tmp_file
mv tmp_file  Input_file

Thanks,
R. Singh


Since you like the solution using 'cut': You could use grep before doing the cut, to return only the lines with the desired properties.

Hi.

You're welcome.

My rule of thumb is that when specific content of a specific field needs to be examined, manipulated, etc., I reach for awk first (perl second), because the field-separating facilities are very good.

If you can certify that the contents of fields 4 and 5 are unique within the content of all fields in a line, then you may be able to use the suggestion from rovf to use grep, because grep considers the content of the entire line without regard to fields. Otherwise, an awk solution seems like the best approach.

Best wishes ... cheers, drl

To retain order you may try this

[akshay@localhost tmp]$ cat file
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|CC
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|CC
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
[akshay@localhost tmp]$ awk -F\| '!($NF in c){d[++q]=$NF}{c[$NF]++}END{for(i=1; i in d; i++)print d[i],c[d[i]]}' file
AA 2
CC 2
BB 3

Given that the field *separators* are unique, we can use grep even if the same field content already occurs earlier in the line. We just have to use extended regular expressions and anchor the search at the start of the line.

Something like

grep -E '^([^|]+[|]){3}.1...Y'  .....
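Dropped into the earlier cut pipeline, that anchored pattern might look like this (a sketch; the pattern relies on the quoting style of the sample data, where field 4 is "1" and field 5 is "Y"):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

# Skip the first three pipe-delimited fields, then require "1"|"Y"
# (each . matches a literal quote or pipe), before cutting and counting.
grep -E '^([^|]+[|]){3}.1...Y' inputfile | cut -f8 -d'|' | sort | uniq -c
```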

You should do a forum search before opening a thread. Exactly this problem (and some of the suggested solutions) was discussed at length in this thread. It is called a "control break".

I hope this helps.

bakunin


Hi.

I have posted a comparison of the grep and awk approaches that rovf and I briefly discussed here. Because it is not strictly on-point here, I have created a new thread at http://www.unix.com/shell-programming-and-scripting/267875-one-instance-comparing-grep-awk.html#post302979659

cheers, drl

One awk command might be faster than a grep | cut | sort | uniq command chain.
The FS (or -F) value is an ERE, so one can put two optional " characters in it.

file="inputfile"
awk -F '"?[|]"?' '($4==1 && $5=="Y") { A[$NF]++ } END { for (i in A) print i, A[i] }' $file > $file.new &&
mv $file.new $file

awk produces a new file. If successful, the mv command replaces the input file with it.
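A quick run against the sample data from the question (a sketch; sort is only added at the end because the order of a "for (i in A)" loop is unspecified):

```shell
# Sample data from the question.
cat > inputfile <<'EOF'
1234|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
ABCD|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|AA
EFGH|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
IJKLM|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
NOPQ|"ZZZ"|"Date"|"1"|"Y"|"ABC"|""|BB
EOF

file="inputfile"
# The ERE field separator strips the surrounding quotes, so $4 and $5
# can be compared directly; count last-field values where $4==1 and $5=="Y".
awk -F '"?[|]"?' '($4==1 && $5=="Y") { A[$NF]++ } END { for (i in A) print i, A[i] }' "$file" > "$file.new" &&
mv "$file.new" "$file"

sort "$file"
```

which prints "AA 2" and "BB 3".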