Counting values in a specific column and adding the result to the output

Hi all,
I have a quick question:
I have a 4-column, tab-separated file.
I want to count the number of times each unique value in column 2 appears and add that number as a 5th column.

I have the following input file:

waterline-n    below-sheath-v    14.8097    A
dock-n    below-sheath-v     14.5095    B
waterline-n    below-steel-n    11.0330    A
picnic-n    below-steel-n    12.2277    C
wavefront-n    at-part-of-variance-n    18.4888    L
wavefront-n    between-part-of-variance-n    17.0656    A
audience-b    between-part-of-variance-n    17.6346    B
game-n    between-part-of-variance-n    14.9652    C
whereabouts-n    become-rediscovery-n    11.3556    L
whereabouts-n    get-tee-n    10.9091    L

and the following is the desired output

waterline-n    below-sheath-v    14.8097    A    2
dock-n    below-sheath-v    14.5095    B    2
waterline-n    below-steel-n    11.0330    A    2
picnic-n    below-steel-n    12.2277    C    2
wavefront-n    at-part-of-variance-n    18.4888    L    1
wavefront-n    between-part-of-variance-n    17.0656    A    3
audience-b    between-part-of-variance-n    17.6346    B    3
game-n    between-part-of-variance-n    14.9652    C    3
whereabouts-n    become-rediscovery-n    11.3556    L    1
whereabouts-n    get-tee-n    10.9091    L    1

How can I combine sort and grep for the desired output?
Thank you.

Try this:

awk 'NR==FNR{a[$2]++;next}{print $0 "\t" a[$2]}' file file
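In case it helps to see what it is doing, here is a commented sketch of the same two-pass idea (assuming the data is in a file called file, which is read twice):

awk '
NR==FNR { a[$2]++; next }       # pass 1: count how often each column-2 value occurs
{ print $0 "\t" a[$2] }         # pass 2: append that count as a tab-separated 5th column
' file file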

Thanks, franklin52.

I just realized that your script counts all of the occurrences of each value in Column 2.
However, I need to count the number of different Column 4 items that each Column 2 value occurs with; the raw frequency itself does not matter.

Here are more sample files in case my question was not clear.

For instance, this input file:

waterline-n    below-sheath-v    14.8097    A 
dock-n    below-sheath-v     14.5095    B 
waterline-n    below-steel-n    11.0330    A 
picnic-n    below-steel-n    12.2277    C 
game-n    below-steel-n    12.2277    D 
dock-n    below-steel-n    12.2277    D 
wavefront-n    at-part-of-variance-n    18.4888    L 
wavefront-n    between-part-of-variance-n    17.0656    A 
audience-b    between-part-of-variance-n    17.6346    B 
game-n    between-part-of-variance-n    14.9652    C 
whereabouts-n    become-rediscovery-n    11.3556    L 
whereabouts-n    get-tee-n    10.9091    L

should yield this output file:

waterline-n    below-sheath-v    14.8097    A    2
dock-n    below-sheath-v    14.5095    B    2
waterline-n    below-steel-n    11.0330    A    3
picnic-n    below-steel-n    12.2277    C    3
game-n    below-steel-n    12.2277    D    3
dock-n    below-steel-n    12.2277    D    3
wavefront-n    at-part-of-variance-n    18.4888    L    1
wavefront-n    between-part-of-variance-n    17.0656    A    3
audience-b    between-part-of-variance-n    17.6346    B    3
game-n    between-part-of-variance-n    14.9652    C    3
whereabouts-n    become-rediscovery-n    11.3556    L    1
whereabouts-n    get-tee-n    10.9091    L    1

Where "below-steal-n" acutally occurs 4 times --- yet with 3 different items in Column 4 --- thus, in Column 5 its result is 3.

awk '
{ idx = ($2 SUBSEP $4) }                                      # build a column2/column4 pair key
FNR==NR { if (!(idx in c)) { c[idx]++; cf2[$2]++ }; next }    # pass 1: count distinct pairs per column-2 value
{ print $0, cf2[$2] }                                         # pass 2: append that distinct count
' inFile inFile

Not sure if your sample output is correct based on your description: the 'between-part-of-variance-n' fields are not counted correctly.
Please validate/clarify.
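As a side note (not part of the solution above): if reading the file twice is a problem, for example because the data arrives on a pipe, a single-pass sketch that buffers the lines in memory could look roughly like this; inFile and the array names are just placeholders:

awk '
{
    line[NR] = $0                                     # buffer every input line
    key[NR]  = $2                                     # remember its column-2 value
    idx = $2 SUBSEP $4
    if (!(idx in seen)) { seen[idx] = 1; cnt[$2]++ }  # count distinct (col2, col4) pairs
}
END {
    for (i = 1; i <= NR; i++) print line[i] "\t" cnt[key[i]]
}' inFile

The tradeoff is memory: this holds the whole file in arrays, while the two-pass version only keeps the counts.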


You are correct.
I have revised the sample output to reflect the change.
Human error!