Hi all,
I have a quick question:
I have a 4 column tab-separated file.
I want to count the number of times each unique value in column 2 appears and add that number in a 5th column.
I have the following input file:
waterline-n below-sheath-v 14.8097 A
dock-n below-sheath-v 14.5095 B
waterline-n below-steel-n 11.0330 A
picnic-n below-steel-n 12.2277 C
wavefront-n at-part-of-variance-n 18.4888 L
wavefront-n between-part-of-variance-n 17.0656 A
audience-b between-part-of-variance-n 17.6346 B
game-n between-part-of-variance-n 14.9652 C
whereabouts-n become-rediscovery-n 11.3556 L
whereabouts-n get-tee-n 10.9091 L
and the following is the desired output
waterline-n below-sheath-v 14.8097 A 2
dock-n below-sheath-v 14.5095 B 2
waterline-n below-steel-n 11.0330 A 2
picnic-n below-steel-n 12.2277 C 2
wavefront-n at-part-of-variance-n 18.4888 L 1
wavefront-n between-part-of-variance-n 17.0656 A 3
audience-b between-part-of-variance-n 17.6346 B 3
game-n between-part-of-variance-n 14.9652 C 3
whereabouts-n become-rediscovery-n 11.3556 L 1
whereabouts-n get-tee-n 10.9091 L 1
How can I combine sort and grep for the desired output?
Thank you.
I just realized that your script counts all of the unique instances in Column 2.
However, I need to count the number of different items in Column 4 that a value in Column 2 occurs with in --- in this case, real frequency does not matter.
I will provide more sample files, in case my question was not clear:
For instance this input file:
waterline-n below-sheath-v 14.8097 A
dock-n below-sheath-v 14.5095 B
waterline-n below-steel-n 11.0330 A
picnic-n below-steel-n 12.2277 C
game-n below-steel-n 12.2277 D
dock-n below-steel-n 12.2277 D
wavefront-n at-part-of-variance-n 18.4888 L
wavefront-n between-part-of-variance-n 17.0656 A
audience-b between-part-of-variance-n 17.6346 B
game-n between-part-of-variance-n 14.9652 C
whereabouts-n become-rediscovery-n 11.3556 L
whereabouts-n get-tee-n 10.9091 L
should yield this output file:
waterline-n below-sheath-v 14.8097 A 2
dock-n below-sheath-v 14.5095 B 2
waterline-n below-steel-n 11.0330 A 3
picnic-n below-steel-n 12.2277 C 3
game-n below-steel-n 12.2277 D 3
dock-n below-steel-n 12.2277 D 3
wavefront-n at-part-of-variance-n 18.4888 L 1
wavefront-n between-part-of-variance-n 17.0656 A 3
audience-b between-part-of-variance-n 17.6346 B 3
game-n between-part-of-variance-n 14.9652 C 3
whereabouts-n become-rediscovery-n 11.3556 L 1
whereabouts-n get-tee-n 10.9091 L 1
Where "below-steal-n" acutally occurs 4 times --- yet with 3 different items in Column 4 --- thus, in Column 5 its result is 3.
Not sure if your sample output is correct based on your description: the 'between-part-of-variance-n' fields are not counted correctly.
Pls validate/clarify