Count and keep duplicates in Column

pshields1984 · March 23, 2016, 11:41am

Hi folks,

I've got a csv file called test.csv

Column A    Column B
Apples      1900
Apples      1901
Pears       1902
Pears       1903

I want to count and keep duplicates in the first column. Desired output:

Column A    Column B    Column C
Apples      2           1900
Apples      2           1901
Pears       2           1902
Pears       2           1903

I have tried sort and uniq, but to no avail: uniq -c collapses the duplicates into a single line, and I need to keep them.

Any help would be great.

Thanks.

RudiC · March 23, 2016, 12:05pm

Please use code tags as required by forum rules!

I guess the second column header should go with the column, no? Having fields with the field separator inside doesn't really help processing. Try

awk 'NR == FNR {T[$1]++; next} FNR == 1 {print $1, $2, "CNT", $3, $4; next} {print $1, T[$1], $2}' file file
Column A CNT Column B
Apples 2 1900
Apples 2 1901
Pears 2 1902
Pears 2 1903

pshields1984 · March 23, 2016, 1:58pm

Thank you so much, I am almost there. Long-time lurker, first-time poster, so apologies for not quoting the code correctly. Can you explain the command? I don't really need the column headers, which should make things more straightforward.

RudiC · March 23, 2016, 2:24pm

It's two passes across the same file, which is why the file name is given twice: the first pass counts the occurrences of each first field, and the second prints the fields plus the count.
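The two-pass idea can be tried out with a small sketch like the one below. It assumes a header-less copy of the sample data; the file name test.txt and the array name count are made up for illustration and are not from the thread.

```shell
# Header-less sample data; test.txt is a made-up file name for this sketch.
printf '%s\n' 'Apples 1900' 'Apples 1901' 'Pears 1902' 'Pears 1903' > test.txt

# NR counts records across all input files; FNR restarts at 1 for each file.
# The two are equal only while awk reads the first file argument.
awk '{ print "NR=" NR, "FNR=" FNR, (NR == FNR ? "first pass" : "second pass") }' test.txt test.txt

# First pass: tally column 1. Second pass: print column 1, its tally, column 2.
counted=$(awk 'NR == FNR { count[$1]++; next } { print $1, count[$1], $2 }' test.txt test.txt)
printf '%s\n' "$counted"
```

The first awk call only prints the counters so the switch from the first to the second pass is visible; the second call is the counting itself.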

Aia · March 23, 2016, 2:39pm

If you do not need the header:

awk 'NR == FNR {T[$1]++; next} FNR > 1 {print $1, T[$1], $2}' pshields1984.input pshields1984.input

NR == FNR {T[$1]++; next}      # runs only during the first pass: count each value of the first column
FNR > 1 {print $1, T[$1], $2}  # second pass: skip the header line and print the count from the first pass after the first column
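The two-pass form needs a file that awk can read twice. If the data arrives on a pipe instead, one alternative is to buffer the lines and print them from the END block. This is not from the thread; it is a sketch that assumes the whole input fits in memory, and the array names key, val and count are made up for illustration.

```shell
# Single pass over piped input: store every data line, then replay it in END.
result=$(
printf '%s\n' 'Column A Column B' 'Apples 1900' 'Apples 1901' 'Pears 1902' 'Pears 1903' |
awk 'NR > 1 {                 # skip the header line
         key[NR] = $1         # remember the first field of this line
         val[NR] = $2         # and the second field
         count[$1]++          # tally occurrences of the first field
     }
     END {
         for (i = 2; i <= NR; i++)      # replay the lines in input order
             print key[i], count[key[i]], val[i]
     }'
)
printf '%s\n' "$result"
```

The trade-off is memory: the two-pass version keeps only the counts, while this one keeps every line until the end of input.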

Some Perl code that could be more flexible.

#!/usr/bin/perl

use strict;
use warnings;

my $filename = shift or die "Usage: $0 FILENAME\n";
my %tally;

open my $fh, '<', $filename or die "Could not open $filename: $!\n";

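# Read and discard the header line, then remember where the data starts.
<$fh>;
my $data_position = tell $fh;

# First pass: count how often each first-column value occurs.
while (my $entry = <$fh>) {
    my ($id) = split '\s+', $entry;
    $tally{$id}++;
}

# Rewind to the first data line for the second pass.
seek $fh, $data_position, 0;

# Second pass: insert the count after the first field and print the line.
while (my $entry = <$fh>) {
    my @fields = split '\s+', $entry;
    splice @fields, 1, 0, $tally{$fields[0]};
    print "@fields\n";
}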
close $fh;

Save as tally.pl
Run as perl tally.pl pshields1984.input

pshields1984 · March 23, 2016, 7:34pm

Thank you RudiC. It worked a charm.

---------- Post updated at 06:34 PM ---------- Previous update was at 02:18 PM ----------

Thanks, Aia. I'll give the Perl suggestion a go.