Sort and Unique in Perl

Hi,

If a pipe-separated file is large, what is the best method to count the unique values in the 3rd column and to get a list of those unique values?

Thanks in advance!

Read the file line by line and use a hash to collect the unique values of the 3rd column.

Could you kindly explain with one simple example?

Maybe really simple:

cat $file|awk -F\| '{print $3}'|sort -u

(I think some awk guru can do it with fewer commands)...

I saw too late that you meant Perl, sorry.

I can handle with cut, sort -u, and wc commands.

But looking for perl methods!

read the file
split the record
use the third field
populate in a hash => this would maintain uniqueness
when displaying use sort keys %hash

awk -F"|" '{ print $3 }' file | sort -u

Thanks for your input. I am new to Perl; is it possible to give one simple example, please?

sample code and file
try this

>cat b
1|2|3
4|9|4
3|1|2
>cat b.pl
#! /opt/third-party/bin/perl

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";

while(<FILE>) {
  chomp;
  my @arr = split(/\|/);
  $fileHash{$arr[2]}++;   # index 2 is the 3rd field (arrays start at 0)
}

close(FILE);

foreach my $k ( sort keys %fileHash ) {
  print "$k\n";
}

exit 0

cut -d'|' -f3 file_name | sort -u

Almost same as MatrixMadhan's. Row count was not included, so the following code is just for completeness:

#!/usr/bin/perl -w
use strict;

# Program to get unique values for 3rd column and print them

open(FILE, "b.txt") or die "Cannot open b.txt: $!";

my %list = ();

while(<FILE>){
   chomp;
   my @array = split(/\|/);
   $list{$array[2]}++;
}

close(FILE);

# Print out the results

foreach my $value (sort keys %list) {
   print "The unique values are $value\n";
}

print "Number of unique values: ".scalar(keys(%list))."\n";

Good examples already, this is just a more compact form of the same thing:

#!/usr/bin/perl
use warnings;
use strict;
unless ($ARGV[0]) {
    die "Usage: perl scriptname.pl filename";
}
my %list = ();
while(<>){
   $list{(split(/\|/))[2]}++;
}
print "$_ = $list{$_}\n" for (keys %list);
exit(0);

It uses Perl's built-in <> filehandle and no temporary variables, so it should be reasonably fast and light on memory.

Thanks everyone.
Thanks MatrixMadhan and MobileUser!!
I am able to make use of the same code.
Thanks KevinADC. Just one thing: the count is coming out for each entry of the hash. I would like to have one count at the end, so that I can write the unique keys to a file and the count to a separate header file.

Thanks again.
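
For the "one count at the end" part, here is a rough sketch rather than a finished script. It assumes the 3rd column is wanted, reads the three-line sample from earlier in the thread out of a string, and writes to two output files whose names (unique_keys.txt and header.txt) are made up for illustration. The sub name unique_column is also hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the distinct values of one pipe-separated column (0-based index).
sub unique_column {
    my ($fh, $col) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\|/, $line;
        next unless defined $fields[$col];   # skip short or empty lines
        $seen{ $fields[$col] }++;
    }
    return sort keys %seen;
}

# Demo input: the sample file from earlier, held in a string.
my $sample = "1|2|3\n4|9|4\n3|1|2\n";
open(my $in, '<', \$sample) or die "Cannot open in-memory sample: $!";
my @unique = unique_column($in, 2);
close($in);

# Hypothetical output names: one file for the keys, one for the single count.
open(my $keys_out, '>', 'unique_keys.txt') or die "Cannot write unique_keys.txt: $!";
print {$keys_out} "$_\n" for @unique;
close($keys_out);

open(my $hdr_out, '>', 'header.txt') or die "Cannot write header.txt: $!";
print {$hdr_out} scalar(@unique), "\n";
close($hdr_out);
```

For a real file, open it with open(my $in, '<', $ARGV[0]) instead of the in-memory string; the rest stays the same.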

I am using a lot of other packages in the script, so when I try to use it, it gives me an error:

Global symbol "%list" requires explicit package name at b.pl line 19

How do I declare the hash variable in the script before it is used?

Sorry, it was a typo in my code. It worked after declaring it earlier.
Thanks!
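
For anyone else who hits that "Global symbol" error under use strict, a minimal sketch of the fix: declare the hash with my before the first line that touches it. The three input lines and the variable name $unique_count are made up for illustration.

```perl
use strict;
use warnings;

my %list;                        # declared once, before any use

for my $line ("1|2|3", "4|9|4", "3|1|3") {
    my $third = (split /\|/, $line)[2];
    $list{$third}++;             # fine: %list is already in scope here
}

my $unique_count = keys %list;   # keys in scalar context gives the number of keys
print "Unique values: $unique_count\n";
```

If the my declaration sits inside a block (a loop or an if), the hash is only visible inside that block, which produces the same error further down.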

One more silly question: how do I get the max value of %list?
And if I add another field to the existing hash key, will it be composite or individual?
Say I want to find the max of arr[2] and also need the unique values and count of arr[1]; is there a way to use the same hash?
Or is it better to use a different hash variable?

Thanks.

To find the max value (the highest count I assume) the easiest thing would be to sort the hash by the values:

# keys ordered by ascending count, so the last one has the highest count
my @sorted = sort { $list{$a} <=> $list{$b} } keys %list;
print "$sorted[-1]\n";

It depends on how you "include" another field into the hash, by field I assume you mean a hash key with a value. Keep in mind that perl hashes do not have indices like perl arrays, there is no such thing as hash[2]. Hashes are not ordered lists like arrays are.
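
Two small sketches that may help, both with made-up data. The first finds the key with the highest count in one pass, as an alternative to sorting the whole hash. The second shows one way to build a composite key by joining two fields into a single string, which is an assumption about what "include another field" means here.

```perl
use strict;
use warnings;

my %list = (76 => 2, 81 => 5, 70 => 1);   # made-up counts

# Track the key with the highest count in a single pass over the hash.
my ($max_key, $max_count);
for my $key (keys %list) {
    if (!defined $max_count || $list{$key} > $max_count) {
        ($max_key, $max_count) = ($key, $list{$key});
    }
}
print "$max_key occurs $max_count times\n";

# Composite key: fields 2 and 3 joined into one string.
# Each distinct pair of values becomes its own hash entry.
my %pairs;
for my $line ("26|2007-04-16|76", "26|2007-04-16|76", "26|2007-04-18|81") {
    my @f = split /\|/, $line;
    $pairs{ join '|', @f[1, 2] }++;
}
print "$_ = $pairs{$_}\n" for sort keys %pairs;
```

A composite key counts combinations, not the individual fields, so if you need per-field results (the max of one field and the unique values of another) separate variables are usually simpler.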

Thanks for your input.

Here is my requirement.

I need to find the max of a date field, which is the 2nd field. I also need the unique values of the 5th field (Item no) and the number of unique items in that field.

Pls help.

Post some lines of the data you are parsing, and show what results you want from those lines.

Thanks for the reply.
Here is the example of the input file.

26|2007-04-16|76
26|2007-04-18|81
26|2007-04-19|70
26|2007-04-20|84
26|2007-04-21|75
26|2007-04-22|57
26|2007-04-16|109
26|2007-04-18|114
26|2007-04-19|129
26|2007-04-20|157

I want the result to be

Max Date	Item No	Total Count of Item
2007-04-22	109	10
2007-04-22	114	10
2007-04-22	129	10
2007-04-22	157	10
2007-04-22	57	10
2007-04-22	70	10
2007-04-22	75	10
2007-04-22	76	10
2007-04-22	81	10
2007-04-22	84	10

Right now I am storing the item no in a hash and, while doing that, manually checking the max date condition using if ($CurrentDate > $MaxDate).... logic.
And I am getting the total count like

my $TotalRows = keys(%fileHash1);

Let me know if another simple, efficient method is available.

Thanks in advance.
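
One possible single-pass approach, offered as a sketch under a few assumptions: the date is the 2nd field and the item no is the 3rd (as in the sample lines above, not the 5th), the dates are always in YYYY-MM-DD form, and the total is the number of unique items. The sub name summarize is hypothetical. Dates are compared with gt, the string operator, because YYYY-MM-DD strings sort chronologically as text, whereas > would compare them as numbers.

```perl
use strict;
use warnings;

# Returns the latest date followed by the sorted list of unique items.
sub summarize {
    my ($fh) = @_;
    my ($max_date, %items);
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $date, $item) = split /\|/, $line;
        next unless defined $item;                 # skip malformed lines
        $max_date = $date if !defined $max_date || $date gt $max_date;
        $items{$item}++;
    }
    return ($max_date, sort keys %items);
}

# The ten sample lines from the post above, held in a string.
my $data = join '', map { "$_\n" } qw(
    26|2007-04-16|76  26|2007-04-18|81  26|2007-04-19|70
    26|2007-04-20|84  26|2007-04-21|75  26|2007-04-22|57
    26|2007-04-16|109 26|2007-04-18|114 26|2007-04-19|129
    26|2007-04-20|157
);

open(my $fh, '<', \$data) or die "Cannot open in-memory sample: $!";
my ($max_date, @items) = summarize($fh);
close($fh);

my $total = scalar @items;    # number of unique items
print join("\t", 'Max Date', 'Item No', 'Total Count of Item'), "\n";
print join("\t", $max_date, $_, $total), "\n" for @items;
```

The items are sorted as strings, which is why 109 comes before 57, the same order as in the wanted result. Use sort { $a <=> $b } instead if numeric order is preferred.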