Extracting records with unique fields from a fixed width txt file

sitney · February 9, 2008, 1:32am

Greetings,

I would like to extract records from a fixed width text file that have unique field elements.

Data is structured like this:

John      A Smith      NY
Mary      C Jones      WA
Adam      J Clark      PA
Mary        Jones      WA

Fieldname / start-end position
Firstname 1-10
MI 11-12
Lastname 13-23
State 24-25

I want to compare firstname and lastname fields exclusively and output the unique records to a new file:
John A Smith NY
Adam J Clark PA

Any assistance would be greatly appreciated.

KevinADC · February 9, 2008, 3:56am

Your requirements are a bit vague, but here is a possible perl solution:

#!/usr/bin/perl
use warnings;
use strict;
#use Data::Dumper; #uncomment for debugging
unless (scalar @ARGV == 2){
   die "Usage: perl scriptname.pl inputfile outputfile\n";
}

my $outfile = pop @ARGV;
my %names = ();
my %count = ();

while (<>){
   chomp;
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   $names{"$first,$last"}={count => ++$count{"$first,$last"},
                           name => "$first $mi $last $state",
                          };
}

#print Dumper \%names; #uncomment for debugging  

open my $out, '>', $outfile or die "Cannot open $outfile: $!";

foreach my $person (keys %names) {
    next if $names{$person}{count}>1;
    print $out $names{$person}{name},"\n";
}

close $out;

print STDOUT "finished\n";
exit(0);

Usage:

perl scriptname.pl path/to/inputfile path/to/outputfile
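
For readers wondering where the "a10a2a11a2" template comes from, here is a minimal sketch that derives it from the start-end positions given in the question. The @layout array is a hypothetical helper, not part of the script above, and the sketch assumes the fields are contiguous with no gaps between them.

```perl
use strict;
use warnings;

# Field layout from the question: name, start column, end column
# (1-based, inclusive). @layout is a hypothetical helper for illustration.
my @layout = (
    [ Firstname => 1,  10 ],
    [ MI        => 11, 12 ],
    [ Lastname  => 13, 23 ],
    [ State     => 24, 25 ],
);

# Width of each field is end - start + 1; "aN" takes N characters as-is.
my $template = join '', map { 'a' . ( $_->[2] - $_->[1] + 1 ) } @layout;

print "$template\n";    # prints a10a2a11a2
```

If the layout had gaps between fields, the template would also need skip codes such as "x", which this sketch does not handle.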

sitney · February 9, 2008, 10:44am

KevinADC - I really appreciate your response here.

It works! When I run your perl script, I get these results:
$ cat newnames.txt
John A Smith NY
Adam J Clark PA

Despite my vague requirements, you understood them perfectly.

I am trying to decipher the workhorse part of the script you wrote:

while (<>){
chomp;
#Assign variables to fixed width sections using unpack.
my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);

#Remove whitespace from variables.
(s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);

#Please describe what is going on here.
$names{"$first,$last"}={count => ++$count{"$first,$last"},
name => "$first $mi $last $state",
};
}

Thanks again.

KevinADC · February 9, 2008, 3:43pm

I'll try....

$names{"$first,$last"} creates a hash key from the first and last name.

Its value is in turn a hash (strictly speaking, a reference to an anonymous hash):

$names{"$first,$last"} = {count=>'' , name => '' };

The "count" key's value comes from another hash, %count, which keeps track of how many times each first,last combination is found:

++$count{"$first,$last"}

so we can determine later if it is a unique combination or not. If it has a count of 1 (one) then it is unique.

The "name" key's value is the record, reassembled from the trimmed fields, which we print to the output file if the value of the "count" key is 1 (one).

You can uncomment the lines that say to "uncomment for debugging" and you will see the data structure of %names printed when the script finishes running.
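
To make the shape of %names concrete, here is a small self-contained sketch that builds it from the four sample records and prints one line per key. The padding of the sample lines is an assumption, rebuilt with sprintf from the widths stated in the question, and the trimming regex is a slightly condensed variant of the one in the script.

```perl
use strict;
use warnings;

# Sample records padded to the stated widths (10, 2, 11, 2).
# The exact padding is an assumption; the thread's sample lost its spacing.
my @records = map { sprintf "%-10s%-2s%-11s%-2s", @$_ } (
    [ 'John', 'A', 'Smith', 'NY' ],
    [ 'Mary', 'C', 'Jones', 'WA' ],
    [ 'Adam', 'J', 'Clark', 'PA' ],
    [ 'Mary', '',  'Jones', 'WA' ],
);

my ( %names, %count );
for my $line (@records) {
    my ( $first, $mi, $last, $state ) = unpack "a10a2a11a2", $line;
    s/^\s+|\s+$//g for $first, $mi, $last, $state;    # trim each field in place
    my $key = "$first,$last";
    $names{$key} = {
        count => ++$count{$key},                 # running total for this key
        name  => "$first $mi $last $state",      # last record seen for this key
    };
}

# One line per first,last key, showing the inner hash's two values.
for my $key ( sort keys %names ) {
    printf "%-12s count=%d name=%s\n",
        $key, $names{$key}{count}, $names{$key}{name};
}
```

Running it shows three keys, with "Mary,Jones" carrying a count of 2, which is why the output loop in the script skips it.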

KevinADC · February 9, 2008, 3:51pm

You have here:

#Remove whitespace from variables.

That part actually removes leading and trailing spaces from the list of variables. If there are internal spaces they are kept because names can have spaces in them, and if you removed the internal spaces you could potentially create false matches, example:

John W "Van Johnson" (last name in quotes to show it is one field)

John W VanJohnson

This is probably a rare circumstance (and not a very good example) but it is possible, especially if the names are not in English.
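
A short sketch of the difference, using a hypothetical padded value for the last-name field: trimming only the ends keeps "Van Johnson" distinct, while removing every space collapses it into "VanJohnson".

```perl
use strict;
use warnings;

# Hypothetical field value with padding on both sides.
my $padded = "  Van Johnson  ";

# Strip leading and trailing whitespace only; the internal space survives.
( my $trimmed = $padded ) =~ s/^\s+|\s+$//g;

# Strip all whitespace; the two-word name becomes one word.
( my $squeezed = $padded ) =~ s/\s+//g;

print "[$trimmed]\n";     # prints [Van Johnson]
print "[$squeezed]\n";    # prints [VanJohnson]
```

With the second approach, a record for "Van Johnson" and one for "VanJohnson" would share a key and be counted as duplicates.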

sitney · February 10, 2008, 12:45am

You said,

I am crystal clear with this clarification. Thanks.

However, the hash structure you used

$names{"$first,$last"}={count => ++$count{"$first,$last"},
name => "$first $mi $last $state",
};

is so compact and does so much, that even with your description, it remains beyond my full grasp at this stage of my perl newbishness.

Even though I don't fully grasp this data structure, I can use it, modify it, and apply it. So thanks again KevinADC!

KevinADC · February 10, 2008, 1:05am

You're welcome. Actually that data structure could have been a bit simpler:

while (<>){
   chomp;
   my ($first,$mi,$last,$state) = unpack("a10a2a11a2",$_);
   (s/^\s*//, s/\s*$//) for ($first,$mi,$last,$state);
   $names{"$first,$last"}{count}++;
   $names{"$first,$last"}{name} = "$first $mi $last $state";
}

This eliminates the need for the separate hash to keep track of the counts. I like to use the separate hash for counts because in general data is much more complex than this, and incrementing a count can be much easier if it is kept separate.
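
The simpler form works because Perl creates the inner hash and its "count" entry the first time they are used (autovivification). A minimal sketch with hypothetical, already-trimmed keys:

```perl
use strict;
use warnings;

my %names;

# Hypothetical keys standing in for "$first,$last" values from the file.
for my $key ( "Mary,Jones", "John,Smith", "Mary,Jones" ) {
    # No setup needed: the inner hash and its count spring into
    # existence on first use, then ++ takes the count from undef to 1.
    $names{$key}{count}++;
}

# Keep only the keys that were seen exactly once.
my @unique = grep { $names{$_}{count} == 1 } sort keys %names;

print "@unique\n";    # prints John,Smith
```

No "uninitialized value" warning is raised here, since ++ on an undefined value is exempt from that warning.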

KevinADC · February 10, 2008, 1:08am

A tutorial on complex data structures:

perldsc - perldoc.perl.org

sitney · February 10, 2008, 2:18am

   $names{"$first,$last"}{count}++;

This is indeed much simpler looking. I will take a stab at describing this line: This hash idiom associates each $first,$last element with its count.

 $names{"$first,$last"}{name} = "$first $mi $last $state";

This hash idiom associates the full record ($first $mi $last $state) with each $first,$last instance in hash %names.

If my language is not correct in this description, then my thinking is also incorrect. I am going to reread this article on hashes by Simon Cozens:
perl.com: Hash Crash Course and try to get a fuller understanding.

Thanks.