CSV File parse help in Perl

Folks,

I have a bit of an issue trying to obtain some data from a CSV file using Perl. I can sort the file and remove any duplicates, leaving only 4 or 5 rows containing data. My problem is that the original file contains a lot more columns, and when I try to run this script it finds that all the data is unique.

I have the following fields within the original file:
PROGRAM ID,OP,PROBE_CARD,DEVREVSTEP,TEST_START,TESTER_ID

The data which I need to obtain and sort is within the OP, PROBE_CARD and TESTER_ID fields.

How can I go about this?

The code that I use after manually deleting the fields that I do not require is as follows:

#!/usr/bin/perl -w
use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'sorted.csv';
my $fieldnames = 1;
open (IN, "<$csvfile")  or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";
my $header;
$header = <IN> if $fieldnames;
my @data = sort <IN>;
print OUT $header;
my $n = 0;
my $lastline = '';
foreach my $currentline (@data) {
  next if $currentline eq $lastline;
  print OUT $currentline;
  $lastline = $currentline;
  $n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";

Sample CSV File:


Original File:

PROGRAM	ID	OP	PROBE_CARD	DEVREVSTEP	TEST_START	TESTER_ID
12630M196	139	2660	S25E3N36	88BCRA	16/03/2008 12:05	IN01
12630M196	1	2660	S25E3N36	88BLBHD	16/03/2008 13:04	IN04
12630M196	508	2660	S25E3N36	88BCRA	16/03/2008 13:41	IN01
12630M196	437	2660	S25E3N36	88CLNHC	16/03/2008 14:18	IN01
12630M196	465	2660	S25E3N36	88BCRA	16/03/2008 15:34	IN02
12630M196	27	2660	S25E3N36	88BCRA	16/03/2008 18:00	IN01
12630M196	18	2660	S25E3N36	88BCRA	16/03/2008 19:03	IN03


Output:

OPERATION PROBE_CARD_ID	TESTER_ID
2660	S25E3N21	IN04
2660	S25E3N27	IN02
2660	S25E3N36	IN01
2660	S25E3N39	IN03
2660	S25E3N40	IN04

Any pointers on how I could go about this would be greatly appreciated.

Rgds

Colin

That's not really a CSV file; CSV literally means comma-separated values. Traditionally CSV also uses optional double quotes around the values, and then, depending on which wicked version of Excel it's supposed to be incompatible with, more or less insane conventions for values with double quotes in them. Newlines inside a field are also sometimes permitted in CSV.

Is the data you want always in fields 3, 4, and 7?

cut -f3,4,7 infile

If you require this in Perl, how about

perl -naF"\t" -e 'print (join ("\t", map { $F[$_] } (2, 3, 6)), "\n")'

I guess the actual problem is that you are comparing the lines before throwing out the fields you don't want to include in the comparison. Before you print, copy the print value to a hash, and if the hash already contains the value, don't print. (Or take the Perl script I gave you and pipe to sort -u -- that's what I would do.)
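Applied to your script, the hash idea might look roughly like this. It's just a sketch: the file name is the one from your post, and the field positions 3, 4 and 7 (indices 2, 3, 6) are an assumption based on the sample layout above.

#!/usr/bin/perl -w
use strict;

# Remember every combination of the wanted fields that has already
# been printed; skip a line if its combination has been seen before.
open my $in, '<', 'probecards.csv' or die "Couldn't open input: $!";
my %seen;
<$in>;                               # skip the header line
while ( my $line = <$in> ) {
    chomp $line;
    my $key = join "\t", ( split /\t/, $line )[ 2, 3, 6 ];
    print "$key\n" unless $seen{$key}++;
}
close $in;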

Where would I put the line in the code?

cut -f3,4,7 infile | sort -u

That's the whole program.

Sorry to be annoying, but how does the following Perl code work?

perl -naF"\t" -e 'print (join ("\t", map { $F[$_] } (2, 3, 6)), "\n")'

If I put it into my code, would the { $F[$_] } part need to be changed to { $F[$csvfile] }?

I am new to Perl so it's a learning experience for me :o

rgds

colin

That's a somewhat idiosyncratic script; I'm not sure which parts are suitable for a person of your age.

(2, 3, 6) is a list. map takes the block you give it, applies it to each item in the list in turn, and returns the results as a new list. Within the block, $_ refers to the current item. So the map returns $F[2] followed by $F[3] followed by $F[6]. The rest is just formatting.

Oh, the final crucial point is the option -a (mnemonic: more like awk), which says to split each input line into @F, and -F"\t", which says to do that splitting on the tab character.
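Spelled out as an ordinary script, the one-liner does more or less this (a sketch of the idea, not literally what -n and -a generate internally; I've added a chomp so the last field comes out clean):

#!/usr/bin/perl -w
use strict;

while ( my $line = <> ) {                 # -n: loop over every input line
    chomp $line;                          # strip the trailing newline
    my @F = split /\t/, $line;            # -a with -F"\t": split on tabs into @F
    # Pick out the 3rd, 4th and 7th fields and join them back with tabs
    print join( "\t", map { $F[$_] } ( 2, 3, 6 ) ), "\n";
}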

Thanks for your help... sorry for annoying you with my questions.

Hey, no problem.

If you want development ideas for your own script, think about the following.

The Unix toolkit design dictates that tools should read standard input and print to standard output. I'd suggest you try to think about how to change your script to a general-purpose tool. Start by throwing out the file handling -- the invoker can specify which file to read, and where to redirect the output.

How about an "uncut" tool? ("un" is supposed to stand for "unique", here.) Like cut, but only print the first occurrence of a given combination of field values. (Then expand it to accept other field separators than tab.)

This is just one out of many possibilities, but you should always be thinking about how to generalize your work. Over the years, you build up a collection of tools which suit your way of thinking, your way of working.

Your script doesn't do any actual parsing of the input yet -- the effect of the -a option can be duplicated with @F = split /\t/. I imagine you can take it from there.
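To give you an idea of the shape, a bare-bones "uncut" might look something like the following. Only a sketch: the separator is fixed to tab and the field list is hard-coded rather than taken from the command line, which would be the obvious next step.

#!/usr/bin/perl -w
use strict;

# "uncut": like cut, but print each combination of the selected fields
# only the first time it occurs.  Reads standard input and writes
# standard output, so it can sit in a pipeline.
my @fields = ( 2, 3, 6 );     # 0-based field positions; make these options later
my %seen;
while ( my $line = <STDIN> ) {
    chomp $line;
    my @F = split /\t/, $line;
    my $key = join "\t", @F[@fields];
    print "$key\n" unless $seen{$key}++;
}

You would then run it like any other filter, e.g. ./uncut < infile, or behind a cut or grep in a pipeline.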

I have put the split in as a while loop... but it still doesn't work... Have I done this correctly?

#!/usr/bin/perl -w

use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'new.csv';
my $fieldnames = 1;
open (IN, "<$csvfile")  or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";
my $header;
$header = <IN> if $fieldnames;
my @data = sort <IN>;
while( <IN> ) {
    push @data, join "\t", (split /\t/)[4,5,8];
}
print OUT $header;
my $n = 0;
my $lastline = '';
foreach my $currentline (@data) {

  next if $currentline eq $lastline;
  print OUT $currentline;
  $lastline = $currentline;
  $n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";

Looks correct, superficially, but the sort needs to happen after the split; sort @data when you've finished reading the input. And you attempt to read <IN> twice, but taking out the line where you assign @data = sort <IN> will fix that, too. (You need to leave the my @data declaration, of course.)

The field numbers don't match what was discussed earlier but I guess you know what you are doing there. Or maybe not? They are off by two so maybe you adjusted them the wrong way. Array indices are zero-based in Perl so the first field is [0], the second is [1], the third is [2], and so forth.
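A quick way to see the numbering against your sample header, as a throwaway one-liner (just an illustration, assuming the seven tab-separated columns you posted):

perl -e 'my @F = split /\t/, "PROGRAM\tID\tOP\tPROBE_CARD\tDEVREVSTEP\tTEST_START\tTESTER_ID"; print "$F[2] $F[3] $F[6]\n"'

which prints: OP PROBE_CARD TESTER_ID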

Hi era,

I removed the sort from the code and ran the script. Now what happens is that when it has been run, the number of duplicates is zero... the complete opposite of what was happening before. I looked at the split and yes, it was a mistype... it looks like I am doing something wrong, as I know that the fields for the operation and the tester do have duplicates.

You are removing duplicates with the next if ...

I think that I have it sorted now.

code is as follows:

#!/usr/bin/perl -w

use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'new.csv';
my $fieldnames = 1;
open (IN, "<$csvfile")  or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";

my $header;
$header = <IN> if $fieldnames;
my @data;
while ( <IN> ) {
  my ( ${LAO_START_WW}, ${PROGRAM}, ${ID}, ${OP}, ${PROBE_CARD}, ${DEVREVSTEP}, ${TEST_START}, ${TESTER_ID} ) = split( ',', $_ );
  my $tempString = "${OP},${PROBE_CARD},${TESTER_ID}";
  push @data, $tempString;
}
print OUT $header;
my $n = 0;
my $lastline = '';

foreach my $currentline (@data) {

  next if $currentline eq $lastline;
  print OUT $currentline;
  $lastline = $currentline;
  $n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";

The only issue that I have now is that the header fields stay the same in the output file while the data is split... but I will try and work this out.

Thanks for all your help, era, and for pointing me in the right direction.
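In case it helps with that last header issue: the header line can be run through the same kind of split as the data rows. A sketch, with the column list taken from the names your last script splits into (so OP, PROBE_CARD and TESTER_ID sit at indices 3, 4 and 7):

#!/usr/bin/perl -w
use strict;

# Reduce a comma-separated header line to the same three columns
# that the data rows are reduced to.
my $header = "LAO_START_WW,PROGRAM,ID,OP,PROBE_CARD,DEVREVSTEP,TEST_START,TESTER_ID";
print join( ',', ( split /,/, $header )[ 3, 4, 7 ] ), "\n";   # prints: OP,PROBE_CARD,TESTER_ID

In your script you would apply that to the $header you read from the file, instead of printing it unchanged.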