I have a bit of an issue trying to obtain some data from a csv file using PERL. I can sort the file and remove any duplicates leaving only 4 or 5 rows containing data. My problem is that the data contained in the original file contains a lot more columns and when I try ro run this script it finds that all the data is unique.
I have the following fields within the orignal file:
PROGRAM ID,OP,PROBE_CARD ,DEVREVSTEP,TEST_START,TESTER_ID
The data which I need to obtain and sort is within the op,probecard and tester_id fields.
How can I go about this?
The code that I use after manually deleting the fields that i do not require is as follows:
#!/usr/bin/perl -w
use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'sorted.csv';
my $fieldnames = 1;
open (IN, "<$csvfile") or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";
my $header;
$header = <IN> if $fieldnames;
my @data = sort <IN>;
print OUT $header;
my $n = 0;
my $lastline = '';
foreach my $currentline (@data) {
next if $currentline eq $lastline;
print OUT $currentline;
$lastline = $currentline;
$n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";
That's not a CSV file, it means literally comma-separated values. Traditionally CVS also uses optional double quotes around the values, and then, depending on which wicked version of Excel it's supposed to be incompatible with, more or less insane conventions for values with double quotes in them. Newlines inside a field are also sometimes permitted in CSV.
Is the data you want always in fields 3, 4, and 7?
I guess the actual problem is that you are comparing the lines before throwing out the fields you don't want to include in the comparison. Before you print, copy the print value to a hash, and if the hash already contains the value, don't print. (Or take the Perl script I gave you and pipe to sort -u -- that's what I would do.)
That's a somewhat idiosyncratic script, I'm not sure which parts are suitable for a person of your age.
(2, 3, 6) is a list. Map takes the function you give it and applies it to each item on the list in turn, and returns the result as a new list. Within map, $_ refers to the current item. So the map returns $F[2] followed by $F[3] followed by $F[6]. The rest is just formatting.
Oh, the final crucial point is the option -a (mnemonic: more like awk) which says to split the input into @F and, -F "\t" which says to do that splitting on the tab character.
If you want development ideas for your own script, think about the following.
The Unix toolkit design dictates that tools should read standard input and print to standard output. I'd suggest you try to think about how to change your script to a general-purpose tool. Start by throwing out the file handling -- the invoker can specify which file to read, and where to redirect the output.
How about an "uncut" tool? ("un" is supposed to stand for "unique", here.) Like cut, but only print the first occurrence of a given combination of field values. (Then expand it to accept other field separators than tab.)
This is just one out of many possibilities, but you should always be thinking about how to generalize your work. Over the years, you build up a collection of tools which suit your way of thinking, your way of working.
Your script doesn't do any actual parsing of the input yet -- the effect of the -a option can be duplicated with @F = split /\t/. I imagine you can take it from there.
I have put the split in as a while loop...but it still doesnt work...I have done this correctly?
#!/usr/bin/perl -w
use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'new.csv';
my $fieldnames = 1;
open (IN, "<$csvfile") or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";
my $header;
$header = <IN> if $fieldnames;
my @data = sort <IN>;
while( <IN> ) {
push @data, join "\t", (split /\t/)[4,5,8];
}
print OUT $header;
my $n = 0;
my $lastline = '';
foreach my $currentline (@data) {
next if $currentline eq $lastline;
print OUT $currentline;
$lastline = $currentline;
$n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";
Looks correct, superficially, but the sort needs to happen after the split; sort @data when you've finished reading the input. And you attempt to read <IN> twice, but taking out the line where you assign @data = sort <IN> will fix that, too. (You need to leave the my @data declaration, of course.)
The field numbers don't match what was discussed earlier but I guess you know what you are doing there. Or maybe not? They are off by two so maybe you adjusted them the wrong way. Array indices are zero-based in Perl so the first field is [0], the second is [1], the third is [2], and so forth.
I removed the sort from the code and ran the script. Now what happens is that when it has been ran the number of duplicates are now zero...the complete opposite to what was happening before. I looked at the split and yes it was a mistype...it looks like I am doing something wrong as I know that the field for the operation and the tester do have duplicates
#!/usr/bin/perl -w
use strict;
my $csvfile = 'probecards.csv';
my $newfile = 'new.csv';
my $fieldnames = 1;
open (IN, "<$csvfile") or die "Couldn't open input CSV file: $!";
open (OUT, ">$newfile") or die "Couldn't open output file: $!";
my $header;
$header = <IN> if $fieldnames;
my @data ;
while( <IN> ) {
my (${LAO_START_WW},${PROGRAM},${ID},${OP},${PROBE_CARD},${DEVREVSTEP},${TEST_START},${TESTER_ID} ) = split(',',$_);
my $tempString = "${OP},${PROBE_CARD},${TESTER_ID}";
push @data,$tempString;
}
print OUT $header;
my $n = 0;
my $lastline = '';
foreach my $currentline (@data) {
next if $currentline eq $lastline;
print OUT $currentline;
$lastline = $currentline;
$n++;
}
close IN; close OUT;
print "Processing complete. In = " . scalar @data . " records, Out = $n records\n";
The only issue that I have now is that the header fields stay the same in the output fiile while the data is split...but I will try and work this out
Thanks for all your help era and pointing me in the right direction