field separator in Perl

ahsog · April 9, 2009, 6:20am

Is there a parameter you can set in Perl similar to FS in awk?
I think I've read all the tutorials on the subject, but I cannot get the map/split approach to work.
I need to sort a file by columns, e.g. first, third, fifth...
The script I need to add this column sorting to is this:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>; 
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiou>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @sorted  = sort{ normalize($a) cmp normalize($b) or $a cmp $b}  @not_sorted;
print @sorted;
close (_file_);

ghostdog74 · April 9, 2009, 11:50am

Maybe you can find the answer by running: perldoc perlvar

ahsog · April 9, 2009, 12:11pm

thanks, I see i have some reading to do...

quirkasaurus · April 9, 2009, 12:25pm

i couldn't find it....

i recommend:

( @a_junk ) = split( /\|/, $line );

or whatever else other than pipe you need.

ghostdog74 · April 9, 2009, 12:56pm

For Perl 5.8, you might be able to do it with these command-line switches:

# perl --help

Usage: perl [switches] [--] [programfile] [arguments]
......
  -a              autosplit mode with -n or -p (splits $_ into @F)
.....................................
  -F/pattern/     split() pattern for -a switch (//'s are optional)
.................
  -n              assume "while (<>) { ... }" loop around program
  -p              assume loop like -n but print line also, like sed

KevinADC · April 9, 2009, 1:02pm

A field separator option, -F, exists only on the command line; it tells split() where to split each line. You use it in conjunction with the -a option.

In a script you have to define that yourself by providing the split function with an argument or allowing split to use its default argument, which is whitespace. See the split() function for more details.
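
As a minimal sketch of what that looks like inside a script (the sample rows and variable names are made up for illustration), the separator is simply the first argument to split(), much as you would set FS in awk:

```perl
use strict;
use warnings;

# Hypothetical sample rows; "&" plays the role of awk's FS.
my @lines = ("bbc&aaa&ccc\n", "mmn&ddd&eee\n");

my @first_column;
for my $line (@lines) {
    chomp $line;                   # drop the trailing newline
    my @F = split /&/, $line;      # @F mirrors the array that -a fills in
    push @first_column, $F[0];     # $F[0] corresponds to awk's $1
}

print "$_\n" for @first_column;    # prints bbc, then mmn
```

On the command line, something like perl -F'&' -lane 'print $F[0]' path-to-file should be roughly equivalent.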

Edit: looks like ghostdog already posted while I was finding the command line switches.

ahsog · April 9, 2009, 7:38pm

thank you all, I'll give it a try tonight.
But how do I tell Perl which columns to work on? It is a latex file with tabular, thus having "&" as field separator. Is it something like "@1", "@2" etc.?

KevinADC · April 9, 2009, 8:39pm

Perl arrays start at index 0 (zero), so the first column of a split line is element 0, the second is element 1, and so on. A very brief example:

@array = (1 , 2 , 3 , 4);
print $array[0];#prints 1
$array[2] = 'foo'; #changes 3 to foo
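
Applied to a tabular row, picking the first, third and fifth columns could look like the sketch below. The row contents and variable names are invented for the example; only split() and the array slice are the point.

```perl
use strict;
use warnings;

# Hypothetical tabular row with "&" between the cells.
my $row = 'one&two&three&four&five';

my @cells = split /&/, $row;

# Indexes count from 0, so columns 1, 3 and 5 are elements 0, 2 and 4.
my ($first, $third, $fifth) = @cells[0, 2, 4];

print "$first $third $fifth\n";    # one three five
```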

quirkasaurus · April 10, 2009, 11:25am

say the input line looks like this:

A|aardvark|ant

( $letter, $long_nm, $short_nm ) = split( /\|/, $line );

works, as does:

( @a_junk ) = split( /\|/, $line );

print "1st element: $a_junk[0] \n";
print "2nd element: $a_junk[1] \n";
print "3rd element: $a_junk[2] \n";

ahsog · April 10, 2009, 7:38pm

Thank you all. To tell the truth, I'm getting quite confused. I think I have to do some serious reading of the man pages, because I still cannot figure out how to make this split work with my posted script.
Say I have this file:

bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
uu&&
uu&&
&&

Actually, I need the rows to be unchanged, since they are rows of a database query, so I just need to sort by the first column.
If I change my script like this:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>; 
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiou>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @splitted = split(/&/, @not_sorted);
my @sorted =
sort {normalize($a) cmp normalize($b) or $a cmp $b}
 $splitted[0];
print @sorted;
close (_file_);

if I run it I get this output:

16$

which is the count of the rows.
Please point me in the right direction to find the solution.

ghostdog74 · April 10, 2009, 8:41pm

If it's a database query, it will be easier to sort the rows as part of the query.

KevinADC · April 10, 2009, 11:30pm

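Well, this is wrong, because split() expects a string as its second argument; an array there is evaluated in scalar context, so you split the row count (16) rather than the rows: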

my @splitted = split(/&/, @not_sorted);

Maybe this is what you are trying to do:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>; 
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiouu>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @sorted = sort {normalize($a) cmp normalize($b) || $a cmp $b} @not_sorted;
print @sorted;
close (_file_);

Note you should use "||" instead of "or" in the sort routine.
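
To see why passing the whole array to split() only gives back the row count, here is a small sketch with made-up rows. It contrasts that call with splitting each row on its own:

```perl
use strict;
use warnings;

my @rows = ("bbc&aaa&aaa", "mmn&aaa&ccc", "lmn&bbb&aaa");

# split() wants a string, so the array is evaluated in scalar context,
# which yields its element count: this splits the string "3".
my @wrong = split /&/, @rows;

# Splitting each row separately gives one array reference per row.
my @right = map { [ split /&/, $_ ] } @rows;

print "wrong: @wrong\n";              # wrong: 3
print "first cell: $right[0][0]\n";   # first cell: bbc
```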

ahsog · April 11, 2009, 5:43am

Thank you again. For some reason, I didn't even think of custom sorting in my database (postgres), I'll check it out. But since I have to do some changes on the file when it comes out, I thought to do it together, and it is a good way to start learning Perl:). In fact, in the database the š is written as s'

But, KevinADC, the sorting worked just fine even with "or". What I need is for Perl to stop comparing at the first "&", otherwise I don't get good results, because the typical line is:

alibumbi&\begin{CJK}{UTF8}{}\begin{SChinese}\end{SChinese}\end{CJK}&bithe alibuha&\begin{CJK}{UTF8}{}\begin{SChinese}\end{SChinese}\end{CJK}&56, 64\\

Or should I resort to awk? But I could not find help on custom sorting in awk.

ahsog · April 11, 2009, 11:34am

I've finally got the thing to work. Thank you all.
Here's the code.

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiou>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ normalize($a->[1]) cmp normalize($b->[1]) or $a->[1] cmp $b->[1]}
        map {chomp;[$_,split(/&/)]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

I did try this before, but it would not work because I was missing the round brackets here:

normalize($a->[1])
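
A likely reason the brackets matter, sketched with a made-up function shout() standing in for normalize(): a user-defined sub called without parentheses is parsed as a list operator, so it swallows everything to its right, including the cmp.

```perl
use strict;
use warnings;

sub shout { return uc $_[0] }    # hypothetical stand-in for normalize()

# With parentheses each call gets exactly one argument.
my $with = shout('a') cmp shout('b');      # 'A' cmp 'B' is -1

# Without them this parses as shout('a' cmp shout('b')),
# i.e. shout('a' cmp 'B'), i.e. shout(1).
my $without = shout 'a' cmp shout 'b';     # the string "1"

print "$with $without\n";    # -1 1
```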

This is the test file:

bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
uu&&
uu&&
&&

this is the output of the script:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
ššš&&
uuu&&
uu&&
uu&&
&&
zzz&&

I'm a happy man:D

KevinADC · April 11, 2009, 12:38pm

Use || instead of "or" in the sort. "or" has much lower precedence than ||, and in some circumstances, depending on how the code is written, "or" will not work properly because of that.
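
A small sketch of where the two operators differ; first_key() and second_key() are invented stand-ins for a comparison that ties and its tie-breaker:

```perl
use strict;
use warnings;

sub first_key  { return 0 }     # stands in for a comparison that ties
sub second_key { return -1 }    # stands in for the tie-breaker

# "||" binds tighter than "=", so the whole expression is assigned.
my $with_pipes = first_key() || second_key();    # -1

# "or" binds looser than "=", so this parses as
# (my $with_or = first_key()) or second_key();
my $with_or = first_key() or second_key();       # 0

print "$with_pipes $with_or\n";    # -1 0
```

Inside the sort block in this thread nothing is assigned and cmp binds tighter than both operators, so there either spelling returns the same comparison result.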

KevinADC · April 11, 2009, 12:44pm

You did well using the Schwartzian Transform to sort the data, but your code doesn't take advantage of key caching, which makes the sort more efficient by calculating each sort key only once. Here it is modified to cache the sort keys:

use strict;
use warnings;
open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <_file_>;
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiouu>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ $a->[1] cmp $b->[1]}
        map {chomp;[$_,normalize((split(/&/))[1]) ]} @not_sorted;
print "$_\n" for @sorted;
close (_file_);

ahsog · April 11, 2009, 6:09pm

I take everything back, it still does not work. I tried to change [1] to [2] to see if it sees the "&", but I got this:

Use of uninitialized value in string comparison (cmp) at zs line 14, <_file_> line 16.
(the same warning is printed 27 times in total)
ššš&&
sss&&aaa
zzz&&
uuu&&
šas&&
saš&&
cab&&
uu&&
uu&&
&&
bbc&aaa&aaa
mmn&aaa&ccc
aaa&aaa&bbb
lmn&bbb&aaa
aaa&bbb&ccc
aaa&ccc&ddd

I've also tried it on the real file, and it does not work properly.

With your modifications I get this:

ššš&&
sss&&aaa
zzz&&
uuu&&
šas&&
saš&&
cab&&
uu&&
uu&&
&&
bbc&aaa&aaa
mmn&aaa&ccc
aaa&aaa&bbb
lmn&bbb&aaa
aaa&bbb&ccc
aaa&ccc&ddd

I also tried downloading Sort::Fields from CPAN, but I cannot make it work the way I expect. Sometimes you really feel ignorant.

KevinADC · April 11, 2009, 7:38pm

Seems to work for me:

use strict;
use warnings;
#open (_file_, "< path-to-file")  or  die "Failed to read file : $! ";
my @not_sorted = <DATA>;
sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr<aeiouu>
   <aeiouu>;
   $in =~ tr<abcdefghijklmnopqrsštuvwxyz>
   <\x01-\x1B>;
   return $in;
}
my @sorted = map {$_->[0]}
        sort{ $a->[1] cmp $b->[1]}
        map {chomp; [$_,normalize((split(/\&/))[0])]} @not_sorted;
print "$_\n" for @sorted;
#close (_file_);
__DATA__
bbc&aaa&aaa
mmn&aaa&ccc
lmn&bbb&aaa
aaa&ccc&ddd
ššš&&
sss&&aaa
zzz&&
aaa&bbb&ccc
aaa&aaa&bbb
uuu&&
šas&&
saš&&
cab&&
uuu&&
uuu&&
uuu&&

output:

aaa&ccc&ddd
aaa&bbb&ccc
aaa&aaa&bbb
bbc&aaa&aaa
cab&&
lmn&bbb&aaa
mmn&aaa&ccc
saš&&
sss&&aaa
šas&&
šš&&
uuu&&
uuu&&
uuu&&
uuu&&
zzz&&

ahsog · April 11, 2009, 8:17pm

I'm starting to understand.
But please explain to me the use of [1] and [0]:

        sort{ $a->[1] cmp $b->[1]}
        map {chomp; [$_,normalize((split(/\&/))[0])]} @not_sorted;

I find it a bit confusing. [0] is the first column from the left, [1] is the second and so on, right?
Why did you write [0] in the last map line and [1] in the sort line?

But it works! Also on the real file.
:D:D:D

KevinADC · April 12, 2009, 1:17pm

This is really the line that makes it all work:

map {chomp; [$_,normalize((split(/\&/))[0])]} @not_sorted;

What happens is that each line from @not_sorted is stored in an anonymous array; that's the stuff inside the square brackets []. First, each line is chomp()'d. Then a copy of the line, $_, is stored in the first position [0] of the anonymous array (that's what the final map block, the one written first, returns to the @sorted array). Then the line is split on "&", and just the first field of the split, [0], is sent to the normalize() function. What normalize() returns is stored in the second position of the anonymous array, [1]. So the two indexes refer to different arrays: the [0] in the map line picks the first field out of the split list, while the [1] in the sort line picks the cached key out of the anonymous array. Now all the sort keys are stored in the second position of the anonymous arrays (the cached keys), and that is what gets compared in the sort block. I hope that is clear; if not, ask again and I will try to explain. I do have an article posted on another forum that tries to explain the technique in more detail:

Sorting Data with the Schwartzian Transform - bytes
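
The same technique can be written out one stage at a time, which may make the two indexes easier to follow. This is only a sketch: the rows are made up, and lc() stands in for the normalize() function from the thread.

```perl
use strict;
use warnings;

my @lines = ("mmn&aaa&ccc", "bbc&aaa&aaa", "aaa&ccc&ddd");

# Step 1: pair each line with its sort key (the first "&" field).
#         [0] is the whole line, [1] is the cached key.
my @pairs = map { [ $_, lc( (split /&/, $_)[0] ) ] } @lines;

# Step 2: sort the pairs by the cached key in position [1].
my @ordered = sort { $a->[1] cmp $b->[1] } @pairs;

# Step 3: throw the keys away and keep the original lines from [0].
my @sorted = map { $_->[0] } @ordered;

print "$_\n" for @sorted;    # aaa&ccc&ddd, bbc&aaa&aaa, mmn&aaa&ccc

# Changing (split ...)[0] to (split ...)[1] would sort by the second column.
```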