Perl: How do I remove leading non alpha characters

Juha · February 22, 2008, 1:15am

Hi,

Sorry for the silly question, but I'm trying to write a Perl script to process a log file that is in the following format:

(4)ab=1234/(10)bc=abcdef9876/cd=0....

The number in the brackets is the length of the field, and "/" is the field separator. Not every field has a leading bracket.

What I'm trying to do is print the log in format:

ab=1234
bc=abcdef9876
cd=0

So far I've written the code below:

#!/bin/perl

$LOGFILE = "/path/to/logfile/filename.txt";
open(LOGFILE) or die("Could not open log file.");
foreach $line (<LOGFILE>) {
    
    @splitted = split(/\//, $line);
    
    foreach $element (@splitted){
        print "$element\n";
    }
}
close(LOGFILE);

However this prints out the leading brackets as well.

How can I get rid of the leading brackets?

Also, a field may contain "/", e.g. "ef=a/b". How do I keep this from being misinterpreted as the field separator?

Thanks!

KevinADC · February 22, 2008, 2:09am

Please show some of the real data; there might be a clue in it to help figure out a rule for splitting the lines up correctly. From what you posted it looks like you could use the field "names" (ab, bc, cd) to help split the fields up correctly, but I have a feeling that is pseudo data, not real data.

What kind of log file is this? There might already be a module written that understands the log format.

KevinADC · February 22, 2008, 2:15am

Anyways, for the lines with no forward slash in the values:

#!/bin/perl
use strict;
use warnings;

my $logfile = "/path/to/logfile/filename.txt";
open(my $fh, '<', $logfile) or die "Could not open log file: $!";
while (<$fh>) {
   chomp;
   my @fields = split(/\//);   # split the line on the "/" separator
   s/^\(\d*?\)// for @fields;  # strip a leading "(n)" from each field
   print "$_\n" for @fields;
}
close($fh);

ghostdog74 · February 22, 2008, 3:45am

The simplest way to tackle this problem is at the source, by not using "/" as the field separator.

rikxik · February 22, 2008, 4:03am

Input File:

$ cat line.txt
(4)ab=1234/(10)bc=abcdef9876/cd=0
(4)ty=5234/(10)bc=abcdef9876/cd=0

Code:

perl -nle '/(\w+)=(\w+)/&&print "$1=$2"foreach split "/"' < line.txt

Output:

ab=1234
bc=abcdef9876
cd=0
ty=5234
bc=abcdef9876
cd=0

HTH

KevinADC · February 22, 2008, 2:28pm

rikxik:

Considering he said the values can contain a forward slash, it seems doubtful that this will work.

Juha · February 22, 2008, 7:45pm

Thanks for the good tips given already.

The data is just records of users accessing data. I don't think there are existing modules for this data, as the format is very specific to this log and not generally used.

The fields could contain e.g. mt=image/gif (the media type downloaded, which could be any MIME type really...)

There is also a field for the browser type, e.g. "bt=Mozilla/4"

A real example would look like this:

at=200802221200/cs=59278/(9)mt=image/gif/(9)bt=Mozilla/4...

This gives the time of access to the media (at), the content size (cs), the media type (mt) and the browser that was used to access the content (bt). There are about 100 different field names, and they are all 2-letter combinations followed by "=" and then the value, ending with the field separator "/", which unfortunately I can't change.

Maybe the data could be split somehow with the fieldnames like KevinADC suggested.

Thanks

KevinADC · February 22, 2008, 11:04pm

one possible way:

#!/bin/perl
use strict;
use warnings;

my $logfile = "/path/to/logfile/filename.txt";
open(my $fh, '<', $logfile) or die "Could not open log file: $!";
while (<$fh>) {
   chomp;
   s/\(\d+\)//g; # remove the (n) part
   s#/([a-z]{2}=)#:::$1#g; # convert delimiter to :::
   my @fields = split(/:::/); # split using new delimiter
   print "$_\n" for @fields;
}
close($fh);

Probably someone better with regexps can write something shorter and possibly more efficient using zero-width look ahead/behind assertions, which I am not too good with.
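
For what it's worth, here is a rough sketch of what a zero-width look-ahead version might look like. It splits on "/" only when the slash is followed by an optional "(n)" and a two-letter name plus "=". The sub name split_fields is made up for illustration, and the sketch assumes field names are always two lowercase letters, as described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one log line into
# "name=value" fields without rewriting the delimiter first.
sub split_fields {
    my ($line) = @_;
    # Split on "/" only if what follows looks like the start of a field:
    # an optional "(n)" prefix, two lowercase letters, then "=".
    my @fields = split m{/(?=(?:\(\d+\))?[a-z]{2}=)}, $line;
    s/^\(\d+\)// for @fields;    # drop the leading "(n)" prefix, if any
    return @fields;
}

print "$_\n" for split_fields('at=200802221200/cs=59278/(9)mt=image/gif/(9)bt=Mozilla/4');
# prints:
# at=200802221200
# cs=59278
# mt=image/gif
# bt=Mozilla/4
```

Like the ":::" version, this would still split in the wrong place if a value itself contained something like "/ab=".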

Juha · February 23, 2008, 12:52am

Thanks a lot!! that seems to work perfectly

rikxik · February 23, 2008, 10:10am

Input

$ cat line.txt
(8)xx=1234/xyz/at=200802221200/cs=59278/(9)mt=image/gif/(9)bt=Mozilla/4
(8)zz=9999/abc/at=200902221200/cs=59278/(9)mt=text/html/(9)bt=Mozilla/4

Code

perl -nle '(/^(\w+)=(\w+)/&&print "$1=$2")||(/^\w+$/&&print)||(/^\(/ && /(\w+)=(\w+)/&&printf "$1=$2/")foreach split /\//' < line.txt

Output

xx=1234/xyz
at=200802221200
cs=59278
mt=image/gif
bt=Mozilla/4
zz=9999/abc
at=200902221200
cs=59278
mt=text/html
bt=Mozilla/4
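
Another angle, as a sketch only: the original post says the number in brackets is the length of the field, and in every sample in this thread it matches the length of the value after "=" (e.g. "image/gif" is 9 characters). If that holds for the real log, the prefix can be used to read the value directly, so slashes inside it don't matter. The sub name parse_line is hypothetical, and the code assumes that fields without a "(n)" prefix never contain "/".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): parse one line using the
# "(n)" length prefixes.
# Assumptions: n is the length of the value after "=", field names are two
# lowercase letters, and values without a prefix contain no "/".
sub parse_line {
    my ($rest) = @_;
    my @fields;
    while (length $rest) {
        if ($rest =~ s/^\((\d+)\)([a-z]{2})=//) {
            my ($len, $name) = ($1, $2);
            # Cut exactly $len characters off the front as the value.
            push @fields, "$name=" . substr($rest, 0, $len, '');
        }
        elsif ($rest =~ s{^([a-z]{2}=[^/]*)}{}) {
            push @fields, $1;    # unprefixed field: value runs to the next "/"
        }
        else {
            die "Cannot parse line at: $rest\n";
        }
        $rest =~ s{^/}{};        # skip the field separator, if present
    }
    return @fields;
}

print "$_\n" for parse_line('(8)xx=1234/xyz/at=200802221200/cs=59278/(9)mt=image/gif/(9)bt=Mozilla/4');
# prints:
# xx=1234/xyz
# at=200802221200
# cs=59278
# mt=image/gif
# bt=Mozilla/4
```

This dies on anything it cannot parse instead of guessing, which seems safer for a log with about 100 field names.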