Perl basic multiple pattern matching

sinusoid · November 1, 2010, 2:47pm

Hi everyone, and thank you for your help with this. I am VERY new to Perl, so all of your help is appreciated. I have tried Google, but I don't know the proper terms to search for, which is daunting for a newbie scripter... I know this is very easy for most of you! Thanks!

I have a multi-gig file of the repeated format:

<form name="profileForm" action="/profile.php" method="post">		

<style type="text/css">

.required {
	font-size: small;
	color: #f00;
}

</style>

<table border="0" cellpadding="3" cellspacing="1">
<tr>
	<td>&nbsp;</td>
	<td class="label">First Name</td>
	<td class="label">Last Name</td>
</tr>
<tr valign="top">
	<td class="label">Name: <span class="required">*</span></td>
	<td class="row"><input type="text" maxlength="20" name="firstName" size="20" value="su"  /></td>
	<td class="row"><input type="text" maxlength="20" name="lastName" size="20" value="chingping"   />
	<input type="hidden" name="customerNumber" value=""  /></td>
</tr>
<tr valign="top">
	<td class="label">Job Title: <span class="required">*</span></td>
	<td colspan="2" class="row"><input type="text" maxlength="30" name="jobTitleOther" size="30" value="miss"  /></td>
</tr>

<tr valign="top">
	<td class="label">Company: <span class="required">*</span></td>
	<td colspan="2" class="row"><input type="text" maxlength="30" name="company" size="30" value="omd"  /></td>
</tr>

I want to use Perl to read in this text file, "out1.txt", parse it into the values (first name, last name, job title, company, etc.) and output them to a CSV file.

I know that each of these values occurs within a specific pattern, e.g. the "Company" value I want will always be in <td colspan="2" class="row"><input type="text" maxlength="30" name="company" size="30" value="HERE" /></td>. The other values occur in the same place in similar strings. I know ALL fields will exist for each "person".

Is there a good script that is already written that is close OR can someone help me formulate from this to perl :

Open file
Read in each line 
While new line exists,
If pattern is (for example) 	<td class="row"><input type="text" maxlength="20" name="firstName" size="20" value="su"  /></td>  
 output the "firstName" value to the first csv field, or if pattern 
is <td colspan="2" class="row"><input type="text" maxlength="30" name="company" size="30" value="HERE" output value "HERE"  to the third csv field,

I am just looking for a basic framework for one or two sequential patterns, the while loop, etc.

The problem for me is matching values at a specific location in multiple known strings, in sequential order, and putting them into a CSV file.

Thanks for your help!

turk451 · November 2, 2010, 1:12am

Just one of a million Perl solutions:

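#!/usr/bin/perl
use strict;
use warnings;

my ($line, @splitLine);
open(IN, "<", "out1.txt") || die("Could not open infile: $!");
open(OUT, ">", "extract.txt") || die("Could not open outfile: $!");
# Read one line at a time; foreach (<IN>) would load the whole multi-gig file into memory first.
while ($line = <IN>) {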
  if (rindex($line,"firstName") > -1) {
    @splitLine = split(/"/, $line);
    print(OUT $splitLine[11].",");
  } elsif (rindex($line,"lastName") > -1) {
    @splitLine = split(/"/, $line);
    print(OUT $splitLine[11].",");
  } elsif (rindex($line,"jobTitleOther") > -1) {
    @splitLine = split(/"/, $line);
    print(OUT $splitLine[13].",");
  } elsif (rindex($line,"company") > -1) {
    @splitLine = split(/"/, $line);
    print(OUT $splitLine[13]."\n");
  }
}

sinusoid · November 2, 2010, 8:32am

trying now ---- you are a LIFE saver.

---------- Post updated at 08:32 AM ---------- Previous update was at 08:00 AM ----------

okay -- so quick question so I can modify

Can someone quickly explain

for

if (rindex($line,"firstName") > -1) {
   @splitLine = split(/"/, $line);

Is it indexing the last character position in "firstName" and then splitting on that, or is the /"/ in split(/"/, $line) a regular expression? Not sure.

durden_tyler · November 2, 2010, 9:00am

Yet another Perl solution:

$
$ # show the content of the input data file "f0"
$ cat f0
<form name="profileForm" action="/profile.php" method="post">
<style type="text/css">
.required {
        font-size: small;
        color: #f00;
}
</style>
<table border="0" cellpadding="3" cellspacing="1">
<tr>
        <td>&nbsp;</td>
        <td class="label">First Name</td>
        <td class="label">Last Name</td>
</tr>
<tr valign="top">
        <td class="label">Name: <span class="required">*</span></td>
        <td class="row"><input type="text" maxlength="20" name="firstName" size="20" value="su"  /></td>
        <td class="row"><input type="text" maxlength="20" name="lastName" size="20" value="chingping"   />
        <input type="hidden" name="customerNumber" value=""  /></td>
</tr>
<tr valign="top">
        <td class="label">Job Title: <span class="required">*</span></td>
        <td colspan="2" class="row"><input type="text" maxlength="30" name="jobTitleOther" size="30" value="miss"  /></td>
</tr>
<tr valign="top">
        <td class="label">Company: <span class="required">*</span></td>
        <td colspan="2" class="row"><input type="text" maxlength="30" name="company" size="30" value="omd"  /></td>
</tr>
$
$ # run the Perl one-liner on the file "f0"
$ perl -lne 'if (/.*name="(firstName|lastName|jobTitleOther|company)".*?value="(.*?)"/) {
               $x .= ",$2";
               if ($1 eq "company") {print substr($x,1); $x=""}
             }' f0
su,chingping,miss,omd
$
$
$
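
The one-liner above joins the values with plain commas, so a value that itself contains a comma would shift the columns. A minimal sketch of the same idea as a script, assuming you want such values quoted; csv_field and write_csv are hypothetical helper names, and the sample data is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap a field in quotes when it contains a comma,
# a double quote, or a line break, doubling any embedded quotes.
sub csv_field {
    my ($value) = @_;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq("$value");
    }
    return $value;
}

# Hypothetical helper: read $in line by line and print one CSV row to $out
# each time the "company" field (the last one in a record) is seen.
sub write_csv {
    my ($in, $out) = @_;
    my @fields;
    while (my $line = <$in>) {
        next unless $line =~ /name="(firstName|lastName|jobTitleOther|company)".*?value="(.*?)"/;
        my ($name, $value) = ($1, $2);
        push @fields, csv_field($value);
        if ($name eq 'company') {
            print {$out} join(',', @fields), "\n";
            @fields = ();
        }
    }
    return;
}

# Demo on an in-memory file; the sample values are made up.
my $sample = <<'END_HTML';
<td class="row"><input type="text" name="firstName" size="20" value="su"  /></td>
<td class="row"><input type="text" name="lastName" size="20" value="chingping"   />
<input type="hidden" name="customerNumber" value=""  /></td>
<td colspan="2" class="row"><input type="text" name="jobTitleOther" size="30" value="miss"  /></td>
<td colspan="2" class="row"><input type="text" name="company" size="30" value="omd, inc"  /></td>
END_HTML

open(my $in, '<', \$sample) or die "Cannot open in-memory input: $!";
write_csv($in, \*STDOUT);   # prints: su,chingping,miss,"omd, inc"
close($in);
```

For the real file you would open "out1.txt" for reading and a CSV file for writing, and pass those two handles instead. Because the loop reads one line at a time, memory use stays flat even on a multi-gig input.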

tyler_durden

---------- Post updated at 09:00 AM ---------- Previous update was at 08:39 AM ----------

sinusoid:

...
Can someone quickly explain

for
if (rindex($line,"firstName") > -1) {
   @splitLine = split(/"/, $line);
 
Is it indexing the last character position in "firstName" and then splitting on that, or is the /"/ in split(/"/, $line) a regular expression?

The "rindex" function in this expression:

rindex (str, substr)

returns the position of the last (i.e. rightmost) occurrence of substr in str.
If substr doesn't exist in str, then it returns -1.
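
A small sketch of those return values, alongside the related "index" function, which reports the leftmost occurrence instead. The helper name contains is hypothetical, and the sample line is shortened from the data above:

```perl
use strict;
use warnings;

# Hypothetical helper: true when $needle occurs anywhere in $haystack.
sub contains {
    my ($haystack, $needle) = @_;
    return rindex($haystack, $needle) > -1;
}

print index("abcabc", "bc"), "\n";    # 1  (leftmost match)
print rindex("abcabc", "bc"), "\n";   # 4  (rightmost match)
print rindex("abcabc", "xyz"), "\n";  # -1 (no match)

my $line = '<input type="text" name="firstName" value="su" />';
print contains($line, "firstName") ? "yes" : "no", "\n";  # yes
print contains($line, "lastName")  ? "yes" : "no", "\n";  # no
```

For a plain "does it occur" test, index and rindex give the same answer; they differ only in which position they report.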

So the condition -

if (rindex($line,"firstName") > -1) {

checks if the rightmost index of "firstName" in $line is greater than -1. In other words, it checks if "firstName" exists in $line.

If it does, then this statement -

   @splitLine = split(/"/, $line);

splits $line on the literal double-quotes character and assigns the tokens (or split elements) to the array "@splitLine".

As an example:

@x = split (/:/, "abc:def:ghijk:l");

will split the string "abc:def:ghijk:l" on the literal colon character (":") and assign the split elements to the array @x. So, after that operation, @x will have-

"abc" at index 0,
"def" at index 1,
"ghijk" at index 2 and
"l" at index 3.

The first argument to split, written between "//", is a regex, so it can be a pattern rather than a single literal character. For instance, if the string you want to split is "a b c d e", and the number of spaces between the elements is variable, then you can use a regex in the split condition like so:

$
$
$ perl -le '@x = split(/[ ]+/, "a       b  c    d      e"); print $_ foreach (@x)'
a
b
c
d
e
$
$

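You could also write the separator as a quoted string instead of "//", e.g. split('"', $line). Perl still treats that string as a regex pattern, so the result is the same here.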
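
A short sketch of how a quoted-string separator behaves, using made-up input strings. It shows that the string form gives the same result for a double quote, that regex metacharacters such as "." still need escaping, and that a single-space string is a special case that splits on runs of whitespace:

```perl
use strict;
use warnings;

# A quoted string and a regex literal give the same result for a plain character.
my @regex  = split(/"/, 'a"b"c');
my @string = split('"', 'a"b"c');
print "@regex | @string\n";          # a b c | a b c

# The string is still a pattern: '.' matches any character, so every
# character is a separator and only empty fields remain (which split drops).
my @dot_string = split('.', 'a.b.c');
print scalar(@dot_string), "\n";     # 0
my @dot_escaped = split(/\./, 'a.b.c');
print "@dot_escaped\n";              # a b c

# Special case: a single-space string splits on any run of whitespace
# and skips leading whitespace.
my @words = split(' ', '  a   b c');
print "@words\n";                    # a b c
```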

After $line is split on double-quotes and assigned to @splitLine, the value of "firstName" is the element at index 11 of that array (the twelfth element, since indexes start at 0).
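
To see where the 11 (and the 13 used for the job title and company lines) comes from, here is a small sketch that prints the tokens for two of the sample lines. The helper name quote_tokens is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return the tokens produced by splitting on double quotes.
sub quote_tokens {
    my ($line) = @_;
    return split(/"/, $line);
}

my $first   = qq(\t<td class="row"><input type="text" maxlength="20" name="firstName" size="20" value="su"  /></td>\n);
my $company = qq(\t<td colspan="2" class="row"><input type="text" maxlength="30" name="company" size="30" value="omd"  /></td>\n);

my @tokens = quote_tokens($first);
# Print only the odd indexes, which hold the text between each pair of quotes.
for my $i (grep { $_ % 2 } 0 .. $#tokens) {
    print "$i: $tokens[$i]\n";
}
# 1: row
# 3: text
# 5: 20
# 7: firstName
# 9: 20
# 11: su

# The extra colspan="2" attribute adds two tokens, so the value moves to index 13.
my @company_tokens = quote_tokens($company);
print "company: $company_tokens[13]\n";   # company: omd
```

This is also why the index approach is fragile: if an attribute is added, removed, or reordered, the position of the value changes.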

HTH,
tyler_durden

turk451 · November 3, 2010, 8:31pm

I like Tyler's solution better