Itinerate through HTML table

valigula · March 22, 2010, 12:49pm

Hi all,

<html>
<body>
  <div>
  <table id="orderList">
    <thead>
      <tr>
        <th>order number</th>
        <th>order type</th>
        <th>product type</th>
        <th>status</th>
        <th>status date</th>
      </tr>
    </thead>
    <tbody>
      <tr class="odd">
        <td><span id="orderLink">24978900</a></td>
        <td><span id="orderType">Provide</span></td>
        <td><span id="productType">Prod1</span></td>
        <td><span id="status">Complete</span></td>
        <td><span id="statusDate">18/12/09</span></td>
        <td><span id="bucket"></span></td>
      </tr><tr class="even">
        <td><span id="orderLink">27004805</a></td>
        <td><span id="orderType">Cease</span></td>
        <td><span id="productType"></span></td>
        <td><span id="status">Rejected</span></td>
        <td><span id="statusDate">17/02/10</span></td>
      </tr>
    </tbody>
  </table>

</div>
</body>
</html>

The desired result is:

24978900

That is, the last order whose status is "Complete". Order numbers are sequential, so newer orders are always at the bottom.

Thanks

durden_tyler · March 22, 2010, 1:24pm

Here's one way to do it with Perl -

$
$ cat -n f5
     1  <html>
     2  <body>
     3    <div>
     4    <table id="orderList">
     5      <thead>
     6        <tr>
     7          <th>order number</th>
     8          <th>order type</th>
     9          <th>product type</th>
    10          <th>status</th>
    11          <th>status date</th>
    12        </tr>
    13      </thead>
    14      <tbody>
    15        <tr class="odd">
    16          <td><span id="orderLink">24978900</a></td>
    17          <td><span id="orderType">Provide</span></td>
    18          <td><span id="productType">Prod1</span></td>
    19          <td><span id="status">Complete</span></td>
    20          <td><span id="statusDate">18/12/09</span></td>
    21          <td><span id="bucket"></span></td>
    22        </tr><tr class="even">
    23          <td><span id="orderLink">27004805</a></td>
    24          <td><span id="orderType">Cease</span></td>
    25          <td><span id="productType"></span></td>
    26          <td><span id="status">Rejected</span></td>
    27          <td><span id="statusDate">17/02/10</span></td>
    28        </tr>
    29      </tbody>
    30    </table>
    31
    32  </div>
    33  </body>
    34  </html>
$
$ perl -lne 'BEGIN{undef $/}while (/.*<tr.*?"orderLink">(\d+)<.*?>Complete<.*?\/tr>.*/msg){print $1}' f5
24978900
$
$

tyler_durden

valigula · March 22, 2010, 1:31pm

Actually, after playing a little with awk, I found this:

$ awk -F"[><]" '/orderLink/ { f=1; ord=$5 } f && /id="status"/ { f=0; if ($5 == "Complete") print ord ", " $5 }' /tmp/9054329.htm | tail -1

Thanks
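
A note on why `$5` works in the awk command above: with `-F"[><]"` every `<` and `>` is a field separator, so on a cell line the text between the tags lands in field 5. A minimal sketch to see the split, assuming the one-cell-per-line layout of the sample file (`show_fields` is a made-up helper name):

```shell
#!/bin/sh
# show_fields: print every field awk sees when splitting on '<' or '>'.
# The function name is hypothetical, used only for this illustration.
show_fields() {
    awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

# One cell line copied from the table in the first post.
printf '%s\n' '        <td><span id="status">Complete</span></td>' | show_fields
```

On that line field 4 is `span id="status"` and field 5 is `Complete`. This depends on the exact layout: a cell with extra nested tags, or several cells on one line, would shift the field numbers.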

drewk · March 22, 2010, 1:36pm

Is the HTML file local (i.e., really a file) or on the web (i.e., something you need wget to fetch)?

This Perl regex would get the number you want from that data:

/<html>.*?<td><span id="orderLink">(.*?)<\/a>/s

What do you mean by "itinerate"? What are the conditions for what you are looking for or rejecting?

Please be more specific.

durden_tyler · March 22, 2010, 1:37pm

On similar lines as the awk script -

$
$ perl -lne '/.*orderLink">(\d+)<.*/ and $x=$1; /.*>Complete<.*/ and print $x' f5
24978900
$

tyler_durden

valigula · March 22, 2010, 7:49pm

Hi drewk,

The file is already on my local machine: I first use wget to log in and download the page I need. I did it this way mainly because I did not know how to do it online (without downloading the file). Some people mentioned using links; maybe for the next version.
Sorry about my misspelling, "itinerate".

Thanks

drewk · March 22, 2010, 8:46pm

OK -- itinerate

Try either of Tyler's Perl scripts with wget or curl:

curl "http://www.yururl.com" | perl -lne 'BEGIN{undef $/}while (/.*<tr.*?"orderLink">(\d+)<.*?>Complete<.*?\/tr>.*/msg){print $1}'

That will download and itinerate
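
If only the last "Complete" order is needed, the same idea can be done in awk without slurping the file and without `tail`. A rough sketch, assuming the layout of the sample page (one cell per line, `id="orderLink"` before `id="status"` in each row); `last_complete` is a hypothetical name, not something from the scripts above:

```shell
#!/bin/sh
# last_complete: read the order table HTML on stdin and print the number of
# the last order whose status cell is exactly "Complete".
# Assumes the sample layout, where the cell text is field 5 after
# splitting each line on '<' and '>'.
last_complete() {
    awk -F'[><]' '
        # Remember the order number of the row being read.
        /id="orderLink"/ { ord = $5 }
        # Later matches overwrite earlier ones, so the final value is the last match.
        /id="status"/ && $5 == "Complete" { last = ord }
        # Print nothing at all when no row matched.
        END { if (last != "") print last }
    '
}
```

It reads stdin, so it should work both as `last_complete < f5` and at the end of the curl pipeline. Matching `id="status"` with the closing quote keeps the `statusDate` cell out, and comparing `$5` exactly avoids hits on the word Complete elsewhere on the page.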

valigula · March 22, 2010, 8:51pm

Thanks, I will have a look.

valigula · March 25, 2010, 6:37pm

There is a new requirement. I was asked not to search for status = Complete, but for every status other than Rejected.

Can this be done using the current awk?

$ awk -F"[><]" ' /orderLink/ { f=1; _ord=$5; } f && /Rejected/ {
_sta=$5; f=0; print _ord ","}' f1 | tail -1

Thanks in advance

durden_tyler · March 25, 2010, 8:44pm

I don't quite understand this statement. Do you want to fetch orderLinks -
(a) with "Rejected" status?
(b) with statuses other than "Complete" and "Rejected"?
(c) with statuses other than "Rejected"?

I shall assume that you want (a).

Just try it on your HTML and see for yourself!
You have your HTML file, you have your awk script; what's stopping you from testing it out?

Here's what I see when I run it on the HTML file you supplied in your first post -

$ 
$ awk -F"[><]" ' /orderLink/ { f=1; _ord=$5; } f && /Rejected/ { _sta=$5; f=0; print _ord ","}' f5
27004805,
$

Is this what you wanted?

In any case, you could probably simplify the script thusly -

awk -F"[><]" '/orderLink/ {f=1; ord=$5} f && /Rejected/ {print ord}' f5

tyler_durden
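
For option (c), statuses other than "Rejected", one possible approach is to split the job in two: first print every row as `order,status`, then filter on the status column. This is only a sketch under the same layout assumptions as the sample file; `orders_with_status` and `last_not_rejected` are made-up names:

```shell
#!/bin/sh
# orders_with_status: read the order table HTML on stdin and print one
# "order,status" line per table row (hypothetical helper name).
orders_with_status() {
    awk -F'[><]' '
        /id="orderLink"/ { ord = $5 }          # order number of the current row
        /id="status"/    { print ord "," $5 }  # emit the row once its status is seen
    '
}

# last_not_rejected: keep rows whose status column is not exactly "Rejected"
# and print the order number of the last one (hypothetical helper name).
last_not_rejected() {
    orders_with_status | awk -F, '$2 != "Rejected" { print $1 }' | tail -n 1
}
```

On the sample file, `last_not_rejected < f5` should give 24978900. Note that a row with an empty status cell counts as "not Rejected" here, which may or may not be what you want.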

valigula · March 26, 2010, 8:36am

Sorry for my terrible writing.

First I was asked to search for orderLinks with status = Complete. But there were too many exceptions (other statuses to be considered), so now I would rather match anything "different than" Rejected instead.

In my first example I need to retrieve:
24978900
I added a grep at the end of the awk:

awk -v telf="$1" -F"[><]" '/orderLink/ { f=1; _ord=$5 } f && /productType/ { _pro=$5 } f && /status/ { f=0; print telf "," _ord ", " _pro "," $5 }' "/tmp/$1.htm" | grep -v Rejected

That returns all NOT Rejected,