Extract data from html file

pkrabi78 · July 18, 2023, 8:29am

I have the content below in the file test.html:

<!DOCTYPE html>
<html>
<head>
<style>
table {
  font-family: arial, sans-serif;
  border-collapse: collapse;
  width: 100%;
}

td, th {
  border: 1px solid #dddddd;
  text-align: left;
  padding: 8px;
}

tr:nth-child(even) {
  background-color: #dddddd;
}
</style>
</head>
<body>

<h2>HTML Table</h2>

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
  <tr>
    <td>Ernst Handel</td>
    <td>Roland Mendel</td>
    <td>Austria</td>
  </tr>
  <tr>
    <td>Island Trading</td>
    <td>Helen Bennett</td>
    <td>UK</td>
  </tr>
  <tr>
    <td>Laughing Bacchus Winecellars</td>
    <td>Yoshi Tannamuri</td>
    <td>Canada</td>
  </tr>
  <tr>
    <td>Magazzini Alimentari Riuniti</td>
    <td>Giovanni Rovelli</td>
    <td>Italy</td>
  </tr>
</table>

</body>
</html>

I want to extract the data from the HTML file, like this:

Alfreds Futterkiste,Maria Anders,Germany
Centro comercial Moctezuma,Francisco Chang,Mexico
Ernst Handel,Roland Mendel,Austria
.
.etc

Thanks in advance.

munkeHoller · July 18, 2023, 9:00am

Can you show your code attempts, please?

Thanks

Paul_Pedant · July 18, 2023, 12:07pm

I think the opposite of a typo must be a reado. The question really says "content" (and does not seem to have been edited).

Nevertheless, the tags used in the HTML are crucial, so posting an image of the formatted query is of no use at all.

Or (sees the light) the markup in the HTML has been processed by posting it to the site, and I cannot see a way to get back the original html via an edit.

But "show your progress to date" is of course mandatory.

munkeHoller · July 18, 2023, 12:08pm

gracias, mea culpa, will edit my blurb - was on the mobile whilst reading ...

pkrabi78 · July 18, 2023, 2:32pm

I have tried the following, but no luck.

awk  -F '[<>]' '
/<td><b>Company<\/b><\/td>/ {                      ## To get company name 
    while (getline > 0 && /<td /) {
        gsub(/<b>/, ""); sub(/ .*/, "", $3)
        print $0
    }
    exit
}' mail1.html

Thanks

bendingrodriguez · July 18, 2023, 5:32pm

Hi @pkrabi78,

please provide a raw snippet of your HTML input, the CSS stuff isn't needed. Or create a small sample .html file and upload it.

pkrabi78 · July 18, 2023, 6:01pm

Thanks,
Please find below the contents of the html file:

Srvices properties

Property	Property Value
Rakesh	18-Jul-9999
Attended	0
Mail Received	Y
ENVIRONMENT	uat21

vgersh99 · July 18, 2023, 6:05pm

@pkrabi78,
please properly format your input data sample with markdown codes (described here).
We'll take another look at helping you once your previous post is properly formatted.
Good luck!

pkrabi78 · July 18, 2023, 7:59pm

html.txt (402 Bytes)

agama · July 19, 2023, 5:54am

This is the way that I'd approach it, at least to start. The solution pre-parses with sed to eliminate pesky carriage returns which your sample file had, and to break out HTML tags a bit. Breaking the tags out makes things a bit easier to parse, and should a bold tag be replaced by an ital tag in a field you won't need to specifically handle it.

#!/usr/bin/env bash
sed 's!\r!!; s!<!\n<!g; s!>!>\n!g;' $1  | awk '
        BEGIN {
           skip = 1        # start in skip mode
           ofs = ","       # output field separator
           cfs = ""        # current field separator
           tsep = ""       # token separator
        }

        NF == 0  { next }       # ignore empty lines

        /<\/TR/ || /<\/tr/ {    # close table row -- skip until next row
           skip = 1
           next
        }

        /<TR/ || /<tr/ {        # open row we start printing; turn skip off
           skip = 0
           if( cfs != "" ) {       # if current field sep not nil
                   printf( "\n" )  # finish the previous line
                   cfs = ""           # reset current sep
           }
           next
        }

        # uncomment to stop after first table
        #/<\/table/ || /<\/TABLE/ {
        #       exit( 0 );
        #}

        skip { next }

        /<TD/ || /<td/ {           # next column, print current field sep
           printf( "%s", cfs )
           cfs = ofs               # set sep for all remaining columns in row
           tsep = ""
           next
        }

        /</ { next }                    # all other tags are ignored

        {                               # not a tag, not skipping, print content
           printf( "%s%s", tsep, $0 )  # edited, the original code had the parameters reversed
           tsep = " "
        }

        END {                           # finish the last row with a newline
           if( cfs != "" ) printf( "\n" )
        }
'

Running the script will output all table rows as a comma-separated set of fields. It does not assume a header, so the header row will be printed too. It also assumes that trailing whitespace in a column (e.g. <b>property </b>) is significant, so it is not discarded. It also doesn't convert special characters which are escaped in the HTML (e.g. an ampersand), so it is just a start, depending on how robust you need it to be.
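On the robustness points above (escaped characters, whitespace, missing closing tags), one alternative is to let an HTML parser do the tokenizing. Below is a minimal sketch in Python using only the standard library's html.parser. It is not a port of the awk script: the names TableRows and table_to_csv are made up for illustration, and unlike the script it trims and collapses whitespace inside a cell, which is an assumption about what is wanted.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        # convert_charrefs=True makes the parser decode &amp; and friends.
        super().__init__(convert_charrefs=True)
        self.rows = []    # finished rows, each a list of cell strings
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TR> and <tr> look the same here.
        if tag == "tr":
            self.finish_row()        # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th"):
            self.finish_cell()       # tolerate a missing </td>
            if self.row is None:
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.finish_cell()
        elif tag in ("tr", "table"):
            self.finish_row()

    def handle_data(self, data):
        if self.cell is not None:    # text outside a cell is ignored
            self.cell.append(data)

    def finish_cell(self):
        if self.cell is not None:
            # Assumption: collapse runs of whitespace and trim the ends.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def finish_row(self):
        self.finish_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None


def table_to_csv(html_text):
    """Return all table rows in html_text as CSV text."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    parser.finish_row()              # flush a row left open at end of input
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Calling table_to_csv(open("report.html").read()) would return the rows as one string. The csv module quotes any field that itself contains a comma, which the plain printf approach does not do.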

Hope this gets you further in your efforts.

munkeHoller · July 19, 2023, 6:08am

@pkrabi78,

I edited your post and put the html within a code block (triple backticks)

```your html in here```

so that the raw content is visible, please do that going forwards.

pkrabi78 · July 19, 2023, 1:03pm

Thanks, but could you please confirm how to run the command, e.g. `command filename.html`?

agama · July 19, 2023, 1:22pm

If the script is html_parser.sh and your HTML file is report.html then you can run the script in one of these ways:

html_parser.sh report.html
html_parser.sh <report.html

If you have a command that generates the HTML and you want to pipe it directly into the script that is possible too:

html_gen_command | html_parser.sh

The $1 in the sed will read from the file name, if it is given on the command line, and will default to reading from standard input if the filename is not given. This gives the flexibility to use redirection to feed the HTML to the script.
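The same "file name if given, otherwise standard input" behaviour can be sketched in Python, in case you end up writing the parser there. This is only an illustration; read_input is a made-up name, and the UTF-8 encoding is an assumption about your file.

```python
import sys


def read_input(args):
    """Return the text of the file named in args[0], or of stdin if args is empty.

    Hypothetical helper mirroring how sed treats a missing file operand.
    """
    if args:
        # Assumption: the HTML file is UTF-8 encoded.
        with open(args[0], encoding="utf-8") as handle:
            return handle.read()
    return sys.stdin.read()
```

In a script you would call it as read_input(sys.argv[1:]), so both "script.py report.html" and "script.py <report.html" feed it the same text.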

Hope that makes sense, let me know if it does not.

bendingrodriguez · July 19, 2023, 6:25pm

Hi @pkrabi78,

here are some other suggestions:

#!/bin/bash

while read -r company && read -r contact && read -r country; do
    echo "$company,$contact,$country" | sed -r "s,</?td>,,g"
done < <(grep "<td>" infile)
# or
while read -r company && read -r contact && read -r country; do
    echo "$company,$contact,$country"
done < <(xmllint --xpath '//td/text()' infile)

#!/usr/bin/python3

import xml.etree.ElementTree as et

td = list(et.parse("infile").getroot().iter("td"))
for data in [td[i:i + 3] for i in range(0, len(td), 3)]:
    print(",".join(str(d.text) for d in data))

All suggestions are based on a fixed table width of 3, and the bash versions also require exactly one <td> per line. If you want to use other widths (like in your uploaded file, which is improperly formatted btw, because it's missing some </tr>s), of course you have to adjust them accordingly. This can also be done dynamically, i.e. via a variable, but that requires a bit more effort in bash.
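For the Python version, one way to avoid hard-coding the width is to group the cells by their <tr> parent instead of slicing the flat list into threes. A minimal sketch, assuming the input is well-formed (ElementTree raises a ParseError on missing closing tags, so it would not accept the uploaded file as it stands); rows_as_csv is a made-up name:

```python
import xml.etree.ElementTree as et


def rows_as_csv(root):
    """Return one comma-joined string per <tr> that contains <td> cells.

    Hypothetical helper; rows holding only <th> cells are skipped.
    """
    lines = []
    for tr in root.iter("tr"):
        # itertext() also picks up text inside nested tags such as <b>.
        cells = ["".join(td.itertext()).strip() for td in tr.iter("td")]
        if cells:
            lines.append(",".join(cells))
    return lines
```

Used as print("\n".join(rows_as_csv(et.parse("infile").getroot()))), each row keeps however many columns it really has.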
