I think the opposite of a typo must be a reado. The question really says "content" (and does not seem to have been edited).
Nevertheless, the tags used in the HTML are crucial, so posting an image of the formatted query is of no use at all.
Or (sees the light) the markup in the HTML was processed when it was posted to the site, and I cannot see a way to get the original HTML back via an edit.
But "show your progress to date" is of course mandatory.
@pkrabi78,
please properly format your input data sample with markdown codes (described here).
We'll take another look at helping you once your previous post is properly formatted.
Good luck!
This is the way that I'd approach it, at least to start. The solution pre-parses with sed to eliminate the pesky carriage returns that your sample file had, and to break each HTML tag out onto its own line. Breaking the tags out makes things a bit easier to parse, and should a bold tag be replaced by an italic tag in a field, you won't need to handle it specifically.
#!/usr/bin/env bash
sed 's!\r!!; s!<!\n<!g; s!>!>\n!g;' $1 | awk '
BEGIN {
skip = 1 # start in skip mode
ofs = "," # output field separator
cfs = "" # current field separator
tsep = "" # token separator
}
NF == 0 { next } # ignore empty lines
/<\/TR/ || /<\/tr/ { # close table row -- skip until next row
skip = 1
next
}
/<TR/ || /<tr/ { # open row we start printing; turn skip off
skip = 0
if( cfs != "" ) { # if current field sep not nil
printf( "\n" ) # finish the previous line
cfs = "" # reset current sep
}
next
}
# uncomment to stop after first table
#/<\/table/ || /<\/TABLE/ {
# exit( 0 );
#}
skip { next }
/<TD/ || /<td/ { # next column, print current field sep
printf( "%s", cfs )
cfs = ofs # set sep for all remaining columns in row
tsep = ""
next
}
/</ { next } # all other tags are ignored
{ # not a tag, not skipping, print content
printf( "%s%s", tsep, $0 ) # edited, the original code had the parameters reversed
tsep = " "
}
END {
if( cfs != "" ) printf( "\n" ) # terminate the final row
}
'
Running the script outputs every table row as a comma-separated set of fields. It does not assume a header row, so any header is printed like the other rows. It also treats trailing whitespace in a column (e.g. <b>property </b>) as significant and does not discard it. It also doesn't convert special characters that are escaped in the HTML (e.g. an ampersand), so it is just a start, depending on how robust you need it to be.
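If the two limitations above matter for your data, here is a minimal Python sketch of a post-processing step that decodes escaped characters and trims whitespace. The function names clean_field and to_csv_line are made up for illustration, and the sketch assumes the fields themselves contain no commas or quotes (it does no CSV quoting).

```python
import html


def clean_field(raw):
    """Decode HTML entities such as &amp;, then trim surrounding whitespace."""
    return html.unescape(raw).strip()


def to_csv_line(fields):
    """Join cleaned fields with commas (assumes no commas inside fields)."""
    return ",".join(clean_field(field) for field in fields)


# Hypothetical cell contents as they might come out of one table row.
print(to_csv_line(["property ", "AT&amp;T", " 42"]))
```

If your fields can contain commas, the csv module in the standard library would be the safer way to write the line.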
If you have a command that generates the HTML and you want to pipe it directly into the script that is possible too:
html_gen_command | html_parser.sh
The $1 in the sed command expands to the file name if one is given on the command line. If no file name is given, it expands to nothing and sed reads from standard input instead. This gives you the flexibility to use a pipe or redirection to feed the HTML to the script.
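For comparison, the same "file name if given, otherwise standard input" behaviour could be sketched in Python roughly like this. The name read_html and its stdin parameter are my own invention for the example; the parameter only exists so the fallback can be tried without a real pipe.

```python
import io
import sys


def read_html(argv, stdin=None):
    """Read argv[1] if a file name was given, else read standard input.

    Carriage returns are removed, like the first sed expression does.
    """
    if len(argv) > 1:
        # newline="" keeps any \r in the data so we can strip it ourselves
        with open(argv[1], encoding="utf-8", newline="") as handle:
            text = handle.read()
    else:
        source = sys.stdin if stdin is None else stdin
        text = source.read()
    return text.replace("\r", "")


# Demo with an in-memory stand-in for standard input (no file name given).
demo = io.StringIO("<tr>\r\n<td>Acme</td>\r\n</tr>\r\n")
print(repr(read_html(["html_parser.py"], stdin=demo)))
```

In a real script you would call read_html(sys.argv) and leave the stdin parameter alone.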
Hope that makes sense, let me know if it does not.
#!/bin/bash
while read -r company && read -r contact && read -r country; do
echo "$company,$contact,$country" | sed -r "s,</?td>,,g"
done < <(grep "<td>" infile)
# or
while read -r company && read -r contact && read -r country; do
echo "$company,$contact,$country"
done < <(xmllint --xpath '//td/text()' infile)
#!/usr/bin/python3
import xml.etree.ElementTree as et
td = list(et.parse("infile").getroot().iter("td"))
for data in [td[i:i + 3] for i in range(0, len(td), 3)]:
print(",".join(str(d.text) for d in data))
All suggestions are based on a fixed table width of 3 and, for the bash versions, exactly one <td> per line. If you want to use other widths (like in your upload file, which is improperly formatted btw, because it's missing some </tr>s), of course you have to adjust it accordingly. It can also be done dynamically, e.g. via a variable, but that requires a bit more effort in bash.
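As a rough idea of how the width could be handled dynamically, here is a sketch that groups cells by <tr> instead of counting to 3, using html.parser from the Python standard library. It is not the same approach as the ElementTree version above: html.parser does not need well-formed XML, so the sketch treats a new <tr> as the end of a row whose </tr> is missing. The class name TableRows and the helper table_to_rows are made up for this example, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished rows
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def _end_cell(self):
        if self.cell is not None and self.row is not None:
            # join the pieces and collapse runs of whitespace
            self.row.append(" ".join("".join(self.cell).split()))
        self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # a new <tr> also closes an unclosed row
            self.row = []
        elif tag in ("td", "th"):
            self._end_cell()
            if self.row is None:
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_rows(markup):
    parser = TableRows()
    parser.feed(markup)
    parser.close()      # flush any buffered text
    parser._end_row()   # keep a last row that was never closed
    return parser.rows


# Invented sample: the first row is missing its </tr>, the second has 2 cells.
sample = ("<table><tr><td>Acme</td><td>Ann</td><td>UK</td>"
          "<tr><td>Bolt <b>Ltd</b></td><td>Bob</td></tr></table>")
for row in table_to_rows(sample):
    print(",".join(row))
```

Each row is printed with however many cells it actually has, so no width needs to be configured. Like the other versions, it does no CSV quoting.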