Turn HTML data into delimited text

macxcool · November 21, 2008, 2:58pm

I have a file I've already partially pruned with grep that has data like:

    <a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
    Approved </td>
    --
    <a href="MasterDetailResults.asp?textfield=a&Application=3d Home Architect 6">3d Home Architect 6</a> </td>
    Not Approved </td>
    --
    <a href="MasterDetailResults.asp?textfield=a&Application=A to Zap">A to Zap</a> </td>
    Approved </td>
    --

except there is much, much more of it.

I want to get the application name (e.g. 3D Home Architect 4) and the status (i.e. Approved or Not Approved) and turn them into this:

3D Home Architect 4|Approved
3d Home Architect 6|Not Approved
A to Zap|Approved
etc.

for use as a searchable database or for import into Excel.

I want to use bash scripting with sed or gawk to do this in the smallest number of lines (though the number of lines is not critical, of course).

Thanks in advance for your help.

Franklin52 · November 21, 2008, 3:45pm

Try this:

awk -F"\"" '/Application=/{sub(".*a&","",$2);s=$2;getline;FS=" ";$0=$0;print s"|"$1}' file

macxcool · November 21, 2008, 4:47pm

Thanks Franklin52, that's a start. I got:
Application=3D Home Architect 4|Approved
Application=3d|Not
Application=A|Approved
when I ran it. I'll keep on working on it.

Christoph_Spohr · November 21, 2008, 5:47pm

Hi,

try

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file

If your sed doesn't support \n, you have to write

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\
\(.*\)<.*/\1 | \2/p}' file

instead.

HTH Chris

Franklin52 · November 22, 2008, 7:05am

This should work:

awk -F"\"" '
/Application=/{
  sub(".*=","",$2); s=$2
  getline; sub(" <.*","")
  print s "|" $0
}' file

summer_cherry · November 23, 2008, 5:34am

perl:

undef $/;
open FH,"<d:/a.txt";
$str=<FH>;
@arr=split("--",$str);
map {s/<a.*>(.*)<\/a>(.*)<\/td>\n(.*)<\/td>/$1|$3/} @arr;
print "@arr";
close FH;

macxcool · November 24, 2008, 9:50am

Thank you all for your solutions. I'm going to use Christoph Spohr's because I'm more comfortable with sed than I am with awk (although I know it's very powerful). I get an output with spaces after the pipe because there are spaces at the beginning of the line. How can I modify

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file

to get rid of those spaces?
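
One way the leading spaces might be dropped, offered as a sketch rather than a fix verified against the real file: let `[[:space:]]*` consume the whitespace right after the newline, before the second capture group starts, and make that group end on a non-space character. The function name `strip_status` is hypothetical. The sketch assumes the two-line, grep-pruned input shown earlier and a sed that accepts `\n` and POSIX character classes in the pattern (GNU sed does).

```shell
# Hypothetical wrapper (the name is an assumption); reads the pruned data on stdin.
strip_status() {
    # [[:space:]]* after \n swallows the indentation of the status line;
    # \(.*[^[:space:]]\) makes the status end on its last non-space character.
    # The replacement is \1|\2, with no spaces around the pipe.
    sed -n '/Application=/{N;s/.*Application=\([^"]*\).*\n[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<.*/\1|\2/p;}'
}
```

Called as `strip_status < file`, this should print lines of the form `Application Name|Status`.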
Also, what if my input file has another line between the two lines in question:

    <tr> 
      <td height="23" align="default" valign="top"> 
        <a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
      <td align="default" valign="top"> 
        Approved </td>
    </tr>

Once again, I need: Application Name|Status as my output. I've been removing the
<td align="default" valign="top">
line with sed before finishing things off with the sed code above.
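
A possible way to skip the separate clean-up pass, assuming exactly one `<td ...>` line always sits between the link line and the status line: append two lines with `N` instead of one, and let the greedy `.*\n` run through to the last newline in the pattern space. This is only a sketch under that assumption, and it also assumes a sed in which `.` matches an embedded newline (GNU sed's does). The function name `extract_status` is hypothetical.

```shell
# Hypothetical helper (the name is an assumption); reads the unpruned HTML on stdin
# and prints one "Application Name|Status" line per record.
extract_status() {
    sed -n '/Application=/{
        # Pull in the intervening <td ...> line and the status line.
        N
        N
        # \1 = text after Application= up to the closing quote;
        # \2 = the status line with leading and trailing whitespace removed.
        s/.*Application=\([^"]*\)".*\n[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*<\/td>.*/\1|\2/p
    }'
}
```

Run as `extract_status < file` on the original HTML. A record with more than one line between the link and the status would not be matched by this pattern, so it would be skipped rather than reported.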