Turn HTML data into delimited text

I have a file I've already partially pruned with grep that has data like:

    <a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
    Approved </td>

--
<a href="MasterDetailResults.asp?textfield=a&Application=3d Home Architect 6">3d Home Architect 6</a> </td>
Not Approved </td>
--
<a href="MasterDetailResults.asp?textfield=a&Application=A to Zap">A to Zap</a> </td>
Approved </td>
--

except much, much more of it :wink:

I want to get the application name (e.g. 3D Home Architect 4) and the status (i.e. Approved or Not Approved) and turn it into this:

3D Home Architect 4|Approved
3d Home Architect 6|Not Approved
A to Zap|Approved
etc.

for use as a searchable database or import into Excel

I want to use bash scripting with sed or gawk to do this in the smallest number of lines (number of lines is not critical, of course :wink:).

Thanks in advance for your help.

Try this:

awk -F"\"" '/Application=/{sub(".*a&","",$2);s=$2;getline;FS=" ";$0=$0;print s"|"$1}' file

Thanks Franklin52, that's a start. I got:
Application=3D Home Architect 4|Approved
Application=3d|Not
Application=A|Approved
when I ran it. I'll keep on working on it.

Hi,

try

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file

If your sed doesn't support \n, you have to write

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\
\(.*\)<.*/\1 | \2/p}' file

instead.

HTH Chris

This should work:

awk -F"\"" '
/Application=/{
  sub(".*=","",$2); s=$2
  getline; sub(" <.*","")
  print s "|" $0
}' file

perl:

undef $/;
open FH,"<d:/a.txt";
$str=<FH>;
@arr=split("--",$str);
map {s/<a.*>(.*)<\/a>(.*)<\/td>\n(.*)<\/td>/$1|$3/} @arr;
print "@arr";
close FH;

Thank you all for your solutions. I'm going to use Christoph Spohr's because I'm more comfortable with sed than I am with awk (although I know it's very powerful). I get an output with spaces after the pipe because there are spaces at the beginning of the line. How can I modify

sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file

to get rid of those spaces?
Also, what if my input file has another line between the two lines in question:

    <tr> 
      <td height="23" align="default" valign="top"> 
        <a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
      <td align="default" valign="top"> 
        Approved </td>
    </tr>

Once again, I need: Application Name|Status as my output. I've been removing the
<td align="default" valign="top">
line with sed before finishing things off with the sed code above.
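
One possible way to cover both points in a single pass, offered as a sketch rather than a drop-in answer: save the application name in sed's hold space, read ahead past any line that contains nothing but an opening <td ...> tag, trim the whitespace around the status, and join the two with a pipe. The function name extract_status is made up for illustration, and the script assumes the status is always on the first line after the link that is not a bare <td ...> tag.

```shell
# extract_status is a hypothetical helper name; pass the HTML file(s) as arguments,
# or pipe the HTML in on standard input.
extract_status() {
  sed -n '/Application=/{
    # keep only the application name and save it in the hold space
    s/.*Application=\([^"]*\)".*/\1/
    h
    :next
    # replace the pattern space with the following input line
    n
    # skip a line holding nothing but an opening <td ...> tag
    /^[[:space:]]*<td[^>]*>[[:space:]]*$/b next
    # trim leading whitespace, then drop the blanks and markup after the status
    s/^[[:space:]]*//
    s/[[:space:]]*<.*//
    # append the status to the saved name, giving "name\nstatus", and join with a pipe
    H
    g
    s/\n/|/p
  }' "$@"
}
```

Called as `extract_status file > apps.txt`, this should make the separate sed pass that removes the <td align="default" valign="top"> line unnecessary. If other kinds of lines can appear between the link and the status, the skip pattern would need to be widened to match them.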