I have a file I've already partially pruned with grep that has data like:
<a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
Approved </td>
--
<a href="MasterDetailResults.asp?textfield=a&Application=3d Home Architect 6">3d Home Architect 6</a> </td>
Not Approved </td>
--
<a href="MasterDetailResults.asp?textfield=a&Application=A to Zap">A to Zap</a> </td>
Approved </td>
--
except much, much more of it
I want to get the application name (i.e. 3D Home Architect 4) and the status (i.e. Approved or Not Approved) and turn it into this:
3D Home Architect 4|Approved
3d Home Architect 6|Not Approved
A to Zap|Approved
etc.
for use as a searchable database or import into Excel
I want to use bash scripting with sed or gawk to do this in the smallest number of lines (number of lines is not critical, of course
Thanks in advance for your help.
Try this:
awk -F"\"" '/Application=/{sub(".*a&","",$2);s=$2;getline;FS=" ";$0=$0;print s"|"$1}' file
Thanks Franklin52, that's a start. I got:
Application=3D Home Architect 4|Approved
Application=3d|Not
Application=A|Approved
when I ran it. I'll keep on working on it.
Hi,
try
sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file
If you sed doesn't support \n you have to write
sed -n '/Application/{N;s/.*Application=\([^"]*\).*\
\(.*\)<.*/\1 | \2/p}' file
instead.
HTH Chris
This should work:
awk -F"\"" '
/Application=/{
sub(".*=","",$2); s=$2
getline; sub(" <.*","")
print s "|" $0
}' file
perl:
undef $/;
open FH,"<d:/a.txt";
$str=<FH>;
@arr=split("--",$str);
map {s/<a.*>(.*)<\/a>(.*)<\/td>\n(.*)<\/td>/$1|$3/} @arr;
print "@arr";
close FH;
Thank you all for your solutions. I'm going to use Christoph Spohr's because I'm more comfortable with sed than I am with awk (although I know it's very powerful). I get an output with spaces after the pipe because there are spaces at the beginning of the line. How can I modify
sed -n '/Application/{N;s/.*Application=\([^"]*\).*\n\(.*\)<.*/\1 | \2/p}' file
to get rid of those spaces.
Also, what if my input file has another line between the two lines in question:
<tr>
<td height="23" align="default" valign="top">
<a href="MasterDetailResults.asp?textfield=a&Application=3D Home Architect 4">3D Home Architect 4</a> </td>
<td align="default" valign="top">
Approved </td>
</tr>
Once again, I need: Application Name|Status as my output. I've been removing the
<td align="default" valign="top">
line with sed before finishing things off with the sed code above.