Extracting specific characters from a text file

livos23 · April 10, 2009, 12:58am

I'm extremely new to scripting and Linux in general, so please bear with me. The class I'm taking gives virtually no instruction at all, so I'm trying to learn everything off the web.
Anyway, I'm trying to extract the characters that come after a specific pattern ( '<B><FONT FACE="Arial">' ) but before the next '<' in a text file. I'm having trouble because there are no spaces, so I can't use $. I'm not even sure what kind of commands I should be using. I tried working with awk, but that didn't get me exactly what I want. Now I'm trying to figure out other ways to do this, but I really have no idea where to start. Any help is greatly appreciated.

Franklin52 · April 10, 2009, 9:44am

Maybe this throws you in at the deep end with sed, but for scripting, sed and awk are indispensable IMHO.

sed 's/.*">\(.*\)<.*/\1/'

It's an ordinary sed substitute command in this form: sed 's/replace this/with this/'

With sed you can save substrings with \(.*\) and recall them with \1, \2, \3 etc.

The command captures the piece after "> and before the last < on the line in the group \(.*\), and \1 puts it back as the whole replacement.
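To make the "last <" point concrete, here is a small sketch. The sample line is an assumption modelled on the markup in the question, and the second command swaps \(.*\) for \([^<]*\), which is a variation and not part of the command above.

```shell
# Sample line, modelled on the markup described in the question (an assumption).
line='<B><FONT FACE="Arial">hello</FONT></B>'

# \(.*\) is greedy, so the capture runs up to the LAST "<" on the line.
greedy=$(printf '%s\n' "$line" | sed 's/.*">\(.*\)<.*/\1/')

# \([^<]*\) cannot cross a "<", so the capture stops at the FIRST one.
first=$(printf '%s\n' "$line" | sed 's/.*">\([^<]*\)<.*/\1/')

printf '%s\n' "$greedy" "$first"
```

On this input the first result still contains </FONT>, while the second is just the text. If several closing tags follow the text you want, the [^<]* form is probably the safer one.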

Regards

fpmurphy · April 10, 2009, 10:20am

Here is an example of another way of doing it using sed:

TMP=file.$$

cat <<EOT >$TMP
<header>
<description>This is description</description>
<content><B><FONT FACE="Arial">hello livos23</FONT></B></content>
</header>
EOT

# [[:print:]]* is greedy: it takes as much as it can while still
# leaving a closing tag for the rest of the pattern to match
var=$(sed -n 's/\(<description>\)\([[:print:]]*\)<\/[^>]*>/\2/p' "$TMP")
printf '%s\n' "$var"

# more general case: [^<]* stops at the first < after the pattern
var=$(sed -n 's/^.*<B><FONT FACE="Arial">\([[:print:]][^<]*\).*$/\1/p' "$TMP")
printf '%s\n' "$var"

rm "$TMP"

exit 0

The output is

This is description
hello livos23
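The -n option and the trailing p flag in those commands are what keep the non-matching lines out of the output. A minimal sketch of the difference follows; the two-line input and the variable names are made up for illustration.

```shell
# Hypothetical two-line input: only the second line contains the pattern.
input='<title>ignore me</title>
<B><FONT FACE="Arial">keep me</FONT></B>'

# Without -n, sed prints every line, so non-matching lines pass through unchanged.
all=$(printf '%s\n' "$input" | sed 's/^.*<B><FONT FACE="Arial">\([^<]*\).*$/\1/')

# With -n and the p flag, only lines where the substitution succeeded are printed.
only=$(printf '%s\n' "$input" | sed -n 's/^.*<B><FONT FACE="Arial">\([^<]*\).*$/\1/p')

printf '%s\n' "$only"
```

Here $all still carries the <title> line, while $only holds just the extracted text.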

quirkasaurus · April 10, 2009, 10:25am

sed -e 's/<[^>]*>//g' e.html

This removes every tag: a <, then any run of characters that are not >, then the closing >.
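Note that this keeps the text of every element, not only the part after the Arial tag. A quick sketch, using two lines borrowed from the sample file earlier in the thread in place of e.html:

```shell
# Two lines taken from the sample file posted above (stand-in for e.html).
input='<description>This is description</description>
<content><B><FONT FACE="Arial">hello livos23</FONT></B></content>'

# Delete every <...> tag; whatever text sat between the tags is left behind.
stripped=$(printf '%s\n' "$input" | sed -e 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

Both lines of text survive, so if only the Arial text is wanted, a pattern anchored on that tag is likely a better fit.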