Hi.
I like awk scripts, but I also like generality, as long as it is not difficult or complicated. If the files are formatted neatly into lines as shown in the OP, then basic awk scripts are fine. However, if the markup spans lines, as in the data file shown below, then other tools such as a text-mode browser may be more useful, and they are not much more difficult to use, as shown here:
#!/usr/bin/env bash
# @(#) s1 Demonstrate plain-text transformation of URLs.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
# Debug output is switched off by this no-op; delete the next line to enable db.
db() { : ; }
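# Report the environment and tool versions if the local helper exists.
C=$HOME/bin/context && [ -f "$C" ] && "$C" awk sed grep elinks
FILE=${1-data2}
pl " Input data file $FILE:"
cat "$FILE"
pl " Results, first awk:"
# Keep only the lines that contain neither "<" nor ">".
awk '($0 !~ /</ && $0 !~ />/)' "$FILE"
pl " Results, second awk:"
# Delete each tag that opens and closes on one line; print non-empty lines.
awk '{gsub(/<[^>]*>/, "")}$0!=""' "$FILE"
pl " Results, links (or elinks):"
links -dump "$FILE"
pl " Results, elinks (with added paragraph after headline):"
# Add <p> after </HEADLINE> so the dateline is rendered as its own paragraph.
sed 's/<\/HEADLINE>/& <p>/' "$FILE" > f1
elinks -dump f1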
pl " Results, elinks (with added paragraph after headline, delete empy lines):"
sed 's/<\/HEADLINE>/& <p>/' "$FILE" > f1
elinks -dump f1 |
grep -v '^[[:space:]]*$'
exit 0
producing:
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian 5.0.8 (lenny, workstation)
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
sed GNU sed version 4.1.5
grep GNU grep 2.5.3
ELinks 0.11.4 (built on Sep 20 2008 16:40:51)
-----
Input data file data2:
<DOC id="ID-NAME" type="story" > <HEADLINE> Relative Size Capital
</HEADLINE> <DATELINE> Los , Monday </DATELINE> <TEXT> <P> The first
para consists of this format.have fully </P> <P> Meanwhile, the rest
of the story are in the XML format as in the present document format.
</P> </TEXT> </DOC>
-----
Results, first awk:
of the story are in the XML format as in the present document format.
-----
Results, second awk:
Relative Size Capital
Los , Monday The first
para consists of this format.have fully Meanwhile, the rest
of the story are in the XML format as in the present document format.
-----
Results, links (or elinks):
Relative Size Capital Los , Monday
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present
document format.
-----
Results, elinks (with added paragraph after headline):
Relative Size Capital
Los , Monday
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present
document format.
-----
Results, elinks (with added paragraph after headline, delete empy lines):
Relative Size Capital
Los , Monday
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present
document format.
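A note on the second awk result above: it works on data2 because every tag there opens and closes on the same line. If a single tag were itself broken across lines, a per-line gsub would miss it. Here is a minimal sketch of one way around that, assuming the file is small enough to hold in memory and that no ">" appears inside an attribute value or a comment. The name strip_tags is my own for illustration and is not part of s1:

```shell
#!/usr/bin/env bash
# Sketch only: strip_tags is a hypothetical helper, not part of script s1.
# It collects the whole input into one string, so a tag that is broken
# across lines is still matched, then deletes everything from < to >.
strip_tags() {
  awk '
    { buf = buf $0 "\n" }            # append each line, keeping its newline
    END {
      gsub(/<[^>]*>/, "", buf)       # [^>] also matches a newline here
      printf "%s", buf
    }
  ' "$@"
}

# Example: the opening DOC tag is split over two lines.
printf '%s\n' '<DOC id="ID-NAME"' ' type="story" > <P> The first' 'para </P> </DOC>' |
  strip_tags |
  grep -v '^[[:space:]]*$'
```

With a file name as argument, for example strip_tags data2, it reads the file instead of standard input. It only removes tags; it does not reflow paragraphs the way links or elinks do.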
The utility elinks is available in many repositories, including those for CentOS and Debian, and for the Mac through Homebrew (brew).
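Since a given machine may have links, elinks, both, or neither, a script can check before calling one. A small sketch, assuming only the two browsers used in s1 are of interest; the function name pick_dumper is made up for this example:

```shell
#!/usr/bin/env bash
# Sketch only: pick_dumper is a hypothetical helper, not part of script s1.
# Print the name of the first text-mode browser found on PATH and return 0;
# print nothing and return 1 if neither is installed.
pick_dumper() {
  for _b in elinks links; do
    if command -v "$_b" >/dev/null 2>&1; then
      printf '%s\n' "$_b"
      return 0
    fi
  done
  return 1
}

# Possible use inside a script such as s1:
#   if dumper=$(pick_dumper); then "$dumper" -dump "$FILE"; fi
```

Both programs accept -dump, as used in the script above, so the caller does not need to know which one was found.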
Best wishes ... cheers, drl