File filter script

my_Perl · October 6, 2015, 12:59am

I need help writing a script to filter the input file INPUT.TXT shown below:

<DOC id="ID-NAME" type="story" >
<HEADLINE>
Relative Size Capital 
</HEADLINE>
<DATELINE>
Los , Monday 
</DATELINE>
<TEXT>
<P>
The first para consists of this format.have fully
</P>
<P>
Meanwhile, the rest of the story are in the XML format as in the present document format. 
</P>
</TEXT>
</DOC>

After filtering the above document, I want the following output in OUTPUT.TXT:

Relative Size Capital 
Los , Monday 
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present document format.

Thanks in advance

RavinderSingh13 · October 6, 2015, 1:34am

Hello my_Perl,

You could try the following.

awk '($0 !~ /</ && $0 !~ />/)' Input_file

Output will be as follows.

Relative Size Capital
Los , Monday
The first para consists of this format.have fully
Meanwhile, the rest of the story are in the XML format as in the present document format.

Thanks,
R. Singh

Don_Cragun · October 6, 2015, 1:45am

If some lines contain both tags and text, you could also try:

awk '{gsub(/<[^>]*>/, "")}$0!=""' INPUT.TXT

Note that this code (and the code RavinderSingh13 suggested) doesn't throw away trailing spaces at the end of input lines as you did in your desired output.

If that is important to you, you can add a call to sub() or gsub() after the call to gsub() to strip trailing (or leading, or both leading and trailing) spaces or spaces and tabs.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
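The trimming step mentioned above could look like the following. This is only a sketch: the sample lines and the temporary file are made up for illustration, and it strips both leading and trailing spaces and tabs with two sub() calls placed after the gsub().

```shell
# Hypothetical sample input: tagged lines with stray leading/trailing spaces.
tmp=$(mktemp)
printf '%s\n' '<HEADLINE>' 'Relative Size Capital   ' '</HEADLINE>' \
    '<P>  Los , Monday  </P>' > "$tmp"

# Remove tags, trim blanks at both ends, then print only non-empty lines.
result=$(awk '{
    gsub(/<[^>]*>/, "")   # drop every tag on the line
    sub(/^[ \t]+/, "")    # strip leading spaces and tabs
    sub(/[ \t]+$/, "")    # strip trailing spaces and tabs
} $0 != ""' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

Drop the first sub() if leading blanks should be kept.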

drl · October 6, 2015, 9:09am

Hi.

I like awk scripts, but I also like generality, as long as it's not difficult or complicated. If the files are formatted nicely into lines as shown in the OP, then basic awk scripts are fine. However, if the markup spans lines as shown below, then other solutions might be useful (and not a lot more difficult), as shown here:

#!/usr/bin/env bash

# @(#) s1	Demonstrate plain-text extraction from marked-up text.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk sed grep elinks

FILE=${1-data2}

pl " Input data file $FILE:"
cat $FILE

pl " Results, first awk:"
awk '($0 !~ /</ && $0 !~ />/)' $FILE

pl " Results, second awk:"
awk '{gsub(/<[^>]*>/, "")}$0!=""' $FILE

pl " Results, links (or elinks):"
links -dump $FILE

pl " Results, elinks (with added paragraph after headline):"
sed 's/<\/HEADLINE>/& <p>/' $FILE  > f1
elinks -dump f1

pl " Results, elinks (with added paragraph after headline, delete empty lines):"
sed 's/<\/HEADLINE>/& <p>/' $FILE  > f1
elinks -dump f1 |
grep -v '^[[:space:]]*$'

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
sed GNU sed version 4.1.5
grep GNU grep 2.5.3
ELinks 0.11.4 (built on Sep 20 2008 16:40:51)

-----
 Input data file data2:
<DOC id="ID-NAME" type="story" > <HEADLINE> Relative Size Capital
</HEADLINE> <DATELINE> Los , Monday </DATELINE> <TEXT> <P> The first
para consists of this format.have fully </P> <P> Meanwhile, the rest
of the story are in the XML format as in the present document format.
</P> </TEXT> </DOC>

-----
 Results, first awk:
of the story are in the XML format as in the present document format.

-----
 Results, second awk:
  Relative Size Capital
  Los , Monday    The first
para consists of this format.have fully   Meanwhile, the rest
of the story are in the XML format as in the present document format.
  

-----
 Results, links (or elinks):
   Relative Size Capital Los , Monday

   The first para consists of this format.have fully

   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

-----
 Results, elinks (with added paragraph after headline):
   Relative Size Capital

   Los , Monday

   The first para consists of this format.have fully

   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

-----
 Results, elinks (with added paragraph after headline, delete empty lines):
   Relative Size Capital
   Los , Monday
   The first para consists of this format.have fully
   Meanwhile, the rest of the story are in the XML format as in the present
   document format.

The elinks utility is available in many repositories, including those for CentOS and Debian, and on the Mac through Homebrew (brew).
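Where installing elinks is not an option, markup that spans lines can also be handled in awk alone. The sketch below is one possible approach, not a general XML parser. It joins the whole file into one string, splits it at every tag, and prints each non-empty piece on its own line. It assumes the file fits in memory, that no attribute value contains ">", and that every tag marks a line break (an inline tag would split a sentence). The temporary file is hypothetical and holds the same data as data2 above.

```shell
# Recreate the multi-line sample (same content as data2) in a temp file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<DOC id="ID-NAME" type="story" > <HEADLINE> Relative Size Capital
</HEADLINE> <DATELINE> Los , Monday </DATELINE> <TEXT> <P> The first
para consists of this format.have fully </P> <P> Meanwhile, the rest
of the story are in the XML format as in the present document format.
</P> </TEXT> </DOC>
EOF

result=$(awk '
    { buf = buf $0 " " }                  # join all input lines into one string
    END {
        n = split(buf, part, /<[^>]*>/)   # every tag becomes a piece boundary
        for (i = 1; i <= n; i++) {
            gsub(/[ \t]+/, " ", part[i])  # squeeze runs of blanks
            sub(/^ /, "", part[i])        # trim the leading blank
            sub(/ $/, "", part[i])        # trim the trailing blank
            if (part[i] != "") print part[i]
        }
    }' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

Unlike the elinks output, this does not wrap long lines or indent the text, so each headline, dateline, and paragraph comes out as a single line.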

Best wishes ... cheers, drl