extract strings between tags

userscript · August 4, 2009, 2:46pm

Hi,

I have data as follows in a text file

<key='data1'>
<String>abcdef</String>
<String>abcdef1</String>
<String>abcdef2</String>
</key>

<key='data2'>
<String>abcdef</String>
<String>abcdef1</String>
<String>abcdef2</String>
<String>abcdef3</String>
</key>

Is there a way I can get just the entries between <String> and </String> inside the data1 tag?

Appreciate any help.

malcomex999 · August 4, 2009, 4:31pm

It would be better if you also posted your expected output, but try this:

 
sed -n '/\<String\>/,/\<\/String\>/p' yourfile

drl · August 4, 2009, 11:23pm

Hi.

I don't use XMLish files, but I ran across this utility. If you have access to xml_grep, this task can be straightforward. I modified your data file to make it well-formed and to differentiate between data1 and data2, then ran this script:

#!/usr/bin/env bash

# @(#) s1	Demonstrate extract data from XML file, xml_grep.
# Reference for XPath: http://en.wikipedia.org/wiki/XPath_1.0
# xml_grep: http://xmltwig.com/tool/

echo
set +o nounset
LC_ALL=C ; LANG=C ; export LC_ALL LANG
echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version "=o" $(_eat $0 $1) xml_grep
set -o nounset
echo

FILE=${1-data1}

echo " Data file $FILE:"
cat $FILE

echo
echo " Results:"
xml_grep --text_only --cond '*[@name="data1"]/String' $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0 
GNU bash 3.2.39
/usr/bin/xml_grep version 0.7

 Data file data1:
<project>
<key name="data1">
<String>abcdef</String>
<String>abcdef1</String>
<String>abcdef2</String>
</key>

<key name="data2">
<String>abcdefg</String>
<String>abcdefg1</String>
<String>abcdefg2</String>
<String>abcdefg3</String>
</key>
</project>

 Results:
abcdef
abcdef1
abcdef2

The xml_grep Perl script was in the Debian repository for me. The site URL is listed in the script above. Good luck ... cheers, drl

edidataguy · August 5, 2009, 12:38am

 
sed -e 's/\(<[^<][^<]*>\)//g' file.xml
 
OR
 
sed -e 's/\(<[^<][^<]*>\)//g; /^$/d' file.xml
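
A quick way to see what this substitution does is to run it on a few lines. The sketch below wraps it in a shell function; strip_tags is a made-up name, not something from the thread, and the grouping parentheses are dropped because the group is never referenced. Note that it removes every tag on every line, so on the sample data it would print the strings under both data1 and data2; it does not select a key by itself.

```shell
# strip_tags is a hypothetical helper name used only for this illustration.
# It deletes anything that looks like <...> and then drops lines left empty.
strip_tags() {
    sed -e 's/<[^<][^<]*>//g' -e '/^$/d'
}

# The two key lines become empty and are deleted; only the string text remains.
printf '%s\n' "<key='data1'>" '<String>abcdef</String>' '</key>' | strip_tags
# prints: abcdef
```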

ghostdog74 · August 5, 2009, 6:21am

gawk

awk 'BEGIN{RS="";FS="</String>"}
/data1/{
 for(i=1;i<=NF;i++){
    if($i ~ /String/){
        gsub(/.*String>/,"",$i)
        print $i
    }    
 } 
}' file
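
The awk program above relies on paragraph mode (RS=""), so it needs a blank line between the key blocks. If the blocks are not separated that way, a line-oriented variant with a flag is one alternative. This is only a sketch: extract_key_strings is a made-up name, and it assumes the <key='NAME'> opening line from the first post and one <String> element per line.

```shell
# extract_key_strings NAME FILE
# Hypothetical helper: prints the text of each <String> element found between
# the line <key='NAME'> and the next </key>. \047 is the octal escape for a
# single quote, which avoids quoting trouble inside the shell's '...'.
extract_key_strings() {
    awk -v key="$1" '
        index($0, "<key=\047" key "\047>") { inside = 1; next }  # block starts
        /<\/key>/                          { inside = 0 }        # block ends
        inside && match($0, /<String>.*<\/String>/) {
            # skip the 8 characters of <String>, drop the 9 of </String>
            print substr($0, RSTART + 8, RLENGTH - 17)
        }
    ' "$2"
}
```

It would be called as: extract_key_strings data1 file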

userscript · August 5, 2009, 12:01pm

Thank you all for your replies.

For the sed commands I am getting this output:

C:\Perl>sed -e 's/\(<[^<][^<]*>\)//g' dump.xml
The filename, directory name, or volume label syntax is incorrect.

The output is the same for all the sed commands.

I tried the awk code and I got this error:

String found where operator expected at awk.pl line 9, near "}'"
(Might be a runaway multi-line '' string starting on line 1)
(Missing semicolon on previous line?)
syntax error at awk.pl line 9, near "}'"
Execution of awk.pl aborted due to compilation errors.

Line 9 is the last line, where I gave my filename, i.e. }' dump.xml.

Not sure what is wrong. Appreciate any help.
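
Two things seem to be going on in the errors above. The C:\Perl> prompt suggests the Windows command interpreter, which does not treat single quotes as quoting characters, so the < and > inside the sed expression are taken as redirection operators. The messages for awk.pl are Perl diagnostics, which suggests the awk program was run through perl rather than awk. One way to sidestep the quoting problem is to keep the sed commands in a script file and pass it with -f. A minimal sketch, written for a Unix-like shell; the names write_strip_script and strip_tags.sed are made up for illustration:

```shell
# Hypothetical helper: writes the two sed commands from the earlier reply
# into a script file, so nothing on the command line needs quoting.
write_strip_script() {
    cat > "$1" <<'EOF'
s/<[^<][^<]*>//g
/^$/d
EOF
}

# Usage (file names are examples only):
#   write_strip_script strip_tags.sed
#   sed -f strip_tags.sed dump.xml
```

On Windows the same two-line script file could be created in any text editor; the sed -f command line itself contains no characters that need quoting.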

userscript · August 6, 2009, 1:34pm

Thanks for all the posts. I finally got it working. However, I am getting output from both the 'data1' and 'data2' tags.

My expected output is just from the data1 tag, i.e.:

abcdef
abcdef1
abcdef2

Thanks in advance for any help.

durden_tyler · August 6, 2009, 1:54pm

Here's one way to do it using Perl:

$
$ cat file1
<key='data1'>
<String>abcdef</String>
<String>abcdef1</String>
<String>abcdef2</String>
</key>
<key='data2'>
<String>ABCDEF</String>
<String>ABCDEF1</String>
<String>ABCDEF2</String>
<String>ABCDEF3</String>
</key>
$
$ perl -ne 'print $1,"\n" if $_ =~ m/data1/i...m/\/key/i and />(.*)</' file1
abcdef
abcdef1
abcdef2
$
$

tyler_durden

userscript · August 6, 2009, 3:06pm

I tried the following code, but it doesn't seem to work:

open (FILE, "/path/dump.xml") || die ("Can't open dump.xml\n");
while (<FILE>)
{
#$sentence=~/data1/ - option1
$sentence="<key='data1'>"
#$sentence = "<key='data1'>" - option1
if($sentence eq "<key='data1'>")
{
sed -e 's/\(<[^<][^<]*>\)//g; /^$/d' dump.xml
}
else
{
print 'no match';
}
}

I am getting the following error but I am not able to figure out where the mistake is:

syntax error at sed.pl line 9, near ")
{"
Execution of sed.pl aborted due to compilation errors.

Appreciate any help.

durden_tyler · August 6, 2009, 8:42pm

Well, you have put a sed command in what looks like a Perl script.

It's like adding a COBOL statement in a Java program.
Or a Visual Basic statement in a C program.

What do you think would happen?

tyler_durden

edidataguy · August 6, 2009, 11:21pm

You said you finally got it working.
What was wrong and what did you fix?
Can you give details?

Now try this:

sed '1,/<\/key>/! d; s/\(<[^<][^<]*>\)//g; /^$/d;' file.xml
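
The command above keeps everything from line 1 through the first </key>, so it depends on data1 being the first block in the file. A variant that selects the block by its key name might look like the following. It is a sketch under the assumption that the opening line has the form <key='NAME'> as in the first post and that each <String> element sits on one line; extract_by_key is a hypothetical name.

```shell
# extract_by_key NAME FILE
# Hypothetical helper: prints the text of each <String> element inside the
# <key='NAME'> ... </key> block. NAME is inserted into the sed pattern as-is,
# so it should not contain regex metacharacters or slashes.
extract_by_key() {
    # -n suppresses default output; the address range limits the substitution
    # to the chosen block, and the p flag prints only lines where it matched.
    sed -n "/<key='$1'>/,/<\/key>/ s/.*<String>\(.*\)<\/String>.*/\1/p" "$2"
}
```

Called as extract_by_key data1 file.xml, this should print only the data1 strings for data laid out like the sample in the first post.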