How to extract data from BNC xml with reference brackets?

Johnivy · December 11, 2008, 12:36am

I have data like the following pattern:
<change date="2000-01-09" who="#OUCS">Updated all catrefs</change>

<change date="2000-01-08" who="#OUCS">Manually updated tagcounts, titlestmt, and title in source</change>

<change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>

<change date="1994-11-24" who="#dominic">Initial accession to corpus</change>

</revisionDesc>
</teiHeader>
<wtext type="NONAC">
<div level="1" n="1" type="leaflet">
<head type="MAIN">
<s n="1">
<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

</s>

</head>

<p>
<s n="2">
<hi rend="bo">
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>

<w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>

<w c5="AJ0" hw="immune" pos="ADJ">Immune</w>

<w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>

<w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w>

</hi>

<w c5="NN1" hw="condition" pos="SUBST">condition</w>

<w c5="VVN" hw="cause" pos="VERB">caused</w>

I want to extract every element that matches this pattern:
<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>
First, I wrote the following command: sed 's/<w c5="\(.*?\)" hw="\(.*?\)" pos="\(.*?\)">\(.*?\)<\/w>/\1:\4/g' A00.xml
However, the result looks like this, which is not what I want:
<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1-VVB" hw="care" pos="SUBST">Care </w><w c5="NN1" hw="education" pos="SUBST">Education </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="NN1" hw="training" pos="SUBST">Training </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="company" pos="SUBST">company </w><w c5="VVN" hw="limit" pos="VERB">limited </w><w c5="PRP" hw="by" pos="PREP">by </w><w c5="NN1" hw="guarantee" pos="SUBST">guarantee</w><c c5="PUN">.</c></s>

It seems the replacement doesn't work.

I want output like this for every match of <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>:

NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

Second, I tried awk '/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/ {print $1,$2,$3,$4}' A00.xml. However, the result is not what I want either: it didn't print the parts within ().

How can I extract just the parts within (), which I use to mark the parts I need?

Thanks for all your suggestions.
John

Annihilannic · December 14, 2008, 6:57pm

awk doesn't use that kind of syntax to assign matches to subexpressions... you must have seen that in perl somewhere?

Your code works with only minor modifications in perl:

perl -ne '
        if (/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/) {print "$1,$2,$3,$4\n"}
' inputfile > outputfile
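For an awk-only route, here is a minimal sketch that is not from the thread. POSIX awk has no numbered capture groups, but match() sets RSTART and RLENGTH, so each <w> element can be cut out with substr() and then taken apart. The sample line and the variable names (sample, result, elem, text) are made up for illustration.

```shell
# Hypothetical sample: one BNC-style line holding several <w> elements.
sample='<s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w></s>'

result=$(printf '%s\n' "$sample" | awk '
{
    line = $0
    # match() sets RSTART and RLENGTH for the leftmost match on the line.
    while (match(line, /<w c5="[^"]*"[^>]*>[^<]*<\/w>/)) {
        elem = substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)   # continue after this element
        split(elem, q, "\"")                    # q[2] is the c5 value
        text = elem
        sub(/^[^>]*>/, "", text)                # drop the opening tag
        sub(/<\/w>$/, "", text)                 # drop the closing tag
        sub(/ +$/, "", text)                    # drop trailing spaces
        print q[2] ":" text
    }
}')
printf '%s\n' "$result"
```

On a real file, the same awk program would read A00.xml directly instead of the printf pipe.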

Johnivy · December 14, 2008, 7:49pm

From this book:
Title: Unix Power Tools, Third Edition
ISBN: 0596003307
Author: Shelley Powers / Jerry Peek / Tim O'Reilly / Mike Loukides
Publisher: O'Reilly & Associates
Pages: 1200
Edition: 3rd edition (October 1, 2002)
32.13 Regular Expressions: Remembering Patterns with \(, \), and \1
Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit. Therefore, to search for two identical letters, use \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1. [Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command, and \1, etc., on the replacement side (Section 34.11). -- JP]

-- BB
34.11 Referencing Portions of a Search String
In sed, the substitution command provides metacharacters to select any individual portion of a string that is matched and recall it in the replacement string. A pair of escaped parentheses are used in sed to enclose any part of a regular expression and save it for recall. Up to nine "saves" are permitted for a single line. \n is used to recall the portion of the match that was saved, where n is a number from 1 to 9 referencing a particular "saved" string in order of use. (Section 32.13 has more information.)

For example, when converting a plain-text document into HTML, we could convert section numbers that appear in a cross-reference into an HTML hyperlink. The following expression is broken onto two lines for printing, but you should type all of it on one line:

s/\([sS]ee \)\(Section \)\([1-9][0-9]*\)\.\([1-9][0-9]*\)/
\1<a href="#SEC-\3_\4">\2\3.\4<\/a>/
Four pairs of escaped parentheses are specified. String 1 captures the word see with an upper- or lowercase s. String 2 captures the section number (because this is a fixed string, it could have been simply retyped in the replacement string). String 3 captures the part of the section number before the decimal point, and String 4 captures the part of the section number after the decimal point. The replacement string recalls the first saved substring as \1. Next starts a link where the two parts of the section number, \3 and \4, are separated by an underscore (_) and have the string SEC- before them. Finally, the link text replays the section number again, this time with a decimal point between its parts. Note that although a dot (.) is special in the search pattern and has to be quoted with a backslash there, it's not special on the replacement side and can be typed literally. Here's the script run on a short test document, using checksed (Section 34.4):

% checksed testdoc
********** < = testdoc > = sed output **********
8c8
< See Section 1.2 for details.
---
> See <a href="#SEC-1_2">Section 1.2</a> for details.
19c19
< Be sure to see Section 23.16!
---
> Be sure to see <a href="#SEC-23_16">Section 23.16</a>!
We can use a similar technique to match parts of a line and swap them. For instance, let's say there are two parts of a line separated by a colon. We can match each part, putting them within escaped parentheses and swapping them in the replacement:

% cat test1
first:second
one:two
% sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one
The larger point is that you can recall a saved substring in any order and multiple times. If you find that you need more than nine saved matches, or would like to be able to group them into matches and submatches, take a look at Perl.

Section 43.10, Section 31.10, Section 10.9, and Section 36.23 have examples.

-- DD and JP

I tested it, and it works for a list of lines that all follow the same pattern. The problem in my situation is the first step: I fail to put each match of the regular expression <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w> on its own line, such as <w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>

My result is not clean: it contains other content outside the regular expression, such as <s n="420">.

Strangely, it works in the book's example but not in my situation.

Best

John

Annihilannic · December 15, 2008, 5:54pm

Sorry, I can't make sense of what you're saying.

However I notice you described in your original post that you wanted the output in this format:

NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS

So try this instead:

perl -ne '
        if (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/) {print "$1:$2\n"}
' inputfile > outputfile

Johnivy · December 15, 2008, 7:46pm

Thanks.

The first six paragraphs are quoted from a book that introduces how to use sed with parentheses. I don't know why it won't work in my situation.

Best
John

Annihilannic · December 15, 2008, 8:11pm

What operating system are you using? I think the .*? parts may be the problem: that non-greedy syntax comes from Perl and is not supported by sed, which uses POSIX regular expressions (GNU sed, the version found on Linux, included).

Try this, which works for me on HP-UX:

sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' inputfile > outputfile

Johnivy · December 15, 2008, 8:59pm

I am using SSH Secure Shell Client on Windows XP. The client is connected to the Unix server at our school.

Johnivy · December 15, 2008, 9:15pm

I have tried
sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' A00.xml > test.txt

as you suggested, yet the output file test.txt is still a mess. Here I attach the original file and the output file.

Is it because we haven't first put all the matches of <w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)</w> in list form?

I tried egrep too, but it didn't work.

Thanks for your discussion and instruction

Annihilannic · December 15, 2008, 9:50pm

This sounds like an assignment or homework, which is against the forum rules.

The problem is that ".*" does a greedy match, and you have multiple matches on each line of data, so you need to handle that. Try something like \"\([^"]*\)\" instead of "\(.*\)" to limit the match to the contents of the speech marks. [^"]* means any number of characters excluding ".

You will need some additional search and replaces to remove the <s ...> and <c ...> </c> tags, but I'll leave that as an exercise for you.
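To see the greedy behaviour side by side with the bounded version, here is a small sketch. The two-word sample line and the variable names (line, greedy, bounded) are invented for illustration; only the sed patterns follow the ones discussed above.

```shell
# Hypothetical sample: two <w> elements on the same line.
line='<w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="virus" pos="SUBST">virus</w>'

# Greedy: the first \(.*\) runs on to the last place the rest can still match,
# so it swallows most of the first element and the start of the second.
greedy=$(printf '%s\n' "$line" |
    sed 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/')

# Bounded: [^"]* cannot cross a quote and [^<]* cannot cross a tag,
# so each element is matched on its own and /g handles them one by one.
bounded=$(printf '%s\n' "$line" |
    sed 's/<w c5="\([^"]*\)" hw="[^"]*" pos="[^"]*">\([^<]*\)<\/w>/\1:\2/g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The greedy result still contains attribute text such as hw="a", while the bounded result is just the tag:word pairs, separated here only by the space that sits inside the first <w> element.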

Johnivy · December 16, 2008, 5:30am

Thanks for your reply.

First, this is not homework or an assignment. I am doing research in corpus linguistics and trying to find an effective way to collect data. Remembering items with parentheses is seldom covered in the examples I have found.

I am afraid your version sed -n 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/gp' inputfile > outputfile doesn't work.

First, with /gp, the items in my results are repeated or doubled.

Second, I tried sed 's/<w c5="\(.*\)" hw="\(.*\)" pos="\(.*\)">\(.*\)<\/w>/\1:\4/' test2.txt, and that works. The content of test2.txt is:
<w c5="VBZ" hw="be" pos="VERB">is</w>
<w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="condition" pos="SUBST">condition</w>
<w c5="VVN" hw="cause" pos="VERB">caused</w>
<w c5="PRP" hw="by" pos="PREP">by</w>
<w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="virus" pos="SUBST">virus</w>
<w c5="VVN" hw="call" pos="VERB">called</w>
<w c5="NP0" hw="hiv" pos="SUBST">HIV</w>
Then the result is

VBZ:is
AT0:a
NN1:condition
VVN:caused
PRP:by
AT0:a
NN1:virus
VVN:called
NP0:HIV

That means sed only works for word lists like the one above, with one element per line. Moreover, if we "need some additional search and replaces to remove the <s ...> and <c ...> </c> tags", this may not be the best way.

I don't know why it won't work for my whole file A00.xml.

Best
John

Johnivy · December 16, 2008, 5:45am

I first tried to collect the content matching <w c5=".*" hw=".*" pos=".*?">.*</w> in A00.xml.

I used the following command:

egrep '<w c5=".*" hw=".*" pos=".*?">.*</w>' A00.xml

The result has two problems:

First, there is an unexpected part, <s n=...>.

Second, the matches are not in list form like this:
<w c5="PNP" hw="we" pos="PRON">We </w>
<w c5="VVB" hw="make" pos="VERB">make </w>
<w c5="AT0" hw="the" pos="ART">the </w>
<w c5="DT0" hw="most" pos="ADJ">most </w>
<w c5="PRF" hw="of" pos="PREP">of </w>
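One way to get that list form is to have grep print only the matching parts, one per line, rather than the whole matching line. This is a sketch under an assumption: it relies on the -o option, which GNU and BSD grep provide but which is not in POSIX, so it may be missing on an older Unix server. The sample line and variable names are made up.

```shell
# Hypothetical sample: a sentence line holding two <w> elements.
line='<s n="3"><w c5="PNP" hw="we" pos="PRON">We </w><w c5="VVB" hw="make" pos="VERB">make </w></s>'

# -o prints each match on its own line; [^>]* and [^<]* keep every match
# inside a single <w>...</w> element.
listed=$(printf '%s\n' "$line" | grep -o '<w [^>]*>[^<]*</w>')
printf '%s\n' "$listed"
```

Against the real file the command would be grep -o '<w [^>]*>[^<]*</w>' A00.xml, and its output could then be fed to the one-element-per-line sed command.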

Annihilannic · December 16, 2008, 5:46pm

I'm glad it's not homework, I just thought I should check because we get a lot of posts like that here.

You don't seem to have tried what I suggested in my previous post to prevent greedy matching?

Is there any particular reason why you want to use sed? This perl one-liner seems to do what you require, as I understand it anyway:

perl -ne 'while (/<w c5="(.*?)" hw=".*?" pos=".*?">(.*?)<\/w>/g) {print "$1:$2\n"}' A00.xml > outputfile

Johnivy · December 16, 2008, 7:17pm

First, we have a Unix system installed on a server, and we have many XML files as big as 4 GB to process, so I think the server can process them much faster than my desktop computer. Second, I have not learned Perl before, and I am afraid it will take too much of my time to learn a new scripting language. Third, I tried other software such as PowerGREP and TextPipe, but they cost money after the evaluation period. As far as I understand, they offer similar functions for extracting data with regular expressions. So I want to make full use of the Unix tools sed, awk, and grep to do what those programs do.

Annihilannic · December 16, 2008, 7:33pm

I don't know why you're telling me about your Unix system... I assumed you were doing this on Unix anyway? perl is widely found on Unix systems, and is more efficient at processing large amounts of data, so I would say it is ideal for your purposes (and very useful to learn!).

sed, awk and grep can also be used equally well for your task; I've given you some tips which you don't appear to have tried yet... so I'll wait until you give them a go. Let me know if you get stuck and have any specific questions.
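As one possible shape for a sed-only solution (a sketch, not something tested on the real A00.xml): split the input at every "<" with tr so that each tag starts its own line, then let sed keep only the lines that begin with a w element. The sample line and the variable names are invented for illustration.

```shell
# Hypothetical sample modelled on the one-sentence-per-line layout shown earlier.
sample='<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VBZ" hw="be" pos="VERB">is </w><c c5="PUN">.</c></s>'

pairs=$(printf '%s\n' "$sample" |
    # Put every tag on its own line; the "<" itself is consumed by tr.
    tr '<' '\n' |
    # Keep only lines that start with a w element, capture the c5 value and
    # the word, and drop the trailing spaces that follow the word.
    sed -n 's/^w c5="\([^"]*\)"[^>]*>\(.*[^ ]\) *$/\1:\2/p')
printf '%s\n' "$pairs"
```

Because the <s>, <c> and closing tags end up on lines that do not start with "w ", they are never printed, so no separate clean-up pass is needed. Both tr and sed work as a stream, which should matter for files of several gigabytes.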

fpmurphy · December 16, 2008, 10:04pm

$ grep "^<w " file | sed 's/\(<w c5="\)\(...\)\(.*>\)\(.*\)\(<\/w>\)/\2 \4/g'
NN1 FACTSHEET
DTQ WHAT
VBZ IS
NN1 AIDS
NN1 AIDS
VVN Acquired
AJ0 Immune
NN1 Deficiency
NN1 Syndrome
VBZ is
AT0 a
NN1 condition
VVN caused
PRP by
AT0 a