I have data in the following format:
<change date="2000-01-09" who="#OUCS">Updated all catrefs</change>
<change date="2000-01-08" who="#OUCS">Manually updated tagcounts, titlestmt, and title in source</change>
<change date="1999-09-13" who="#UCREL">POS codes revised for BNC-2; header updated</change>
<change date="1994-11-24" who="#dominic">Initial accession to corpus</change>
</revisionDesc>
</teiHeader>
<wtext type="NONAC">
<div level="1" n="1" type="leaflet">
<head type="MAIN">
<s n="1">
<w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
<w c5="DTQ" hw="what" pos="PRON">WHAT</w>
<w c5="VBZ" hw="be" pos="VERB">IS</w>
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
<c c5="PUN">?</c>
</s>
</head>
<p>
<s n="2">
<hi rend="bo">
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
<c c5="PUL">(</c>
<w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>
<w c5="AJ0" hw="immune" pos="ADJ">Immune</w>
<w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
<w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w>
<c c5="PUR">)</c>
</hi>
<w c5="VBZ" hw="be" pos="VERB">is</w>
<w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="condition" pos="SUBST">condition</w>
<w c5="VVN" hw="cause" pos="VERB">caused</w>
<w c5="PRP" hw="by" pos="PREP">by</w>
<w c5="AT0" hw="a" pos="ART">a</w>
I want to extract the parts matched by the groups in a pattern like
<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>.
First, I tried the following command: sed 's/<w c5="\(.*?\)" hw="\(.*?\)" pos="\(.*?\)">\(.*?\)<\/w>/\1:\4/g' A00.xml.
However, the output looks like this, which is not what I want:
<s n="420"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1-VVB" hw="care" pos="SUBST">Care </w><w c5="NN1" hw="education" pos="SUBST">Education </w><w c5="CJC" hw="and" pos="CONJ">and </w><w c5="NN1" hw="training" pos="SUBST">Training </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="AT0" hw="a" pos="ART">a </w><w c5="NN1" hw="company" pos="SUBST">company </w><w c5="VVN" hw="limit" pos="VERB">limited </w><w c5="PRP" hw="by" pos="PREP">by </w><w c5="NN1" hw="guarantee" pos="SUBST">guarantee</w><c c5="PUN">.</c></s>
It seems the replacement is not applied at all.
I want output like the following for every match of <w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)</w>:
NN1:FACTSHEET
DTQ:WHAT
VBZ:IS
NN1:AIDS
Second, I tried awk '/<w c5="(.*?)" hw="(.*?)" pos="(.*?)">(.*?)<\/w>/ {print $1,$2,$3,$4}' A00.xml. However, the result is not what I want either: it does not print the parts inside the parentheses.
How can I extract just the parts inside the parentheses, which mark the pieces I need?
Thanks for any suggestions,
John
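
One possible approach, offered as a minimal sketch rather than a definitive answer: in sed's basic regular expressions `?` is a literal character and there is no non-greedy `.*?`, so the pattern above never matches and the lines pass through unchanged. Negated character classes such as `[^"]*` and `[^<]*` can stand in for the non-greedy groups. The sketch below assumes a grep that supports `-o` (GNU and BSD grep do) and that no `<w>` element contains nested markup; the helper name `extract_pairs` and the sample line are made up for illustration.

```shell
#!/bin/sh
# Sample input (hypothetical): several <w> elements on one line, as in A00.xml.
sample='<s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET </w><w c5="DTQ" hw="what" pos="PRON">WHAT </w><w c5="VBZ" hw="be" pos="VERB">IS </w><w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s>'

# extract_pairs (hypothetical helper): reads XML on stdin and prints
# one "c5:text" line per <w> element.
extract_pairs() {
    # 1. grep -o puts each <w ...>...</w> element on its own line.
    # 2. The first sed expression keeps the c5 value and the element text;
    #    [^"]* and [^<]* replace the unsupported non-greedy .*? groups.
    # 3. The second sed expression trims trailing spaces from the word.
    grep -o '<w [^>]*>[^<]*</w>' |
        sed -e 's/<w c5="\([^"]*\)"[^>]*>\([^<]*\)<\/w>/\1:\2/' -e 's/ *$//'
}

result=$(printf '%s\n' "$sample" | extract_pairs)
printf '%s\n' "$result"
```

On the real file this would be run as `extract_pairs < A00.xml`. The awk attempt fails for a different reason: `$1`, `$2`, ... are whitespace-separated fields of the input line, not regex capture groups, which is why splitting the elements with grep first and rewriting them with sed is used here instead.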