Xml to csv (again)

palex · May 22, 2019, 2:49pm

Hello,
I have copied the XML for a single item below. I am trying to extract three fields (b244 (second occurrence), b203, and j151), so the desired output would be:

9780323013543	Manual of Natural Veterinary Medicine: Science and Tradition, 1e	68.95

A solution along the lines of the one I received for a similar earlier question would be ideal. It would have the following form, but this version is not working:

awk '/b244|b203|j151/ {L=/a001/; gsub (/ *<[^>]*> */, _); printf "%s%s", L?TRS:"\t", $0; TRS=ORS} END {printf RS}' file

Here is the XML data:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/03/short/onix-international.dtd">
<ONIXmessage>
	<header>
		<m174>Elsevier Health Sciences</m174>
		<m175>Tony Cardinale</m175>
		<m182>20190405</m182>
	</header>
	<product>
		<a001>1084320</a001>
		<a002>03</a002>
		<productidentifier>
			<b221>02</b221>
			<b244>0323013546</b244>
		</productidentifier>
		<productidentifier>
			<b221>03</b221>
			<b244>9780323013543</b244>
		</productidentifier>
		<productidentifier>
			<b221>15</b221>
			<b244>9780323013543</b244>
		</productidentifier>
		<b246>03</b246>
		<b012>BC</b012>
		<title>
			<b202>01</b202>
			<b203>Manual of Natural Veterinary Medicine: Science and Tradition, 1e</b203>
			<b029>Science and Tradition</b029>
		</title>
		<contributor>
			<b035>A01</b035>
			<b039>Susan G.</b039>
			<b040>Wynn</b040>
			<b042>DVM</b042>
			<professionalaffiliation>
				<b046>Wynn Clinic for Therapeutic Alternatives, Marietta, GA</b046>
			</professionalaffiliation>
		</contributor>
		<contributor>
			<b035>A01</b035>
			<b039>Steve</b039>
			<b040>Marsden</b040>
			<b042>DVM, ND, MSOM, LAc, Dipl C.H.</b042>
			<professionalaffiliation>
				<b046>Co-founder, Edmonton Holistic Veterinary Clinic, The Natural Path Clinic, Edmonton, Alberta; Instructor, American Association of Veterinary Acupuncture, International Veterinary Acupuncture Society, and Academy of Veterinary Acupuncturists of Canada</b046>
			</professionalaffiliation>
		</contributor>
		<language>
			<b253>01</b253>
			<b252>eng</b252>
		</language>
		<b061>768</b061>
		<b064>MED089000</b064>
		<subject>
			<b067>24</b067>
			<b171>Landing</b171>
			<b069>5</b069>
			<b070>Veterinary Medicine</b070>
		</subject>
		<subject>
			<b067>24</b067>
			<b171>MSC</b171>
			<b069>079</b069>
			<b070>Veterinary - General</b070>
		</subject>
		<b073>06</b073>
		<othertext>
			<d102>01</d102>
			<d104>This handy reference provides users with an understanding of complementary and alternative treatment options for more than 130 common disease states. A practical manual, it describes a variety of possible approaches to small animal disorders. Concentrating on nutrition, herbs, traditional Chinese medicines, and physical therapies, the authors present both tradition- and evidence-based therapies for disorders not always responsive to conventional therapies. Each monograph-style discussion of natural therapies for disorders common to specific body systems presents therapeutic rationales with the goals of treatment, alternative therapies with conventional bases, paradigmatic options, and authors' suggestions from which they've experienced success. Key references are also included at the conclusion of each chapter.<ul><li>Presents new and alternative therapies with scientific support, encouraging veterinarians explore new therapies with confidence.</li><li>Helps veterinarians develop treatment plans - a vast improvement over large texts that simply introduce the therapies.</li><li>Clearly explains esoteric concepts of traditional Chinese medicine in updated language.</li><li>Practical, user-friendly pocket manual format allows for quick access in the clinical setting.</li><li>Chapters are organized logically by body system and disorders are alphabetized within each chapter.</li><li>Each body system chapter includes a case report that describes the history, physical examination, assessment, treatment, and outcome of a specific patient to further illustrate how to develop a treatment plan.</li><li>Each appendix offers practical backup for designing treatment plans, from homemade diets and Chinese food therapy to oral herb doses and a valuable herb cross-reference table.</li></ul></d104>
		</othertext>
		<othertext>
			<d102>04</d102>
			<d104>Part One	Fundamentals of Chinese Medicine<br>1. The meat and potatoes of Chinese medicine: the cooking pot analogy<br>2. Chinese medicine as a basis for an alternative medical approach<br>Part Two Clinical Strategies by Organ System<br>3. Therapies for behavior disorders<br>4. Therapies for cardiovascular disorders<br>5. Therapies for dermatologic disorders<br>6. Therapies for digestive disorders<br>7. Therapies for ear diseases<br>8. Therapies for endocrine disorders<br>9. Therapies for hematologic and immunologic disorders<br>10. Therapies for infectious diseases<br>11. Therapies for liver diseases <br>12. Therapies for musculoskeletal disorders<br>13. Therapies for neoplastic disorders<br>14. Therapies for neurologic disorders<br>15. Therapies for ophthalmologic disorders<br>16. Therapies for respiratory disorders<br>17. Therapies for reproductive disorders<br>18. Therapies for urologic disorders<br>Appendix A 	Guidelines for Homemade Diets<br>Appendix B 	Chinese Food Therapy <br>Appendix C 	Suggested Oral Herb Doses<br>Appendix D 	Chinese Herb Cross Reference Table<br>Appendix E 	Acupuncture Points</d104>
		</othertext>
		<imprint>
			<b241>02</b241>
			<b243>Mosby17</b243>
			<b079>Mosby</b079>
		</imprint>
		<b081>Elsevier Health Sciences</b081>
		<b394>04</b394>
		<b003>20021015</b003>
		<copyrightstatement>
			<b087>2003</b087>
			<copyrightowner>
				<copyrightowneridentifier>
					<b392>02</b392>
					<b244>Elsevier Health Sciences</b244>
				</copyrightowneridentifier>
			</copyrightowner>
		</copyrightstatement>
		<salesrights>
			<b089>01</b089>
			<b388>WORLD</b388>
		</salesrights>
		<measure>
			<c093>01</c093>
			<c094>7.32</c094>
			<c095>in</c095>
		</measure>
		<measure>
			<c093>02</c093>
			<c094>4.84</c094>
			<c095>in</c095>
		</measure>
		<relatedproduct>
			<h208>06</h208>
			<productidentifier>
				<b221>03</b221>
				<b244>9780323070096</b244>
			</productidentifier>
		</relatedproduct>
		<relatedproduct>
			<h208>12</h208>
			<productidentifier>
				<b221>02</b221>
				<b244>0323029981</b244>
			</productidentifier>
			<productidentifier>
				<b221>03</b221>
				<b244>9780323029988</b244>
			</productidentifier>
		</relatedproduct>
		<relatedproduct>
			<h208>12</h208>
			<productidentifier>
				<b221>03</b221>
				<b244>9780702047435</b244>
			</productidentifier>
		</relatedproduct>
		<supplydetail>
			<j137>Elsevier Health Sciences</j137>
			<j397>WORLD</j397>
			<j141>MD</j141>
			<j396>23</j396>
			<j145>20</j145>
			<price>
				<j148>01</j148>
				<discountcoded>
					<j363>02</j363>
					<j378>Elsevier Health Sciences</j378>
					<j364>REF</j364>
				</discountcoded>
				<j151>68.95</j151>
				<j152>USD</j152>
			</price>
		</supplydetail>
	</product>

nezabudka · May 22, 2019, 4:56pm

Hi, I don't understand what is happening, I just tried to correct some mistakes.

awk '/b244|b203|j151/ {L=/a001/; gsub (/ *<[^>]*> */, _); printf "%s%s", L? rs:"\t", $0; rs=RS } END {printf RS}' file

--- Post updated at 23:22 ---

awk -F"<[^>]*>" '/b244|b203|j151/ {print $2}' file

--- Post updated at 23:36 ---

Next, I just need an explanation of the principle by which the line with the <b244> tag should be selected.
Temporarily remove the field separator option -F to see every matching line:

awk '/b244|b203|j151/ {print $0}' file.xml
			<b244>0323013546</b244>
			<b244>9780323013543</b244>
			<b244>9780323013543</b244>
			<b203>Manual of Natural Veterinary Medicine: Science and Tradition, 1e</b203>
					<b244>Elsevier Health Sciences</b244>
				<b244>9780323070096</b244>
				<b244>0323029981</b244>
				<b244>9780323029988</b244>
				<b244>9780702047435</b244>
				<j151>68.95</j151>

--- Post updated at 23:56 ---

@palex, sorry, I was not attentive

awk -F"<[^>]*>" '/b244/ {cn++; if(cn == 2) printf $2}; /b203|j151/ {printf "\t" $2} END {print ""}' file
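
Counting occurrences works for the sample, but the listing above shows that b244 also appears under copyrightowneridentifier and relatedproduct. A minimal sketch of a position-independent alternative follows. It assumes that the b221 code 03 marks the wanted 13-digit identifier, as it does in the sample, and that every element sits on its own line. The function name isbn13_by_type is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each b244 whose preceding b221 type code is 03,
# ignoring identifiers nested inside <relatedproduct>.
isbn13_by_type() {
    awk -F'<[^>]*>' '
        /<relatedproduct>/      { skip = 1 }     # entering a related-product block
        /<\/relatedproduct>/    { skip = 0 }     # leaving it again
        /<b221>/                { type = $2 }    # remember the latest type code
        /<\/productidentifier>/ { type = "" }    # the code is only valid inside its block
        /<b244>/ && !skip && type == "03" { print $2 }
    ' "$@"
}

# Trimmed sample built from lines that appear in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<product>
  <productidentifier>
    <b221>02</b221>
    <b244>0323013546</b244>
  </productidentifier>
  <productidentifier>
    <b221>03</b221>
    <b244>9780323013543</b244>
  </productidentifier>
  <productidentifier>
    <b221>15</b221>
    <b244>9780323013543</b244>
  </productidentifier>
  <copyrightowneridentifier>
    <b392>02</b392>
    <b244>Elsevier Health Sciences</b244>
  </copyrightowneridentifier>
  <relatedproduct>
    <productidentifier>
      <b221>03</b221>
      <b244>9780323070096</b244>
    </productidentifier>
  </relatedproduct>
</product>
EOF

result=$(isbn13_by_type "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Resetting the remembered code at each closing productidentifier tag keeps the copyright owner's b244, which is preceded by b392 rather than b221, from being picked up by accident.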

nezabudka · May 23, 2019, 1:51am

sed -n '/b244/! b; x;s/$/\n/; /^\n\n$/! {x;b}; :1; ${x;s/[ \t]*<[^>]*>//g;s/\n/\t/gp}; n; /b203\|j151/ H;b1' file

joker · May 23, 2019, 4:36am

Hi,

Here's a solution with xmlstarlet. It is not shorter, but it takes a more data-structure-oriented approach, which I find a lot easier to understand:

xmlstarlet sel -t -m "//product" \
    -v "concat(current()/productidentifier[2]/b244,' ',current()/title/b203,' ',current()/supplydetail/price/j151)" -n  data.xml

Output:

9780323013543 Manual of Natural Veterinary Medicine: Science and Tradition, 1e 68.95

Note: I added the closing XML tag "</ONIXmessage>" at the end. Without it, my solution won't work. I assume it is present in the original file, even though your sample data does not contain it.

Hmm. How do I get a tab character in the command?

The following works only within bash:

xmlstarlet sel -t -m "//product"     \
   -v "concat(current()/productidentifier[2]/b244,'"$'\t'"',current()/title/b203,'"$'\t'"',current()/supplydetail/price/j151)" -n  data.xml

For a POSIX shell, I know no other way than to store the tab in a variable first (printf is used here because echo -ne is not portable):

tab="$(printf '\t')"
xmlstarlet sel -t -m "//product"     \
   -v "concat(current()/productidentifier[2]/b244,'$tab',current()/title/b203,'$tab',current()/supplydetail/price/j151)" -n  data.xml

As an alternative, you can type a literal TAB at the command line. On my Linux box, I get a literal tab on the command line by pressing Ctrl + V and then TAB.

palex · May 23, 2019, 9:20pm

That is great.... thanks so much! The only issue is that each line of the output file runs together. How can I add a line separator?

joker · May 23, 2019, 10:37pm

The -n should be responsible for inserting newlines.

palex · May 23, 2019, 10:45pm

My apologies, I was referring to the previous solution:

awk -F"<[^>]*>" '/b244/ {cn++; if(cn == 2) printf $2}; /b203|j151/ {printf "\t" $2} END {print ""}' file

How can I separate the output lines?

Thanks again!

nezabudka · May 24, 2019, 12:56am

awk -F"<[^>]*>" '/b244/ {cn++; if(cn == 2) print $2}; /b203|j151/ {print $2}' file

--- Post updated at 07:56 ---

sed -n '/b203\|j151/ b1; /b244/! b; x;s/$/\n/; /^\n\n$/! {x;b};x; :1;s/[ \t]*<[^>]*>//gp' file

bash 4.4 x86_64-redhat-linux-gnu

palex · May 24, 2019, 1:00am

The revised awk code now gives a line break after each field, instead of after each line. The sed code is giving a blank output file.

nezabudka · May 24, 2019, 9:19am

I can't even imagine why. Let's start with the basics:

echo $SHELL $HOSTTYPE

palex · May 24, 2019, 9:23am

bash-3.2$ echo $SHELL $HOSTTYPE
/bin/bash x86_64

nezabudka · May 24, 2019, 9:27am

I found in your posts

I'm novice here

palex · May 24, 2019, 1:35pm

Thanks so much for your help. Here is the output I have been getting:

bash-3.2$ awk -F"<[^>]*>" '/b244/ {cn++; if(cn ==2) print $2}; /b203|j151/ {print $2}' hsel*.xml | head 
9780323013543
Manual of Natural Veterinary Medicine: Science and Tradition, 1e
68.95
Nursing Care of the Critically Ill Child, 3e
97.95
Practical Diagnostic Imaging for the Veterinary Technician, 3e
82.95
Core Curriculum for Primary Care Pediatric Nurse Practitioners, 1e
90.95
Core Review for Primary Care Pediatric Nurse Practitioners, 1e

Desired output:

9780323013543	Manual of Natural Veterinary Medicine: Science and Tradition, 1e	68.95
9780323020404	Nursing Care of the Critically Ill Child, 3e	97.95
9780323025751	Practical Diagnostic Imaging for the Veterinary Technician, 3e	82.95
9780323027564	Core Curriculum for Primary Care Pediatric Nurse Practitioners, 1e	90.95

nezabudka · May 24, 2019, 2:48pm

Now I understand.

awk -F"<[^>]*>" '/b244/ {cn++; if(cn == 2) printf $2}; /b203/ {printf "\t" $2}; /j151/ {cn=0; print "\t" $2}' file
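
The one-liner above passes the field itself to printf as the format string, so a title containing a % character could be mangled. Below is a hedged sketch of the same idea that passes the values as printf arguments and resets its state at each opening product tag. It assumes one element per line and exactly one j151 per product, appearing after b244 and b203 as in the sample. The function name extract_rows and the second product in the sample data are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: one tab-separated row (2nd b244, b203, j151) per product.
extract_rows() {
    awk -F'<[^>]*>' '
        /<product>/ { n = 0; isbn = ""; title = "" }   # new record: reset state
        /<b244>/    { if (++n == 2) isbn = $2 }        # keep only the second b244
        /<b203>/    { title = $2 }
        /<j151>/    { printf "%s\t%s\t%s\n", isbn, title, $2 }  # price closes the row
    ' "$@"
}

# First product is trimmed from the thread; the second is invented test data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<product>
  <b244>0323013546</b244>
  <b244>9780323013543</b244>
  <b203>Manual of Natural Veterinary Medicine: Science and Tradition, 1e</b203>
  <j151>68.95</j151>
</product>
<product>
  <b244>0000000000</b244>
  <b244>9780000000002</b244>
  <b203>100% Hypothetical Title, 1e</b203>
  <j151>10.00</j151>
</product>
EOF

rows=$(extract_rows "$sample")
printf '%s\n' "$rows"
rm -f "$sample"
```

If a product carried several prices, this sketch would print one row per j151, so real feeds may need an extra condition on the price type.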

--- Post updated at 21:48 ---

sed -n '
/b244/! b; x;s/$/\n/
/^\n\n$/! {x;b}
:1;n; /b203/ H
/j151/! b1; H;s/.*//;x;s/[ \t]*<[^>]*>//g;s/\n/\t/gp
' file

The sed version may work only on a little-endian architecture (I don't know; that is just my guess).

palex · May 24, 2019, 2:49pm

The awk line works perfectly now. Thank you so much for your help!!!