Extracting data from an .xml file

Hello,
I have copied .xml code for a single item below. I am trying to extract four items (RecordReference, TitleText, PriceAmount, and DiscountCode), so the desired output would be:

Desired output:

9780026106504 THE MACMILLAN DICT OF POLITICAL QUOT 93 42.97 K04

There would be multiple items concatenated in the .xml data file.

Thanks!!

<!DOCTYPE ONIXMessage SYSTEM
"http://www.editeur.org/onix/2.1/reference/onix-international.dtd">
<ONIXMessage>
<Header>
<FromSAN>9999999</FromSAN>
<FromCompany>Savvas Learning Company</FromCompany>
<FromPerson>SAVVAS ONIX Support</FromPerson>
<FromEmail>savvas_PT-USDatamart-K12-Support@savvas.com</FromEmail>
<SentDate>20240622</SentDate>
<MessageNote>Savvas_ONIX_K12_20240622.xml</MessageNote>
<DefaultLanguageOfText>eng</DefaultLanguageOfText>
<DefaultPriceTypeCode>01</DefaultPriceTypeCode>
<DefaultCurrencyCode>USD</DefaultCurrencyCode>
<DefaultClassOfTrade>GEN</DefaultClassOfTrade>
</Header>
<Product>
<RecordReference>9780026106504</RecordReference>
<NotificationType>03</NotificationType>
<RecordSourceType>04</RecordSourceType>
<ProductIdentifier>
<ProductIDType>15</ProductIDType>
<IDTypeName>ISBN-13</IDTypeName>
<IDValue>9780026106504</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>14</ProductIDType>
<IDTypeName>GTIN-14</IDTypeName>
<IDValue>XXXXXXXXXXXXXX</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>02</ProductIDType>
<IDTypeName>ISBN-10</IDTypeName>
<IDValue>0026106507</IDValue>
</ProductIdentifier>
<ProductForm>BB</ProductForm>
<NoSeries/>
<Title>
<TitleType>01</TitleType>
<TitleText>THE MACMILLAN DICT OF POLITICAL QUOT 93</TitleText>
</Title>
<Title>
<TitleType>05</TitleType>
<TitleText>THE MACMILLAN DICT OF POLITICAL QUOT 93</TitleText>
</Title>
<Contributor>
<SequenceNumber>1</SequenceNumber>
<ContributorRole>A01</ContributorRole>
</Contributor>
<NoEdition/>
<Language>
<LanguageRole>01</LanguageRole>
<LanguageCode>eng</LanguageCode></Language>
<BASICMainSubject>EDU025000</BASICMainSubject>
<AudienceRange>
<AudienceRangeQualifier>11</AudienceRangeQualifier>
<AudienceRangePrecision>03</AudienceRangePrecision>
<AudienceRangeValue>06</AudienceRangeValue>
<AudienceRangePrecision>04</AudienceRangePrecision>
<AudienceRangeValue>12</AudienceRangeValue>
</AudienceRange>
<Imprint>
<NameCodeType>02</NameCodeType>
<NameCodeValue>927</NameCodeValue>
</Imprint>
<Publisher>
<PublishingRole>01</PublishingRole>
<PublisherName>Savvas</PublisherName>
</Publisher>
<PublishingStatus>07</PublishingStatus>
<PublicationDate>19930610</PublicationDate>
<Measure>
<MeasureTypeCode>01</MeasureTypeCode>
<Measurement>9.625</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>02</MeasureTypeCode>
<Measurement>8</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>03</MeasureTypeCode>
<Measurement>1.75</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>08</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>lb</MeasureUnitCode>
</Measure>
<OutOfPrintDate>19980610</OutOfPrintDate>
<SupplyDetail>
<SupplierName>Savvas</SupplierName>
<AvailabilityCode>OP </AvailabilityCode>
<PackQuantity>10</PackQuantity>
<Price>
<PriceTypeCode>05</PriceTypeCode>
<DiscountCoded>
<DiscountCodeType>02</DiscountCodeType>
<DiscountCode>K04</DiscountCode>
</DiscountCoded>
<PriceAmount>42.97</PriceAmount>
<CurrencyCode>USD</CurrencyCode>
</Price>
</SupplyDetail>
</Product>
</ONIXMessage>

Your post was not quite valid XML: the closing tag at the end of the data was missing, so I've added it.

Try the following (I use a comma to separate the fields; adjust as needed):

xmllint --xpath 'concat(//RecordReference, ",", //Title[TitleType="01"]/TitleText, ",", //PriceAmount, ",", //DiscountCode)' my.xml
9780026106504,THE MACMILLAN DICT OF POLITICAL QUOT 93,42.97,K04

NB: You've given a single item, hence a single result. This has not been tested against input
with many entries.


Thanks so much. It works great. Do you know how it could be modified for multiple entries in a long .xml file? Each line output would correspond to an item in the .xml file from <Product> to </Product>

@palex,

  • Give a worked example: include the input and the expected result(s).
  • Try some experimenting, using the example provided as a starting point.

tks

I've expanded the input file below to include two items. Running the xmllint command processes only the first item. I'm guessing that there may be a second option to use in addition to --xpath, or perhaps something to add at the end of the command to make it continue until the end of the file, though I haven't been able to identify it through searches.

bash-3.2$ xmllint --xpath 'concat(//RecordReference, ",", //Title[TitleType="01"]/TitleText, ",",//PriceAmount, ",",//DiscountCode)' z.xml 
9780026106504,THE MACMILLAN DICT OF POLITICAL QUOT 93,42.97,K04 

The desired output (for the .xml file below) would be:

9780026106504,THE MACMILLAN DICT OF POLITICAL QUOT 93,42.97,K04
9780028603087,HOW TO WRITE A RESEARCH PAPER 95C,8.47,K04

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXMessage SYSTEM
"http://www.editeur.org/onix/2.1/reference/onix-international.dtd">
<ONIXMessage>
<Header>
<FromSAN>9999999</FromSAN>
<FromCompany>Savvas Learning Company</FromCompany>
<FromPerson>SAVVAS ONIX Support</FromPerson>
<FromEmail>savvas_PT-USDatamart-K12-Support@savvas.com</FromEmail>
<SentDate>20240622</SentDate>
<MessageNote>Savvas_ONIX_K12_20240622.xml</MessageNote>
<DefaultLanguageOfText>eng</DefaultLanguageOfText>
<DefaultPriceTypeCode>01</DefaultPriceTypeCode>
<DefaultCurrencyCode>USD</DefaultCurrencyCode>
<DefaultClassOfTrade>GEN</DefaultClassOfTrade>
</Header>
<Product>
<RecordReference>9780026106504</RecordReference>
<NotificationType>03</NotificationType>
<RecordSourceType>04</RecordSourceType>
<ProductIdentifier>
<ProductIDType>15</ProductIDType>
<IDTypeName>ISBN-13</IDTypeName>
<IDValue>9780026106504</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>14</ProductIDType>
<IDTypeName>GTIN-14</IDTypeName>
<IDValue>XXXXXXXXXXXXXX</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>02</ProductIDType>
<IDTypeName>ISBN-10</IDTypeName>
<IDValue>0026106507</IDValue>
</ProductIdentifier>
<ProductForm>BB</ProductForm>
<NoSeries/>
<Title>
<TitleType>01</TitleType>
<TitleText>THE MACMILLAN DICT OF POLITICAL QUOT 93</TitleText>
</Title>
<Title>
<TitleType>05</TitleType>
<TitleText>THE MACMILLAN DICT OF POLITICAL QUOT 93</TitleText>
</Title>
<Contributor>
<SequenceNumber>1</SequenceNumber>
<ContributorRole>A01</ContributorRole>
</Contributor>
<NoEdition/>
<Language>
<LanguageRole>01</LanguageRole>
<LanguageCode>eng</LanguageCode>
</Language>
<BASICMainSubject>EDU025000</BASICMainSubject>
<AudienceRange>
<AudienceRangeQualifier>11</AudienceRangeQualifier>
<AudienceRangePrecision>03</AudienceRangePrecision>
<AudienceRangeValue>06</AudienceRangeValue>
<AudienceRangePrecision>04</AudienceRangePrecision>
<AudienceRangeValue>12</AudienceRangeValue>
</AudienceRange>
<Imprint>
<NameCodeType>02</NameCodeType>
<NameCodeValue>927</NameCodeValue>
</Imprint>
<Publisher>
<PublishingRole>01</PublishingRole>
<PublisherName>Savvas</PublisherName>
</Publisher>
<PublishingStatus>07</PublishingStatus>
<PublicationDate>19930610</PublicationDate>
<Measure>
<MeasureTypeCode>01</MeasureTypeCode>
<Measurement>9.625</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>02</MeasureTypeCode>
<Measurement>8</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>03</MeasureTypeCode>
<Measurement>1.75</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>08</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>lb</MeasureUnitCode>
</Measure>
<OutOfPrintDate>19980610</OutOfPrintDate>
<SupplyDetail>
<SupplierName>Savvas</SupplierName>
<AvailabilityCode>OP </AvailabilityCode>
<PackQuantity>10</PackQuantity>
<Price>
<PriceTypeCode>05</PriceTypeCode>
<DiscountCoded>
<DiscountCodeType>02</DiscountCodeType>
<DiscountCode>K04</DiscountCode>
</DiscountCoded>
<PriceAmount>42.97</PriceAmount>
<CurrencyCode>USD</CurrencyCode>
</Price>
</SupplyDetail>
</Product>
<Product>
<RecordReference>9780028603087</RecordReference>
<NotificationType>03</NotificationType>
<RecordSourceType>04</RecordSourceType>
<ProductIdentifier>
<ProductIDType>15</ProductIDType>
<IDTypeName>ISBN-13</IDTypeName>
<IDValue>9780028603087</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>14</ProductIDType>
<IDTypeName>GTIN-14</IDTypeName>
<IDValue>XXXXXXXXXXXXXX</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>02</ProductIDType>
<IDTypeName>ISBN-10</IDTypeName>
<IDValue>0028603087</IDValue>
</ProductIdentifier>
<ProductForm>BB</ProductForm>
<NoSeries/>
<Title>
<TitleType>01</TitleType>
<TitleText>HOW TO WRITE A RESEARCH PAPER 95C</TitleText>
</Title>
<Title>
<TitleType>05</TitleType>
<TitleText>HOW TO WRITE A RESEARCH PAPER 95C</TitleText>
</Title>
<Contributor>
<SequenceNumber>1</SequenceNumber>
<ContributorRole>A01</ContributorRole>
</Contributor>
<NoEdition/>
<Language>
<LanguageRole>01</LanguageRole>
<LanguageCode>eng</LanguageCode>
</Language>
<BASICMainSubject>EDU025000</BASICMainSubject>
<AudienceRange>
<AudienceRangeQualifier>11</AudienceRangeQualifier>
<AudienceRangePrecision>03</AudienceRangePrecision>
<AudienceRangeValue>06</AudienceRangeValue>
<AudienceRangePrecision>04</AudienceRangePrecision>
<AudienceRangeValue>12</AudienceRangeValue>
</AudienceRange>
<Imprint>
<NameCodeType>02</NameCodeType>
<NameCodeValue>927</NameCodeValue>
</Imprint>
<Publisher>
<PublishingRole>01</PublishingRole>
<PublisherName>Savvas</PublisherName>
</Publisher>
<PublishingStatus>07</PublishingStatus>
<PublicationDate>19951001</PublicationDate>
<Measure>
<MeasureTypeCode>01</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>02</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>03</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>in</MeasureUnitCode>
</Measure>
<Measure>
<MeasureTypeCode>08</MeasureTypeCode>
<Measurement>0</Measurement>
<MeasureUnitCode>lb</MeasureUnitCode>
</Measure>
<OutOfPrintDate>19980305</OutOfPrintDate>
<SupplyDetail>
<SupplierName>Savvas</SupplierName>
<AvailabilityCode>OP </AvailabilityCode>
<PackQuantity>82</PackQuantity>
<Price>
<PriceTypeCode>05</PriceTypeCode>
<DiscountCoded>
<DiscountCodeType>02</DiscountCodeType>
<DiscountCode>K04</DiscountCode>
</DiscountCoded>
<PriceAmount>8.47</PriceAmount>
<CurrencyCode>USD</CurrencyCode>
</Price>
</SupplyDetail>
</Product>
</ONIXMessage>

@palex, I will take a look when I have time.
Have you tried anything? (Please share it, regardless of the end result.)
Also, check out the xmlstarlet and xidel utilities for querying XML.


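Unfortunately, in XPath 1.0 (the version xmllint implements), string() and concat() convert a node-set argument to the string value of its first node only. That is why //RecordReference inside concat() always yields the first item, however many <Product> elements the file contains.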
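If you need to stay with xmllint, one workaround is to select each <Product> by position and run one query per item. This is only a sketch: the function name products_xmllint is made up for illustration, it assumes xmllint prints each string result on its own line, and it re-parses the file once per item, so it will be slow on a long file.

```shell
# products_xmllint FILE  (hypothetical helper name, not part of xmllint)
# Prints RecordReference,TitleText,PriceAmount,DiscountCode for each <Product>.
products_xmllint() {
  file=$1
  # count() returns a number, so xmllint prints it as plain text.
  n=$(xmllint --xpath 'count(//Product)' "$file") || return 1
  i=1
  while [ "$i" -le "$n" ]; do
    # (//Product)[i] picks the i-th product in document order, so each
    # concat() argument now has at most one candidate "first node".
    p="(//Product)[$i]"
    xmllint --xpath "concat($p/RecordReference, ',', $p/Title[TitleType='01']/TitleText, ',', $p//PriceAmount, ',', $p//DiscountCode)" "$file"
    i=$((i + 1))
  done
}
```

Called as `products_xmllint z.xml`, it should give one comma-separated line per <Product>.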
xmlstarlet can do more than xmllint ... so this is the path to a correct solution.
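A minimal xmlstarlet sketch of that approach: -m iterates over every <Product>, and the concat() given to -v is then evaluated relative to each one, so "first match" means first within that product. The wrapper name products_xmlstarlet is hypothetical, and on some systems the binary is installed as xml rather than xmlstarlet. It has not been run against the full ONIX file here.

```shell
# products_xmlstarlet FILE  (hypothetical wrapper name)
# -t  starts a template
# -m  matches every <Product> and makes it the context node
# -v  prints the value of the XPath expression for that context
# -n  prints a newline after each product
products_xmlstarlet() {
  xmlstarlet sel -t \
    -m '//Product' \
    -v 'concat(RecordReference, ",", Title[TitleType="01"]/TitleText, ",", .//PriceAmount, ",", .//DiscountCode)' \
    -n "$1"
}
```

Usage would be `products_xmlstarlet z.xml`. The relative paths (no leading //) are what keep each field tied to its own product.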

The following is an awk "solution" that mostly recognizes the XML structure but also relies on the given line layout (each tag of interest must start its own line, as in the sample).

awk -F'[<>]' '$2=="/Product"{prod=0; print rref,ttext,price,disc} prod {if($2=="RecordReference")rref=$3; else if($2=="PriceAmount")price=$3; else if($2=="DiscountCode")disc=$3; else if($2=="Title")t=1; else if(t && $2=="TitleType" && $3=="01")tt=1; else if(tt && $2=="TitleText"){ttext=$3} else if($2=="/Title")t=tt=0} $2=="Product" {prod=1}' OFS="," my.xml
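For readability, here is roughly the same awk logic laid out as a commented script. Two things are additions and not part of the one-liner: the wrapper name products_awk is hypothetical, and the fields are cleared at each <Product> so that an item with a missing element prints an empty field instead of repeating the previous item's value. It still assumes each tag of interest starts its own line and that TitleType comes before TitleText.

```shell
# products_awk FILE  (hypothetical wrapper name)
# With -F'[<>]', a line like <Tag>text</Tag> splits into $2 = "Tag", $3 = "text".
products_awk() {
  awk -F'[<>]' -v OFS=',' '
    # Start of an item: switch on and clear the fields (added reset).
    $2 == "Product"  { prod = 1; rref = ttext = price = disc = ""; next }
    # End of an item: print the collected fields and switch off.
    $2 == "/Product" { prod = 0; print rref, ttext, price, disc; next }
    # Ignore everything outside <Product> ... </Product>, e.g. the header.
    !prod            { next }
    $2 == "RecordReference"   { rref  = $3 }
    $2 == "PriceAmount"       { price = $3 }
    $2 == "DiscountCode"      { disc  = $3 }
    # Only keep the TitleText that follows a TitleType of 01.
    $2 == "TitleType"         { want = ($3 == "01") }
    $2 == "TitleText" && want { ttext = $3 }
    $2 == "/Title"            { want = 0 }
  ' "$1"
}
```

Usage would be `products_awk my.xml`. Because awk sees raw text, entities such as &amp; in a title are printed as-is and not decoded.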

MadeInGermany - That works perfectly. Much appreciated!!

@palex, see if the command below fits your requirement. Investing time in an XML toolset is worth the effort in the long term (even though the tools can be a pain to grapple with).

xidel -e " //Product/concat( RecordReference, ',', Title[TitleType='01']/TitleText, ',', .//PriceAmount, ',', .//DiscountCode) " -s palex.xml
9780026106504,THE MACMILLAN DICT OF POLITICAL QUOT 93,42.97,K04
9780028603087,HOW TO WRITE A RESEARCH PAPER 95C,8.47,K04