Data extraction from .xml file

palex · May 10, 2016, 8:32pm

Hello,
I'm attempting to extract 13-digit numbers beginning with 978 from a data file with the following command:

awk '{ for(i=1;i<=NF;i++) if($i ~ /^978/) print $i; }' datafile > outfile

This typically works. However, the new data file is an .xml file, and I imagine that is why the command no longer works.

How can I either modify this command or convert the file so that the command will function?

Thanks so much!

Don_Cragun · May 10, 2016, 8:41pm

Without a representative sample of the contents of datafile and a clear statement of where in the file 978 followed by ten other decimal digits is supposed to be matched, we can only make wild guesses at what might meet your requirements...

palex · May 10, 2016, 10:53pm

Sample from the .xml file:

<PriceAmount>42.97</PriceAmount>
<CurrencyCode>USD</CurrencyCode>
</Price>
</SupplyDetail>
</Product>
<Product>
<RecordReference>9780028608129</RecordReference>
<NotificationType>03</NotificationType>
<RecordSourceType>04</RecordSourceType>
<ProductIdentifier>
<ProductIDType>15</ProductIDType>
<IDTypeName>ISBN-13</IDTypeName>
<IDValue>9780028608129</IDValue>
</ProductIdentifier>
<ProductIdentifier>
<ProductIDType>14</ProductIDType>
<IDTypeName>GTIN-14</IDTypeName>

Desired output:

9780028608129
9780028608129

Thanks again!

Don_Cragun · May 10, 2016, 11:24pm

Are you only looking for values found between <RecordReference> tags and between <IDValue> tags, or are you looking for values between any kinds of tags?

What operating system are you using?

Does the grep utility on your system have a -o option?

palex · May 10, 2016, 11:28pm

I wish to extract *all* such numbers (beginning with 978) from the file, irrespective of the tags.

Mac OS - El Capitan, XQuartz 2.7.8

Yes, it appears that grep has the -o option.

Thanks!

Don_Cragun · May 10, 2016, 11:39pm

Try:

grep -Eo '978[0-9]{10}' datafile
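
For what it's worth, the original awk command fails on this sample because awk splits fields on whitespace, so a field looks like <IDValue>9780028608129</IDValue> and starts with "<" rather than 978. If you would rather stay with awk, something along these lines should behave like the grep command above. It is only a sketch: extract_978 is a made-up helper name, and the digit class is spelled out ten times on the assumption that your awk may not support the {10} interval syntax.

```shell
# extract_978 is a hypothetical helper name, not something from this thread.
# It prints every occurrence of 978 followed by ten more decimal digits,
# wherever it appears on a line (inside tags or not).
extract_978() {
    awk '{
        s = $0
        # [0-9] is repeated ten times instead of using [0-9]{10},
        # since some awk versions do not support interval expressions.
        while (match(s, /978[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/)) {
            print substr(s, RSTART, RLENGTH)
            # Continue scanning the rest of the line after this match.
            s = substr(s, RSTART + RLENGTH)
        }
    }' "$@"
}
```

You would call it as: extract_978 datafile > outfile. Like the grep command, it also matches when the 13 digits sit inside a longer run of digits (a 14-digit value containing 978, for example), so check the output if that matters for your data.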

palex · May 10, 2016, 11:47pm

Perfect!