How to: Validate a CSV file using an XSD?

Hi All,

I was wondering if there is a utility/perl library/way of validating the contents of a CSV file using an XSD.

i.e. Customer csv (including Header & Trailer)

Cust_num, Cust_nme, Cust_typ, Cust_act_dte, Cust_loc,
101,Joe's Pizza,Retail,10121979,Detroit,
102,Sony Corp,Commercial,10101946,Tokyo,
103,K-CO Foods,Wholesale,01041987,London,
3

The XSD to validate against would contain:

<xs:element name="Cust_num," type="xs:positiveInteger"/>
<xs:element name="Cust_nme" type="xs:string" minOccurs="0"/>
<xs:element name="Cust_typ" type="xs:string" minOccurs="0"/>
<xs:element name="Cust_act_dte" type="xs:date" />

Please let us have your thoughts?

Many Thanks

Luinzi

DSDL looks like a match for this sort of thing. https://forge.gridforum.org/sf/projects/dfdl-wg

1 Like

An XML schema describes the allowed structure and types for an XML document, not a CSV file. The best you can do is create an XML document from the CSV file contents and validate that document against the XML schema.

CSV does have some rules that might support a validation.

  1. Lines should be cr-lf, but for most of us, that is not critical.
  2. You cannot leave double quotes open at end of line, or is that a line break embedded in a field? Probably, the user should have to allow that.
  3. It'd be nice if the field count of every line was identical (quoting problems, embedded linefeeds?), but that is somewhat a warning. The user should be able to provide an expected range of fields, or indicate if the field count is not always identical in the data set.
  4. You need to be double-quote sensitive when counting fields, it is not simple comma delimited text.
  5. You need to be doubled-double-quote sensitive when evaluating double quotes.

Yeah, I always see XSD in the context of parsers like xerces having XML checking against XSD, although for speed I did my own checks. I was not sure if XSD was extensible to CSV format. XML is a slippery world, in terms of what you can do! :smiley:

1 Like

Actually, fp, XSD's can define formats for other than XML. This company received a number of XSD's for ANSI X.12 files from Microsoft with our Biztalk software. (Specifically in my case, using an XSD for X.12 HIPAA 5010 837P claims files).

I looked at this thread because we have reason to want to validate the X.12 files against the XSD from code rather than through Biztalk in one instance

Regards

I believe that Biztalk builds an XML representation of the X.12 message before validating it using the appropriate XSD.

Certainly, for generic CSV to XML, the records can be in an XML element, and the fields can be XML strings, field-numbered by attribute within each element. Generally, a CSV record's position in the file is not significant, but records could be numbered, too, so that order is available.

However, the CSV needs to be CSV-Valid for conversion to anything.

  • I can imagine XML that describes the CSV rules, and
  • if I imagine a little harder, I can imagine XML to describe a specific CSV file so they validity of text to other type can be checked, as well as the validity of field counts, and to annotate columns with official labels.
  • But how far does it go? If I imagine really hard, I imagine XSD can do all this, but it feels a bit of a reach, since XSD exists to validate XML, not other file structures. I think it sounds like a nice accessory, but not a subset of XSD, conceptually. You could even write a reverse XML compiler to sniff files and suggest structure, create XML, validate to that XML, interactively, like Excel text to columns but smarter, more tightly typed.