blog.humaneguitarist.org

a faulty PREMIS

[Thu, 26 Apr 2018 17:12:39 +0000]
A few months ago I was working on some METS and PREMIS 3 stuff for my current gig. While creating sample METS files containing PREMIS, I saw a discrepancy between the PREMIS 3 Data Dictionary [https://www.loc.gov/standards/premis/v3/premis-3-0-datadictionary-only.pdf] and the PREMIS 3 XSD [https://github.com/LibraryOfCongress/premis-v3-0/blob/master/xsd] in terms of how to define a PREMIS "object" element as either a File or Representation, etc.

The Data Dictionary says to use an "objectCategory" element within the "object" element and that it is mandatory:

"The only mandatory semantic units that apply to all categories of Object (Intellectual Entity, Representation, File, and Bitstream) are objectIdentifier and objectCategory."

The XSD acknowledges the difference but does not include the "objectCategory" element. In other words, as expected, the XSD is concerned with providing a way to programmatically validate a PREMIS document. Per the XSD, the way to do this is to use an "xsi:type" attribute in the root "object" element a la:

    <premis:object xsi:type="premis:representation">

So far, I see one problem: the Data Dictionary says to do something one way and the XSD another.

However, in using command-line and Python validators based on libxml2 [http://xmlsoft.org], which is very common, I can't get a METS document to validate if I notate the PREMIS "object" element per the XSD. In other words, this valid (per the PREMIS XSD) snippet will invalidate a METS file using a libxml2-based validator.
Example 1

    <premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd">
      <premis:object xsi:type="premis:representation">
      <!-- <premis:object> -->
        <premis:objectIdentifier>
          <premis:objectIdentifierType valueURI="http://id.loc.gov/vocabulary/identifiers/local">local</premis:objectIdentifierType>
          <premis:objectIdentifierValue>1234</premis:objectIdentifierValue>
        </premis:objectIdentifier>
        <!-- <premis:objectCategory>representation</premis:objectCategory> -->
        <premis:significantProperties>
          <premis:significantPropertiesValue>foo bar</premis:significantPropertiesValue>
        </premis:significantProperties>
      </premis:object>
      <!-- Valid as a standalone file against the PREMIS XSD. -->
      <!-- Does not comply with Data Dictionary. -->
      <!-- Will invalidate a METS file with libxml2:
           Validation of current file using XML schema:
           ERROR: Element '{http://www.loc.gov/premis/v3}object', attribute '{http://www.w3.org/2001/XMLSchema-instance}type': The QName value '{http://www.loc.gov/premis/v3}representation' of the xsi:type attribute does not resolve to a type definition.
      -->
    </premis:premis>

And this snippet in Example 2 will not invalidate the METS, although it will not validate directly against the PREMIS XSD.
Example 2

    <premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd">
      <!-- <premis:object xsi:type="premis:representation"> -->
      <premis:object>
        <premis:objectIdentifier>
          <premis:objectIdentifierType valueURI="http://id.loc.gov/vocabulary/identifiers/local">local</premis:objectIdentifierType>
          <premis:objectIdentifierValue>1234</premis:objectIdentifierValue>
        </premis:objectIdentifier>
        <premis:objectCategory>representation</premis:objectCategory>
        <premis:significantProperties>
          <premis:significantPropertiesValue>foo bar</premis:significantPropertiesValue>
        </premis:significantProperties>
      </premis:object>
      <!-- Invalid as a standalone file against the PREMIS XSD:
           ERROR: Element '{http://www.loc.gov/premis/v3}object': The type definition is abstract.
      -->
      <!-- Complies with Data Dictionary. -->
      <!-- Will not invalidate a METS file with libxml2. -->
    </premis:premis>

At this point, there's a second problem. Namely, I have to use the Data Dictionary version, which is invalid per the PREMIS XSD, inside a METS file.

When I reached out to the PREMIS group, the person who replied (very promptly from overseas) wasn't able to recreate the issue. He was using Oxygen XML, I believe. So I downloaded Oxygen, Altova, and Stylus Studio demos. All of them were able to validate a METS file that embeds the snippet as in Example 1.

But libxml2, as I understand, is pretty prevalent as the backbone for a lot of XML libraries. It's the basis for Python's lxml, which is what I'm using. It's been a while since I tested these snippets, but I'm pretty sure I also had validation issues using some Java-based command line tools (i.e. not libxml2).

The responder also said that the XSD was an "endorsed (but not compulsory) expression(s)" of the Data Dictionary.
So it sounds like the Data Dictionary is canonical, albeit without a way to be validated, given how the XSD deviates from it.

Also note how the "valueURI" attribute I used in the examples isn't even mentioned in the Data Dictionary as far as I saw, but is notated in the XSD. That's a concern to me because it adds valuable information about the metadata values themselves, but one has to read the XSD to learn which attributes can be used.

I think the bigger picture here is:

1. There shouldn't be a discrepancy between the Data Dictionary and the XSD re: notating the type of object.
    1. Attributes should be mentioned in the Data Dictionary, too.
2. The method in the Data Dictionary re: how to notate the type of object isn't a good idea in my opinion. In other words, XSD aside, declaring the type of an element with the value of a child element seems odd to me when that value determines what the parent can actually contain. That's a little like having to walk inside a restaurant and look at the menu first before you can determine whether you've walked into a McDonald's or a Burger King. Hopefully, it's actually a Cook Out [http://www.cookout.com], but I digress.
3. The method in the XSD doesn't appear to be friendly to a large class of XML tools, which in turn may actually hinder some digital preservation efforts due to the inability to validate METS/PREMIS with said tools. Or, it could create the need for a custom scripting solution in order to validate the METS/PREMIS, which defeats the point of having XSDs.

I don't see why it wouldn't simply be better to make the first required element inside a "premis:object" element be either "premis:file" or "premis:representation", etc.
In other words:

    <premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.loc.gov/premis/v3 http://www.loc.gov/standards/premis/v3/premis.xsd">
      <premis:object>
        <premis:representation> <!-- New container. -->
          <premis:objectIdentifier>
            <premis:objectIdentifierType valueURI="http://id.loc.gov/vocabulary/identifiers/local">local</premis:objectIdentifierType>
            <premis:objectIdentifierValue>1234</premis:objectIdentifierValue>
          </premis:objectIdentifier>
          <premis:significantProperties>
            <premis:significantPropertiesValue>foo bar</premis:significantPropertiesValue>
          </premis:significantProperties>
        </premis:representation>
      </premis:object>
    </premis:premis>

Alternatively, "premis:object" could be a child of "premis:representation" in the example above - or it could just be omitted. Either way, I think this is more human-readable and will validate more easily across more tools. In general, I think it may be an issue to validate an element inside METS according to its attribute value, with or without an "xsi" namespace prefix - at least as far as libxml2 is concerned (I think because it doesn't support XSD 1.1).

Look, I think PREMIS is really valuable - even though I couldn't resist the title of this post (it was right there for me!). And while I was glad to learn that there are some XML validation tools that did the job, they were proprietary. My takeaway is that preservation schemas should probably consider, as part of the design process, the tools that will interpret them as well as the context (METS) in which they'll likely appear with regularity.

...

Update, May 2, 2018: I just noticed that in prior versions of this post, due to careless copy-paste on my part, I had not removed the "premis:objectCategory" element in my proposed alternative. I just fixed that. The "objectCategory" element is useless when using the more specific "premis:representation" element.
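
As a rough illustration of the kind of custom scripting mentioned in point 3 above, here is a minimal Python sketch that reads the category of each "premis:object" element whichever way it was notated: via the XSD's "xsi:type" attribute or via the Data Dictionary's "objectCategory" child. It is not schema validation and does not use lxml or libxml2; it only uses the standard library's ElementTree. The function name `object_categories` and the two trimmed-down sample documents are assumptions made up for this example, not part of PREMIS or any library.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def object_categories(premis_xml):
    """Return the declared category of each premis:object in an XML string.

    Hypothetical helper: accepts either notation (xsi:type attribute or
    objectCategory child) and reports None when an object declares neither.
    """
    root = ET.fromstring(premis_xml)
    categories = []
    for obj in root.iter(f"{{{PREMIS_NS}}}object"):
        xsi_type = obj.get(f"{{{XSI_NS}}}type")
        child = obj.find(f"{{{PREMIS_NS}}}objectCategory")
        if xsi_type is not None:
            # Drop the namespace prefix, e.g. "premis:representation".
            categories.append(xsi_type.split(":")[-1])
        elif child is not None and child.text:
            categories.append(child.text.strip())
        else:
            categories.append(None)
    return categories


# Trimmed-down versions of Example 1 (XSD style) and Example 2
# (Data Dictionary style); sample data for illustration only.
XSD_STYLE = """<premis:premis xmlns:premis="http://www.loc.gov/premis/v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.0">
  <premis:object xsi:type="premis:representation"/>
</premis:premis>"""

DICTIONARY_STYLE = """<premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0">
  <premis:object>
    <premis:objectCategory>representation</premis:objectCategory>
  </premis:object>
</premis:premis>"""

print(object_categories(XSD_STYLE))         # ['representation']
print(object_categories(DICTIONARY_STYLE))  # ['representation']
```

A check like this could sit in front of a libxml2-based validator to flag which notation a given METS/PREMIS file uses, though it says nothing about whether the rest of the document is valid.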