blog.humaneguitarist.org

METS/PREMIS with Jinja templates

[Thu, 01 Aug 2019 14:35:38 +0000]
As I mentioned in this post [http://blog.humaneguitarist.org/2018/04/26/a-faulty-premis/] over a year ago, I did some METS/PREMIS stuff for my last gig.

For the project, there were maybe 4-5 modules that took inputs and created outputs re: email processing. The people I worked for also wanted some sort of METS/PREMIS output to record events. They didn't really provide a lot of detail other than wanting their local rights info to be included in the data.

I told them that anything specific to their local needs (i.e. the rights) could not be embedded into the code. The project was for a grant that was supposed to serve other institutions like them and anything specific to them would interfere with that outcome. But I did need a way to get them what they wanted while not forcing it upon anyone else.

I decided to:

1. Create a custom logger that would output the data needed to populate PREMIS metadata.
2. Create code that could read the log data and convert it to an object.
3. Feed that object to code that wrote METS/PREMIS XML using Jinja templates.
    1. And I'd just create a template for their specific needs while defaulting to a more generic template.

Using a Logger

Using a logger let me add a few simple logging statements to all the modules for which they wanted preservation agents, events, and objects to be recorded. I just logged when the module completed its task. I didn't log anything if it failed, but that could easily have been added too. But for the project, any failures would have to get fixed and redone, so ...

Logging an "agent" (or event or object) would look something like this:

    self.preservation_logger.info({
        "entity": "agent",
        "name": "agent1",
        "fullname": "AGENT ONE",
        "uri": "http://agents/1",
        "version": "1.0"
    })

The logger writes each record as an individual YAML doc with a timestamp as its key.
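
To make the logger idea above a bit more concrete, here is a minimal sketch using only the standard library's logging module. It is not the project's actual logger: the function name make_preservation_logger, the logger name, and the format strings are all assumptions for illustration. It relies on the fact that the string form of a dict of simple strings also happens to be a valid YAML flow mapping.

```python
import io
import logging


def make_preservation_logger(stream, name="preservation"):
    """Hypothetical helper: return a logger that writes one
    "'<timestamp>': <message>" line per record to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records out of the root logger
    handler = logging.StreamHandler(stream)
    # Assumption: a quoted timestamp key followed by the dict's string form.
    handler.setFormatter(logging.Formatter(
        fmt="'%(asctime)s': %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S%z",
    ))
    logger.addHandler(handler)
    return logger


# Log to an in-memory stream here; a real module would more likely use a file.
log_stream = io.StringIO()
preservation_logger = make_preservation_logger(log_stream)
preservation_logger.info({
    "entity": "agent",
    "name": "agent1",
    "fullname": "AGENT ONE",
    "uri": "http://agents/1",
    "version": "1.0",
})
log_text = log_stream.getvalue()
```

Each call to info() then adds one line of the shape '<timestamp>': {'entity': 'agent', ...} to the stream, so a module only needs a single logging statement when it finishes its task.
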
A logged agent record looks like so:

    '2018-01-01T12:00:00-0400': {'entity': 'agent', 'fullname': 'AGENT ONE', 'name': 'agent1', 'uri': 'http://agents/1', 'version': '1.0'}

Event and object output is similar:

    '2018-01-01T12:00:01-0400': {'entity': 'event', 'name': 'event1', 'agent': 'agent1', 'object': 'object1'}
    '2018-01-01T12:00:02-0400': {'entity': 'object', 'name': 'object1', 'category': 'representation'}

Reading the Log Data

When the data above is read, the resulting object - let's call it MyObject - has 3 attributes:

* agents: A list of PREMIS Agent compatible data.
* events: A list of PREMIS Event compatible data.
* objects: A list of PREMIS Object compatible data.

Each list item is also an object like so:

    MyObject.agents[0].name # agent1
    MyObject.events[0].agent # agent1
    MyObject.events[0].timestamp # 2018-01-01T12:00:01-0400
    MyObject.events[0].agent == MyObject.agents[0].name # True

By feeding MyObject to Jinja templates, it's pretty easy to automatically create some decent METS/PREMIS output.

Using Templates

Using templates allowed me to create a generic template for basic METS/PREMIS and also a template exclusively for the people I was working for, i.e. with their rights data. So changing the output would just be a matter of choosing the appropriate template file. Creating new templates allows one to change the output without altering the code. And templates can be shared.

Aside: I think it's generally a really bad idea to write complex XML using pure code with something like lxml - it becomes really hard to see what's going on and what the final output will look like. Which makes it harder to debug ... which makes it costlier to maintain ...

It's also pretty easy to add custom Jinja filters to output CDATA and whatnot.

I also wrote in the capability to create Dublin Core (as RDF) from simple Excel files. This data would also be converted into an object for the template to read so that the METS could include some descriptive metadata.
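
The reading and templating steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: read_log, the template variable name log, and the tiny PREMIS fragment are all hypothetical, and the fragment is deliberately incomplete rather than schema-valid. Because the sample records also happen to be valid Python literals, ast.literal_eval stands in for a YAML parser here.

```python
import ast
from types import SimpleNamespace

from jinja2 import Template

LOG_TEXT = """\
'2018-01-01T12:00:00-0400': {'entity': 'agent', 'fullname': 'AGENT ONE', 'name': 'agent1', 'uri': 'http://agents/1', 'version': '1.0'}
'2018-01-01T12:00:01-0400': {'entity': 'event', 'name': 'event1', 'agent': 'agent1', 'object': 'object1'}
'2018-01-01T12:00:02-0400': {'entity': 'object', 'name': 'object1', 'category': 'representation'}
"""


def read_log(text):
    """Hypothetical reader: group records into agents, events, and objects."""
    result = SimpleNamespace(agents=[], events=[], objects=[])
    for line in text.splitlines():
        if not line.strip():
            continue
        # Assumption: wrapping a record in braces makes it a Python dict literal,
        # which avoids needing a YAML library for this sketch.
        record = ast.literal_eval("{" + line + "}")
        for timestamp, fields in record.items():
            item = SimpleNamespace(timestamp=timestamp, **fields)
            # "agent" -> result.agents, "event" -> result.events, and so on.
            getattr(result, fields["entity"] + "s").append(item)
    return result


# A deliberately small, incomplete PREMIS-like fragment for illustration only.
TEMPLATE = Template(
    """\
{% for agent in log.agents %}
<premis:agent>
  <premis:agentName>{{ agent.fullname }}</premis:agentName>
</premis:agent>
{% endfor %}
{% for event in log.events %}
<premis:event>
  <premis:eventType>{{ event.name }}</premis:eventType>
  <premis:eventDateTime>{{ event.timestamp }}</premis:eventDateTime>
</premis:event>
{% endfor %}
""",
    autoescape=True,     # escape &, <, > in logged values
    trim_blocks=True,    # drop the newline after each block tag
    lstrip_blocks=True,  # drop leading whitespace before block tags
)

my_object = read_log(LOG_TEXT)
xml_fragment = TEMPLATE.render(log=my_object)
```

Swapping in a different template string (or file) changes the XML without touching read_log, which is the point of keeping institution-specific output, like local rights data, in templates rather than in code.
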
Finally, I also wrote code to represent files and folders as objects in order to populate a METS "manifest" file, i.e. <fileSec> stuff. This would be output to a separate file from the METS file containing both Dublin Core and PREMIS metadata. I knew that people would want to open the Dublin Core/PREMIS METS file in a text editor and look at it. And I knew that the manifest stuff could result in files so large that they couldn't be opened.

Here's a snippet from the template used for storing the METS manifest's contents.

    {% for file in SELF.directory_obj.files() %}
    <file SIZE="{{ file.size }}" ID="_{{ file.index }}" MIMETYPE="{{ file.mimetype() }}" CREATED="{{ file.created }}" CHECKSUM="{{ file.checksum('SHA-256') }}" CHECKSUMTYPE="SHA-256">
        <FLocat xlink:href="{{ file.name }}" LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" />
    </file>
    ...

The snippet above allows one to see the types of file attributes available.

Now, I was under the impression that we had to support Python 3.5 so I didn't use the pathlib module (it has been around since 3.4, but the rest of the standard library only started accepting its path objects broadly in 3.6+) and ended up writing my own code to make files and folders into objects using classic os.path stuff.

In using project data, the code was able to create METS manifest files over 1 GB without any issues (other than time). Some of the files in the packages themselves were over 1 GB so hashing large files seemed to work fine too.

My Hidden Agenda

I like to think about doing things a level or two above what's needed. In other words, when someone might say "Hey, let's write a module to make Function X callable via a RESTful API", I might say "No, let's write a third party module that makes any function callable via a RESTful API and use it on Function X today and Function Y tomorrow."

That's the same way I approached the METS/PREMIS code stuff. I knew I'd want to eventually create a generic Python module for METS/PREMIS stuff that uses templates and might be of value for other people's work as well.
I just think the logging approach to preservation data is one with potential.

I'm planning on creating a Python package I'll call marmalade because marmalade is a type of preserve. Granted, there's already a package with that name on PyPI [https://pypi.org/project/marmalade/] but whatevs.

Anyway, I've done some basic API planning and will try to add updates as they occur. Right now I'm busy doing a little Node.js/Vue.js stuff - which I might write about too.