blog.humaneguitarist.org

can version control help with preservation metadata?

[Sun, 02 Sep 2018 23:40:01 +0000]
I wrote in my last post [http://blog.humaneguitarist.org/2018/04/26/a-faulty-premis/] that I'm doing some stuff with PREMIS for my work and that I had some issues with PREMIS' structure. What I didn't mention was my concern about the overhead of creating PREMIS data in the first place. That is to say, there's a point at which describing something becomes more laborious than doing something. If it's laborious to create PREMIS data, there's a chance that difficulty becomes a reasonable deterrent to creating the data at all, or at least to creating it well. Meaning that, at some point, the cost of preservation metadata can become a contributing factor in arguing against it.

Apart from the implementation cost, it can carry a significant opportunity cost. In other words, if documenting something becomes so involved that you can't move on to the next something, then the cost of creating preservation metadata may impact your ability to responsibly store other files and data that need attention.

Certainly, there's the argument to be made that creating preservation metadata shouldn't be viewed as a response to a preservation-related activity, but rather as an equally important part of that activity - i.e. doing is not doing unless doing includes documentation of said doing. But I don't buy it.

I mentioned in a post from 2010 [http://blog.humaneguitarist.org/2010/04/25/libos-seeking-a-linux-distro-for-digital-libraries/] that a lot of this work could be alleviated at the system level. For my work, our PREMIS data will be created automatically by the tools we created, and actions taken by our planned future tools will augment that preservation data. But again, we're stuck with automation, not at the system level, but at the specific level of a defined set of tasks within a defined scope using a defined suite of tools.

So I've been wondering if there isn't something that can help in the general sense, short of something at the level of the entire system. Every time I make a commit using Git, I'm thinking to myself: if my repository consists of preservation data, isn't my commit a preservation event? At the least, doesn't my user name constitute an agent? Isn't there some way to create preservation metadata based on my commit activity?

In a brief online search, I couldn't see that there's something out there that does what I want. If there is, I would really like to know about it. But what I want (just to start) is a wrapper around version control software with the following features:

1. Allows the user to CRUD [https://en.wikipedia.org/wiki/Create,_read,_update_and_delete] preservation "components": AGENTs, EVENTs, OBJECTs, and RIGHTs.
   + Examples include people and software (AGENTs), commits and code execution (EVENTs), data (OBJECTs), and RIGHTs information.
   + All the components will have a corresponding JSON metadata file within the version control backend (a rough sketch of what that might look like follows this list).
2. Allows the user to reference said components along with the commit message (i.e. the EVENT).
3. Allows the user to execute code while storing the syntax, parameters, and output as part of an EVENT.
4. Allows the user to convert prior commit history, in whole or in part, into PREMIS XML.
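Just to make that first feature a bit more concrete, here's a minimal sketch (in Python, since that's the direction I'm leaning) of how a wrapper might store each component as a JSON file inside the version-controlled directory. The .pres folder name, the create_component function, and the field names are all made up for illustration, not a real implementation.

# Minimal sketch: store each preservation "component" (AGENT, EVENT, OBJECT,
# RIGHTS) as a JSON file inside the version-controlled directory. The ".pres"
# folder, the function name, and the fields are hypothetical.
import json
import uuid
from pathlib import Path

def create_component(repo_dir, entity, alias, **fields):
    # Write the component's metadata to <repo_dir>/.pres/<entity>/<alias>.json
    # and return the record.
    record = {"entity": entity, "id": str(uuid.uuid4()), "alias": alias}
    record.update(fields)
    path = Path(repo_dir) / ".pres" / entity / (alias + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return record

if __name__ == "__main__":
    # e.g. register a person and a piece of software as AGENTs.
    create_component("cat_images", "agent", "nitaro", name="Nitin Arora")
    create_component("cat_images", "agent", "imagemagick", name="ImageMagick",
                     version="7.0", url="http://www.imagemagick.org")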
For example, let's say I have a set of cat images as .tiff files and I've created a repository for them using the imaginary Preservation Metadata Version Control System with the git-like command pres. When I initialize the repository, I've already got a set of preservation OBJECTs, i.e. the cat images. As the user, I'm already registered as an AGENT too. Perhaps my operating system should be automatically registered as one as well. Below is a series of commands where I'll try to illustrate my thinking ...

$ cd cat_images
$ pres init # initializes the version-controlled directory; registers me as an AGENT; registers "init" as an EVENT.
$ pres read agent * # lists all AGENTs.
nitaro
$ pres read agent nitaro
{
  "entity": "agent",
  "id": "{uid_1}",
  "alias": "nitaro",
  "email": "ni...@...com"
}
$ pres update agent nitaro name='Nitin Arora'
$ pres read agent nitaro
{
  "entity": "agent",
  "id": "{uid_1}",
  "alias": "nitaro",
  "email": "ni...@...com",
  "name": "Nitin Arora"
}
$ pres create agent imagemagick name=ImageMagick version=7.0 url=http://www.imagemagick.org
$ pres read agent imagemagick
{
  "entity": "agent",
  "id": "{uid_2}",
  "alias": "imagemagick",
  "name": "ImageMagick",
  "version": "7.0",
  "url": "http://www.imagemagick.org"
}
$ pres read agent * --name-only
nitaro
imagemagick
$ cmd='find . -name "*.tiff" | replace .tiff "" | xargs -i convert {}.tiff {}.png'
$ pres create event tiff2png stdin="$cmd" agent=imagemagick
$ pres read event * --name-only
init
tiff2png
$ pres read event tiff2png
{
  "entity": "event",
  "id": "{uid_3}",
  "alias": "tiff2png",
  "stdin": "find . -name \"*.tiff\" | replace .tiff \"\" | xargs -i convert {}.tiff {}.png",
  "agents": [
    "nitaro",
    "imagemagick"
  ]
}
$ pres update event tiff2png -bash # executes the "stdin" value.
$ pres read event tiff2png
{
  "entity": "event",
  "id": "{uid_3}",
  "alias": "tiff2png",
  "stdin": "find . -name \"*.tiff\" | replace .tiff \"\" | xargs -i convert {}.tiff {}.png",
  "stdout": "",
  "stderr": "",
  "exit": 0,
  "agents": [
    "nitaro",
    "imagemagick"
  ]
}
$ pres commit -m "Created deliverable PNG files for all TIFF files." -event tiff2png
$ pres read event tiff2png
{
  "entity": "event",
  "id": "{uid_3}",
  "alias": "tiff2png",
  "detail": "Created deliverable PNG files for all TIFF files.",
  "commit": "6e84edd2baafdffb38637f2cbff31bcb893aa447",
  "commit_date": "2018-07-29T01:13:33+00:00",
  ...
}
$ pres render cat_premis.xml -commit * -template default.xml # renders a PREMIS file for all commits using the specified PREMIS template (e.g. a Jinja template).

This post was mostly a public brainstorm, so ignore the details in the example above. They will need to map to PREMIS elements, but the example was meant to demonstrate the intent, not a real implementation. My idea isn't to have something that creates hyper-detailed PREMIS, but something that's good enough and that reduces the overhead of doing so. Maybe I'm just advocating creating a different kind of overhead, though. I'm thinking of drafting a specification for a proof-of-concept version and trying it with Mercurial since it's Python-based.
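To give that proof of concept a little more shape, here's a rough sketch of what the pres render step might boil down to with Mercurial and Jinja2: pull the commit history out of the repository with hg log and pour it into a PREMIS template. The template below and the mapping of commit fields onto PREMIS elements are placeholders chosen for illustration only, not a considered crosswalk.

# Rough sketch only: read commit history from a Mercurial repository via
# "hg log" and render each commit as a (skeletal) PREMIS event with Jinja2.
# The element mapping is a placeholder, not a real PREMIS crosswalk.
import subprocess
from jinja2 import Template

PREMIS_TEMPLATE = Template("""\
<premis:premis xmlns:premis="http://www.loc.gov/premis/v3" version="3.0">
{% for c in commits %}  <premis:event>
    <premis:eventIdentifier>
      <premis:eventIdentifierType>commit</premis:eventIdentifierType>
      <premis:eventIdentifierValue>{{ c.node }}</premis:eventIdentifierValue>
    </premis:eventIdentifier>
    <premis:eventType>modification</premis:eventType>
    <premis:eventDateTime>{{ c.date }}</premis:eventDateTime>
    <premis:eventDetailInformation>
      <premis:eventDetail>{{ c.desc | e }}</premis:eventDetail>
    </premis:eventDetailInformation>
  </premis:event>
{% endfor %}</premis:premis>
""")

def get_commits(repo_dir):
    # One line per commit: node, ISO date, and first line of the commit
    # message, separated by tabs.
    out = subprocess.run(
        ["hg", "log", "--template", "{node}\\t{date|isodate}\\t{desc|firstline}\\n"],
        cwd=repo_dir, capture_output=True, text=True, check=True).stdout
    commits = []
    for line in out.splitlines():
        node, date, desc = line.split("\t", 2)
        commits.append({"node": node, "date": date, "desc": desc})
    return commits

def render_premis(repo_dir, out_path):
    # Render the whole commit history to a skeletal PREMIS XML file.
    with open(out_path, "w") as f:
        f.write(PREMIS_TEMPLATE.render(commits=get_commits(repo_dir)))

if __name__ == "__main__":
    render_premis("cat_images", "cat_premis.xml")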