blog.humaneguitarist.org

pOAIndexter: grabbing and indexing online metadata

[Sun, 02 Oct 2011 15:20:09 +0000]
As per usual, a good bit of my computer-y stuff at home relates to something that's come up at work. And as usual, I'm pretty ignorant of what I'm getting myself into, but I don't mind.

The other week, my boss and I met with some great people at digitalnc.org [http://digitalnc.org/] and we started talking about the idea of a super simple, lightweight approach to providing a one-stop-shop search interface for collections across the state - provided those collections expose their metadata somehow. For now, we talked about limiting this to people who do so with an OAI feed and grabbing that metadata. But eventually, this thing should be metadata agnostic - in the sense that it isn't about a metadata format, but just the data itself. By the way, I guess "grabbing" and "feed" aren't the terms I typically see used with OAI - about which I admittedly don't know much - but I don't care. Same difference.

Of course, there's nothing new to this. I guess one could use Blacklight [http://projectblacklight.org/] or VuFind [http://vufind.org] to do this kind of thing. But even though those are existing open source projects, I'm not sure that using them wouldn't be overkill and wouldn't in turn increase dependencies and maintenance overhead. Actually, that's a topic for another time - I mean the idea that just because part of something is capable of doing what you want doesn't necessarily make it a better option than rolling one's own, if using and updating said something entails more cost in the long run. Paved roads often get you there faster, but a willingness to get lost now and then is how you learn where all the really cool local bars are ... ;)

Anyway, here's what I'm thinking. A small script would simply look at an XML setup file from which it would know which places to go grab metadata from, the type of feed, the last time the metadata was requested, and stuff like the resumptionToken [http://www.oaforum.org/tutorial/english/page4.htm#section17] if applicable.
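To make the setup-file idea a bit more concrete, here's a rough Python sketch of what reading it might look like. Nothing here is settled: the element and attribute names (`source`, `lastRequested` and so on) and the helper names are made up for illustration, and the second source is a placeholder. The one thing taken from the OAI-PMH spec is that a resumptionToken, when present, is sent as the only argument besides the verb.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical setup file. Every element and attribute name is an assumption
# made for this sketch, not a settled format.
SETUP_XML = """
<sources>
  <source name="caltech-oral-histories"
          type="oai"
          url="http://oralhistories.library.caltech.edu/perl/oai2"
          metadataPrefix="oai_dc"
          lastRequested="2011-10-01"
          resumptionToken="" />
  <source name="some-feed"
          type="rss"
          url="http://example.org/feed"
          lastRequested="" />
</sources>
"""


def read_sources(setup_xml):
    """Return one dict of attributes per <source> element."""
    root = ET.fromstring(setup_xml)
    return [dict(source.attrib) for source in root.findall("source")]


def build_request_url(source):
    """Build the URL the script would fetch for one source."""
    if source.get("type") != "oai":
        # Non-OAI feeds (RSS, a home-grown web service) are fetched as-is.
        return source["url"]
    token = source.get("resumptionToken")
    if token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {
            "verb": "ListRecords",
            "metadataPrefix": source.get("metadataPrefix", "oai_dc"),
        }
        if source.get("lastRequested"):
            # Only ask for records changed since the last request.
            params["from"] = source["lastRequested"]
    return source["url"] + "?" + urlencode(params)


for source in read_sources(SETUP_XML):
    print(source["name"], "->", build_request_url(source))
```

After each run the script would write the new request date and any resumptionToken back to the setup file, so the next run picks up where this one left off.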
The setup file would also store the appropriate XSL file to process the metadata with, so that the metadata could be passed into Solr [http://lucene.apache.org/solr/#intro] to be indexed and searchable. Anyone whose site doesn't provide metadata as XML could simply create a web service that does so, e.g. a RESTful MySQL to XML thingamajig. The output XML just needs to have an XSL that will facilitate passing it to Solr for that data to be part of the shared metadata store. And since XSL is the universal translator [http://en.wikipedia.org/wiki/Universal_translator#Star_Trek] in this context, other metadata types such as RSS/Atom feeds could be grabbed, too. All one needs to do is add an entry to the XML config file so the script knows to retrieve metadata from that site, and make sure there's an XSL file that can be used to facilitate passing the data into Solr.

So in the end, all this should take in terms of coding is a small script, one XML config file, and as many XSL files as needed.

For fun and to start learning about Solr, I just manually grabbed some OAI metadata from CalTech [http://oralhistories.library.caltech.edu/perl/oai2?verb=ListRecords&set=7375626A656374733D737562:68756D&metadataPrefix=oai_dc] yesterday - it was for some oral histories. Then I ran the records through an XSL file and posted them to Solr. In no time I had a searchable, local metadata store to play around with (screenshot below). Since I was using all the defaults from the Solr tutorial [http://lucene.apache.org/solr/tutorial.html], I had to map the metadata fields to things like manufacturer, since the default is set up for an electronics store.

IMAGE: "Solr screenshot"[http://blog.humaneguitarist.org/uploads/solrUI_screenshot.png]

BTW if we use this, at some point I won't be able to call it "pOAIndexter" but for now I can. Since I don't know if I'll do this in Python or PHP and since OAI is what we'll work on first, I guess it stands for "Python or PHP OAI Indexer". Yes, I'm a dork.
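Going back to the transform step for a second: the plan is to do it with XSL, but the shape of the mapping is easy to sketch in plain Python with the standard library's ElementTree standing in for the stylesheet. The sample record below is invented, and the Solr field names (`name`, `manu`, `cat`, `features`) are assumed to match the tutorial's electronics-store schema. The result is the `<add><doc><field name="...">` XML that Solr's update handler accepts.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Assumed mapping from Dublin Core elements onto the tutorial schema's
# electronics-store fields; a real setup would define its own fields.
FIELD_MAP = {
    "title": "name",
    "creator": "manu",
    "subject": "cat",
    "description": "features",
}

# Invented sample response, trimmed to a single record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Interview with A. Physicist</dc:title>
          <dc:creator>Physicist, A.</dc:creator>
          <dc:subject>Physics</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def oai_to_solr_add(oai_xml):
    """Turn an OAI-PMH ListRecords response into Solr <add> XML."""
    root = ET.fromstring(oai_xml)
    add = ET.Element("add")
    for record in root.iter(OAI + "record"):
        doc = ET.SubElement(add, "doc")
        # Use the OAI identifier as the Solr document id.
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        ET.SubElement(doc, "field", name="id").text = identifier
        for element in record.iter():
            if not element.tag.startswith(DC):
                continue  # skip everything that is not a Dublin Core element
            solr_field = FIELD_MAP.get(element.tag[len(DC):])
            if solr_field and element.text:
                field = ET.SubElement(doc, "field", name=solr_field)
                field.text = element.text.strip()
    return ET.tostring(add, encoding="unicode")


print(oai_to_solr_add(SAMPLE))
```

The resulting XML would then be POSTed to Solr's update handler and followed by a commit. An XSL version would express the same mapping as templates, one stylesheet per metadata flavor.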