blog.humaneguitarist.org

bidi bidi bidi and more on pOAIndexter-ing metadata

[Sat, 15 Oct 2011 13:05:55 +0000]
It's shaping up to be a sunny day and this means I need to go on a long walk.

But before I do that, I'll follow up on this [http://blog.humaneguitarist.org/2011/10/02/poaindexter-grabbing-and-indexing-online-metadata/] post about grabbing OAI metadata from an online source and throwing the metadata into Solr for searching purposes, etc.

Last night - while [DEL: watching :DEL] streaming the Gil Gerard iteration of Buck Rogers [http://www.imdb.com/title/tt0078579/] - I wrote a small PHP script to grab this [http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr] OAI metadata from the Library of Congress' site.

BTW: this [http://memory.loc.gov/ammem/oamh/oai_request.html] is a cool page of theirs that helps one get started with OAI feeds, etc.

Aside: Is it only since the advent of hypertext that the word "this" began appearing in a referential context within documents?

As I mentioned in the previous post, an XML config file will instruct the code where to get the metadata and which XSL file will be used to transform the data into something Solr can chew on. I haven't bothered with the config file yet, so for now I just tested it on the specific metadata linked to above, since the config file aspect of this is the most trivial component of the whole thing.

Anyway, below is the PHP file, the OAI to Solr XSL file, and a snippet of the output. Last is a Python script that does the same thing as the PHP. It's not OO like the PHP file, but I just whipped it up this morning for shiggles [http://www.urbandictionary.com/define.php?term=shiggles].

Here's the PHP ...

<?php

function grabMetadata($urlArg) {
    $ch = curl_init(); // see: http://php.net/manual/en/book.curl.php
    curl_setopt($ch, CURLOPT_URL, $urlArg);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $curlOut = curl_exec($ch);
    curl_close($ch); // close the handle before returning; placed after "return" it would never run
    return $curlOut;
}

// See "http://www.php.net/manual/en/xsltprocessor.transformtoxml.php" for instructions re: XSL processing as below.
function useXSL($output) {
    $search_results = new DOMDocument;
    $search_results->loadXML($output);
    // If you just use "load" instead of "loadXML" it won't work unless you first stored the XML results in a file (boo!).
    // For info on "loadXML" see: http://www.php.net/manual/en/domdocument.loadxml.php
    $proc = new XSLTProcessor;
    $xsl = new DOMDocument;
    $xsl->load('OAI_to_solr.xsl');
    $proc->importStylesheet($xsl);
    $processed = $proc->transformToXML($search_results);
    return $processed;
}

function writeSOLR($solrXML) {
    $myFile = "for_solr-PHP.xml";
    $fh = fopen($myFile, 'w') or die("can't open file");
    // The XSL declares UTF-8 output, so the string is written as-is;
    // running utf8_encode() on it would double-encode any non-ASCII characters.
    // On UTF-8 and fwrite, see: http://www.php.net/manual/en/function.fwrite.php#73764
    fwrite($fh, $solrXML);
    fclose($fh);
}

// Do stuff ...
$output = grabMetadata('http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr');
writeSOLR(useXSL($output));

?>

The XSL file …

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    exclude-result-prefixes="oai_dc dc">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/">
    <add>
      <xsl:for-each select="//oai_dc:dc">
        <doc>
          <field name="identifier"><xsl:value-of select="dc:identifier" /></field>
          <field name="title"><xsl:value-of select="dc:title" /></field>
          <field name="creator"><xsl:value-of select="dc:creator" /></field>
          <xsl:for-each select="dc:subject">
            <field name="subject"><xsl:value-of select="." /></field>
          </xsl:for-each>
          <field name="description"><xsl:value-of select="dc:description" /></field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>

The Millionaire and his wife … er, wrong show [http://www.imdb.com/title/tt0057751/]. I mean the sample Solr XML snippet ...
<add>
  <doc>
    <field name="identifier">http://hdl.loc.gov/loc.mbrsmi/amrlv.4007</field>
    <field name="title">[Theater commercial--electric refrigerators]. Buy an electric refrigerator /</field>
    <field name="creator">AFI/Kalinowski (Eugene) Collection (Library of Congress)</field>
    <field name="subject">Refrigerators.</field>
    <field name="subject">Advertising--Electric household appliances--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Trade shows--Pennsylvania--Pittsburgh.</field>
    <field name="subject">Silent films.</field>
    <field name="subject">Pittsburgh (Pa.)--Manufactures.</field>
    <field name="description">Largely graphic commercial for electric refrigerators in general and a refrigerator show, presumably in Pittsburgh, in particular.</field>
  </doc>
  ...
</add>

Some Python for fun ...

# Written for Python 3.
import urllib.request
from lxml import etree  # see: http://lxml.de/

## some OAI metadata from the Library of Congress
url = 'http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&metadataPrefix=oai_dc&set=papr'
metadata = urllib.request.urlopen(url).read()
metadata = etree.XML(metadata)

## the XSL file that will transform the OAI metadata to Solr
xsl = etree.parse('OAI_to_solr.xsl')

## XSL transformation
style = etree.XSLT(xsl)
result = style(metadata)

## the outputted Solr XML
with open('for_solr-PY.xml', 'w', encoding='utf-8-sig') as fw:
    fw.write(str(result))

And most importantly, the introduction to Buck Rogers in the 25th Century - Season 1, of course! I couldn't even make it through the first ten minutes of the Season 2 opener. I mean, they changed the introduction, which was brilliant and brilliantly narrated - as you shall see!

[EMBED]

I'd prefer to watch the South Park spoof over the Season 2 insult-to-perfection any day of the week.

[EMBED]

And here's a bad-ass fan trailer that I think respects the greatness of the first season.
IFRAME: http://www.youtube.com/embed/4szGxaKF8Qw
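
P.S. Back to that XML config file that's supposed to tell the code where to get the metadata and which XSL file to use: here's a rough Python sketch of what reading one might look like. To be clear, the config layout is purely hypothetical. The element names (poaindexter, source, url, xsl, output), the "loc_papr" label, and the read_config function are all made up for illustration and aren't part of any existing pOAIndexter code. Only the URL and file names come from the scripts above.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple

# Hypothetical config layout: every element and attribute name here is an
# assumption made for this sketch, not an existing pOAIndexter format.
SAMPLE_CONFIG = """\
<poaindexter>
  <source name="loc_papr">
    <url>http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&amp;metadataPrefix=oai_dc&amp;set=papr</url>
    <xsl>OAI_to_solr.xsl</xsl>
    <output>for_solr-PY.xml</output>
  </source>
</poaindexter>
"""


class Source(NamedTuple):
    """One metadata source: where to fetch it, how to transform it, where to write it."""
    name: str
    url: str
    xsl: str
    output: str


def read_config(config_xml: str) -> List[Source]:
    """Parse the (hypothetical) config XML into a list of Source tuples."""
    root = ET.fromstring(config_xml)
    sources = []
    for node in root.findall("source"):
        sources.append(Source(
            name=node.get("name", ""),
            url=node.findtext("url", "").strip(),
            xsl=node.findtext("xsl", "").strip(),
            output=node.findtext("output", "").strip(),
        ))
    return sources


if __name__ == "__main__":
    # Each entry would then be fed to the fetch / transform / write steps.
    for source in read_config(SAMPLE_CONFIG):
        print(source.name, source.url, source.xsl, source.output)
```

Note that the ampersands in the OAI URL have to be escaped as &amp;amp; inside the XML config; the parser hands back the plain URL. With something like this in place, the hard-coded URL and file names in the scripts would just become a loop over whatever read_config returns.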