blog.humaneguitarist.org

indexing and searching timed text with Solr

[Sun, 16 Oct 2011 14:54:33 +0000]
I'm still learning about Solr, so maybe this post is much ado about nothing. But according to this [http://lucene.472066.n3.nabble.com/Storing-indexing-and-searching-XML-documents-in-Solr-tp2958452p2959272.html] nabble.com thread, one can't index a source XML document in Solr with its native XML structure intact and then search that structure as one can in an XML database like BaseX [http://basex.org/].

For most things, that's fine. For indexing titles, creators, descriptions, etc., I just need to index the value of a given element so that I can search for that element's value.

But for timed text, it's different. Or at least, it can be. Say I have this DFXP snippet for an audio file with an "id" value of "XYZ":

    <p begin="10.0s" end="30.0s">Hello world!</p>

I need the user to be able to search for the string "Hello world!" or part of it, but I also need to index at least the value of the "begin" attribute so that I can pass it to a page that will play the file "XYZ" starting at the 10 second mark if the user clicks on the "Hello world!" line in their search results. And I don't want the "10" second value to be something they search against, since they might be searching for the string "10" within the text itself.

So I'm wondering how to do that with Solr. Maybe when I learn more I'll discover a better way to do this, but for now I'm thinking I could do the following.

First, I would pretty much index the timed text twice in Solr:

    <doc>
      <field name="id">XYZ</field>
      ...
      <field name="timedText-stripped">Hello world!</field>
      <field name="timedText">Hello World! {10}</field>
    </doc>

After indexing the "id" of the audio file, this would index:

* just the text "Hello world!"
* the text "Hello world!" with the "begin" attribute value in curly braces.
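As a rough sketch of the double-indexing idea above, here is how the two fields might be generated from the DFXP snippet with Python's standard library. The helper name `build_doc` is made up for illustration, and the code assumes the "begin" value is written as plain seconds like "10.0s"; other DFXP time expressions are not handled.

```python
import xml.etree.ElementTree as ET

DFXP_LINE = '<p begin="10.0s" end="30.0s">Hello world!</p>'


def build_doc(audio_id, dfxp_line):
    """Build a Solr <doc> holding the caption text twice (hypothetical helper)."""
    p = ET.fromstring(dfxp_line)
    text = (p.text or "").strip()
    # Assumption: the offset is plain seconds such as "10.0s".
    seconds = int(float(p.get("begin").rstrip("s")))

    doc = ET.Element("doc")
    for name, value in [
        ("id", audio_id),
        ("timedText-stripped", text),                # the field users search
        ("timedText", "%s {%d}" % (text, seconds)),  # display field with marker
    ]:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return doc


doc = build_doc("XYZ", DFXP_LINE)
print(ET.tostring(doc, encoding="unicode"))
```

A real script would loop over every `<p>` element in the DFXP file and would probably need to deal with the DFXP namespace, which this snippet leaves out.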
I guess this way the user could be made to search across the "timedText-stripped" field while, via the XSL that can be passed to Solr to display results, the "timedText" field could be displayed in a manner that links the text "Hello World!" to whatever page will play file "XYZ" starting at the 10 second mark. Basically, by planting the "begin" value in curly braces, I can parse the string for the text and the "begin" value as separate things.

So, here's a really crappy XSL snippet that would do something like that. It assumes a variable "$id" exists that equals "XYZ", the identifier for the example audio file.

    <xsl:for-each select="//field[@name='timedText']">
      <!-- Gets entire element string -->
      <xsl:variable name="whole" select="string(.)"/>
      <!-- Gets text prior to seconds, minus the space before the brace -->
      <xsl:variable name="text" select="normalize-space(substring-before($whole,'{'))"/>
      <!-- Gets seconds value from between the braces at the end of the string -->
      <xsl:variable name="begin" select="substring-before(substring-after($whole,'{'),'}')"/>
      <a href="someMediaPlayer.php?id={$id}&amp;begin={$begin}">
        <xsl:value-of select="$text"/>
      </a>
      <!-- So, I'm saying that "someMediaPlayer.php?id=XYZ&begin=10" would launch a player that would start file XYZ at the 10 second mark. -->
    </xsl:for-each>

The search output would be some HTML code like so:

    <a href="someMediaPlayer.php?id=XYZ&amp;begin=10">Hello World!</a>

It seems weird to index something twice, more or less, but as user Erick says in the nabble.com thread, "You've gotta take off your DB hat and not worry about duplicating data."

But now as I write this, I'm wondering if I can't just index as follows:

    <field name="text">Hello world!</field>
    <field name="begin">10</field>

and trust that for each "text" field there will be a matching "begin" field, so that the two can be used in tandem to create the same HTML link as above. Sounds like I need to play around some more.
:)

Update, September 6, 2012: I wrote a related post [http://blog.humaneguitarist.org/2012/09/05/full-text-searching-of-timed-text-and-a-farewell-to-andy-roddick/] yesterday about searching across timed text with MySQL, and in doing so I realized that the way I was thinking of doing it in Solr was off. Rather than indexing all the timed text for a given recording in one Solr "doc" element, as I outlined in the original post content above, I think it makes much more sense to index each line in its own "doc" element, like so:

    <doc>
      <field name="id">someMediaPlayer.php?source=someFile.mp3&amp;begin=10&amp;end=30</field>
      ...
      <field name="startTime">10</field>
      <field name="stopTime">30</field>
      <field name="timedText">Hello world!</field>
      <field name="source">someFile.mp3</field>
    </doc>

That way there's no need to post-parse any data fields to get the start and stop times. Moreover, rather than constructing the URL to launch that segment of audio, you can just put the URL directly in the "id" field. You can always use Solr's built-in support for facets to facet on the "source" field or some descriptive metadata like "title".

I'll file the original post under the "thinking out loud yet poorly" category.
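For what it's worth, the one-doc-per-line approach from the update could be generated along these lines. This is only a sketch: the `CAPTIONS` list and the `line_to_doc` helper are hypothetical stand-ins for whatever actually parses the timed text, and the output is wrapped in the `<add>` element that Solr's XML update format expects.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical caption data: (begin seconds, end seconds, text).
CAPTIONS = [(10, 30, "Hello world!")]


def line_to_doc(source, begin, end, text, player="someMediaPlayer.php"):
    """Build one Solr <doc> for a single caption line (hypothetical helper)."""
    query = urlencode({"source": source, "begin": begin, "end": end})
    doc = ET.Element("doc")
    for name, value in [
        ("id", "%s?%s" % (player, query)),  # the playback URL doubles as the id
        ("startTime", str(begin)),
        ("stopTime", str(end)),
        ("timedText", text),
        ("source", source),
    ]:
        ET.SubElement(doc, "field", name=name).text = value
    return doc


add = ET.Element("add")
for begin, end, text in CAPTIONS:
    add.append(line_to_doc("someFile.mp3", begin, end, text))

# ElementTree escapes the "&" characters in the URL when serializing.
xml_out = ET.tostring(add, encoding="unicode")
print(xml_out)
```

Letting the XML library do the serializing means the ampersands in the "id" URL get escaped automatically, which is easy to forget when writing the XML by hand.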