blog.humaneguitarist.org

full-text searching of timed text and a farewell to Andy Roddick

[Wed, 05 Sep 2012 22:17:04 +0000]
It's been a while since I had one of my "So, I'm home sick today and wrote this silly, little script" things. Well, here's another one while the antibiotics take root. I've always wanted to do something with offering full-text search against timed-text files and allowing a user to click on a result and skip to the audio segment matching the returned line of timed-text, etc. Hulu has had a BETA version of this kind of thing for a while and I suspect others do too. Well, today I just whipped up a little search API using PHP and MySQL. It's a nice little start and super easy to do. I made a database table using the timed-text data from my SAVS [http://blog.humaneguitarist.org/tag/savs/] project, OpenOffice Calc, and phpMyAdmin. The text is from Shakepeare's Sonnet 130 using a LibriVox recording [http://librivox.org/sonnet-130-by-william-shakespeare/] (version #14, Miller). BTW, parsing DFXP or SRT files and throwing those into a table is easy, but it's not within the scope of this little mock-up. If I send a query for "rare love" to the API as such: <a href="http://blog.humaneguitarist.org/uploads/SAVS/currentVersion/search/?q=rare%20love">http://blog.humaneguitarist.org/uploads/SAVS/currentVersion/search/?q=rare%20love [http://blog.humaneguitarist.org/uploads/SAVS/currentVersion/search/?q=rare%20love]</a> … I get the following JSON response: { "results":{ "result":[ { "text":"And yet, by heaven, I think my love as rare", "highlighted_text":"And yet, by heaven, I think my <mark>love<\/mark> as <mark>rare<\/mark>", "startTime":"49", "stopTime":"54", "source":"sonnet130_shakespeare_njm", "relevance":"4.04993200302124" }, { "text":"I love to hear her speak, yet well I know", "highlighted_text":"I <mark>love<\/mark> to hear her speak, yet well I know", "startTime":"37", "stopTime":"40", "source":"sonnet130_shakespeare_njm", "relevance":"1.64789569377899" } ] } } Note that the text is returned in the "text" field and I'm also trying to return a "highlighted_text" field in which search terms are surrounded by the HTML5 "mark" tag. There's also a relevance score ... of sorts (pun!). It needs a lot of work, but there's enough data returned to launch an audio segment using some HTML5/JavaScript or some Flash or Silverlight API, etc. Hey, it ain't too bad for a bad stomach and some sports-entertainment distractions. Below, I'll paste the CSV file I used to make the table, the PHP script ... and a personal note about the best male American tennis professional of the last decade. Here's the CSV file from the spreadsheet application (note the "line_text" field is full-text indexed in the database): "line_id";"line_text";"start_time";"stop_time";"file_prefix" "1";"Coral is far more red than her lips' red:";"13";"17";"sonnet130_shakespeare_njm" "2";"If snow be white, why then her breasts are dun;";"17";"21";"sonnet130_shakespeare_njm" "3";"If hairs be wires, black wires grow on her head.";"21";"26";"sonnet130_shakespeare_njm" "4";"I have seen roses damask'd, red and white,";"26";"29";"sonnet130_shakespeare_njm" "5";"But no such roses see I in her cheeks;";"29";"32";"sonnet130_shakespeare_njm" "6";"And in some perfumes is there more delight";"32";"34";"sonnet130_shakespeare_njm" "7";"Than in the breath that from my mistress reeks.";"34";"37";"sonnet130_shakespeare_njm" "8";"I love to hear her speak, yet well I know";"37";"40";"sonnet130_shakespeare_njm" "9";"That music hath a far more pleasing sound:";"40";"43";"sonnet130_shakespeare_njm" "10";"I grant I never saw a goddess go, --";"43";"46";"sonnet130_shakespeare_njm" "11";"My mistress, when she walks, treads on the ground:";"46";"49";"sonnet130_shakespeare_njm" "12";"And yet, by heaven, I think my love as rare";"49";"54";"sonnet130_shakespeare_njm" "13";"As any she belied with false compare.";"54";"56";"sonnet130_shakespeare_njm" Here's the PHP script: <?php //GET search words from URL parameter $searchWords = trim($_GET["q"]); //prepare for highlighting keywords $search_array= explode(" ", $searchWords); //prepare for output $output = array(); //connect to database include_once("db_setup.php"); //run query $searchWords = mysql_real_escape_string($searchWords); $query = "SELECT *, MATCH(line_text) AGAINST(\"$searchWords\") AS relevance FROM $table WHERE MATCH(line_text) AGAINST(\"$searchWords\" IN BOOLEAN mode) ORDER BY relevance DESC"; $result = mysql_query($query); if($result) { while($row = mysql_fetch_array($result)) { $line_text = $row["line_text"]; $start_time = $row["start_time"]; $stop_time = $row["stop_time"]; $file_prefix = $row["file_prefix"]; $relevance = $row["relevance"]; //highlight seach words in line_text $highlighted_text = $line_text; foreach ($search_array as $word) { $highlighted_text = str_ireplace($word, "<mark>$word</mark>", $highlighted_text); } $this_output = array("text" => htmlspecialchars($line_text), "highlighted_text" => htmlspecialchars($highlighted_text), "startTime" => $start_time, "stopTime" => $stop_time, "source" => $file_prefix, "relevance" => $relevance); array_push($output, $this_output); } } //send JSON results if (count($output) == 0) { $results = array("results" => "No results."); } else { $result = array("result" => $output); $results = array("results" => $result); } $response = json_encode($results); include_once("indent_json.php"); header("Content-type: application/json; charset=UTF-8"); echo(indent_json($response)); ?> And here's something more important. As a huge tennis fan, today was a melancholy one for me as Andy Roddick played his last match, having just lost a few moments ago to Juan Martin del Potro. The Wikipedia article on Roddick here [http://en.wikipedia.org/wiki/Andy_Roddick] already lists him as retired but the important thing to remember about Roddick is that he achieved more with less than a lot of other players with more talent and was entertaining to watch, win or loose, in big matches. Thanks for the memories!