blog.humaneguitarist.org

sorta sorting API results with in-memory SQLite

[Sat, 11 Aug 2012 04:36:24 +0000]
I'll try to keep this short because it's looking like the weather is going to be agreeable enough for a nice, long Saturday walk. So, I've been working on a mockup API at work that could, among other things, drive an in-site federated search across things like our EBSCO databases, the other vendor resources available through EBSCO's API using SRU, and of course our own databases with lists of the resources we offer and their descriptions. Using simple textual similarity libraries it's easy to have the API return a text similarity score (a trick I learned working on HammerFlicks!) comparing the query against the title of each item. This way if someone types in "Wall St. Journal" it's easy to highlight (through an HTML/JavaScript page) the hit for "Wall Street Journal" from our own database because that'll be a good text similarity match. Here's a snippet showing the similarity attribute: <?xml version="1.0"?> <nclive_api_response> <results source="ncl_resource_titles"> <result text_similarity_score="86.666666666667"> <title>Wall Street Journal</title> <url>http://www.nclive.org/cgi-bin/nclsm?rsrc=29</url> <description>Full articles from the Wall Street Journal (1981-current).</description> </result> … </results> </nclive_api_response> I was toying with the idea, though, of testing what it would be like to – after the fact – index all the returned results on the fly in Solr or something just to get a relevancy ranking for the results the API returns. Now, this isn't of course arguing that this would be a total relevancy rank across all sources. In other words, if you only pull five items from each "sub-API", each data source mentioned above, then there's no way to say that the first item from Database A is necessarily more relevant than the fifth result of Database B. Anyway, I thought it was stupid to index things behind the scenes in something external just to get an on-the-fly relevancy rank to inject into the API results, when I'd only then have to quickly delete the entire index since I would just be using it to get a score. But what I don't think is too stupid is the idea itself. It's making the argument that "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks. … and so, after a colleague in another department asked if there would be a way to sort items across multiple data sources, I thought to investigate a way to do the indexing and have some kind of ranking/relevancy score done all in memory. Enter SQLite. This is really cool. With SQLite, I can create a full-text index/searchable on-the-fly database in memory that will let me develop some kind of rank per item. Note, one has to have SQLite with FTS3/FTS4 enabled to do full-text with SQLite. Now, the way I'm doing this is to use SQLite's offsets() function to learn – for each search term/word passed to the API – if it or its Porter-based stem matches in the TITLE field (for which each hit gets, say, 2 points) or the DESCRIPTION field (1 point). After getting the total points, I'm dividing the points by the total number of words within the API's TITLE + DESCRIPTION values to get a scaled result between 0 and 1. Anyway, I've got a starter function below (PHP) that would return what I'm calling a "sorta" score. It'll be interesting to work it into the mockup API to see how it works in the real world in trying to sort items from across different sources. And just to be clear, I'm doing this per item. That's to say I do these calculations for one item then delete the in-memory database. In other words, I'm not indexing all the API results in memory and then getting this "sorta" rank per item because the calculation is agnostic of the other items. Now, if I changed the calculation to consider the other items as well, then absolutely there would be a need to index all the items first before assigning a "sorta" score per item. BTW. Get it … "sorta"? … 'cause it's "sort of" a way to sort things from multiple sources. Ha! :P Anyway, the PHP's below followed by another PHP block that uses the function and then an HTML snippet of what gets returned with sample text. And so much for my walk, looks like rain's on the way. Dammit. <?php //clean out special chars, etc. function recharacter_this($htmlstring) { $htmlstring = htmlspecialchars($htmlstring, ENT_QUOTES); $htmlstring = trim($htmlstring); $htmlstring = preg_replace("/[^A-Za-z0-9]\s/", "", $htmlstring); //leave only alpha-numerics and whitespace $htmlstring = preg_replace("/\s+/", " ", $htmlstring); //replace multiple whitespaces with a single space return $htmlstring; } //get a rank score function sorta_this($title, $description, $search_text) { $title = recharacter_this($title); $description = recharacter_this($description); $search_text = recharacter_this($search_text); //re: SQLite/PHP fundamentals, see: http://www.if-not-true-then-false.com/2012/php-pdo-sqlite3-example/ //create memory db $memory_db = null; $memory_db = new PDO('sqlite::memory:'); //errormode set to exceptions $memory_db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); //create table //you must use "VIRTUAL TABLE" for FTS3/4, see: http://www.sqlite.org/fts3.html#section_1_2 $memory_db->exec("CREATE VIRTUAL TABLE box using FTS4 ( id, title, description, tokenize=porter)"); //porter > simple because a search for "tree" matches up against text with "trees" where as "tokenize=simple" tokenization doesn't seem to do this; //granted, Porter stemming has its own problems, but it's better than nothing. $insert = "INSERT INTO box (id, title, description) VALUES('1', '$title', '$description')"; $stmt = $memory_db->exec($insert); //insert values per above $search_text = str_replace(" ", " OR ", $search_text); //making search more liberal $query = "SELECT quote(offsets(box)) as rank FROM box WHERE box MATCH '$search_text' ORDER BY rank"; $result = $memory_db->query($query); //run query per above $score = 0; //start with initial score of Zero $i = 0; //to use during iteration //if query yielded anything ... if ($result) { //there's only one row, but still need to loop foreach($result as $row) { $rank = $row['rank']; preg_match_all("/[a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+\ [a-zA-Z0-9]+/", $rank, $matches); //split at every 4th space, i.e. every quartet returned by SQLite offsets(); see: http://stackoverflow.com/questions/10555698/split-string-after-every-five-words //$matches is a single item array with one array inside it for each quartet; $matches[0] is thus just a plain array foreach ($matches[0] as $match) { if ($match[0] == 1) { //if search hits in TITLE field, get 2 points $score = $score + 2; } else { //if in DESCRIPTION field, get 1 point $score = $score + 1; } $i = $i + 1; } } } $memory_db->exec("DROP TABLE box"); $memory_db = null; $total_words = str_word_count($title) + str_word_count($description); $score = ($score/$total_words); //divide $score by total number of words in TITLE + DESCRIPTION //prevent scores greater than 1, which would only occur with an abnormally small number of total words (essentially <= to the number of words in search terms) if ($score > 1) { $score = 1; } return $score; } ?> Using the function with TITLE and DESCRIPTION (abstract) from this article ... <?php //test sorta_this() function $my_title = ("An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial."); $my_description = ("Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back."); $my_search_text = ("back pain exercise"); $my_score = sorta_this($my_title, $my_description, $my_search_text); echo ("Searching for \"$my_search_text\" in <br /><br />TITLE: <em>$my_title</em> <br /><br />and <br /><br />DESCRIPTION: <em>$my_description</em> <br /><br />yields a \"sorta\" relevancy of<strong> "); echo $my_score . "</strong><br /><br />"; echo ("<hr />Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point.<br />This number is then divided by the total number of words in the TITLE + DESCRIPTION."); ?> The results ... Searching for "back pain exercise" in TITLE: An aerobic walking programme versus muscle strengthening programme for chronic low back pain: a randomized controlled trial. and DESCRIPTION: Objective:To assess the effect of aerobic walking training as compared to active training, which includes muscle strengthening, on functional abilities among patients with chronic low back pain.Design:Randomized controlled clinical trial with blind assessors.Setting:Outpatient clinic.Subjects:Fifty-two sedentary patients, aged 18-65 years with chronic low back pain. Patients who were post surgery, post trauma, with cardiovascular problems, and with oncological disease were excluded.Intervention:Experimental 'walking' group: moderate intense treadmill walking; control 'exercise' group: specific low back exercise; both, twice a week for six weeks.Main measures:Six-minute walking test, Fear-Avoidance Belief Questionnaire, back and abdomen muscle endurance tests, Oswestry Disability Questionnaire, Low Back Pain Functional Scale (LBPFS).Results:Significant improvements were noted in all outcome measures in both groups with non-significant difference between groups. The mean distance in metres covered during 6 minutes increased by 70.7 (95% confidence interval (CI) 12.3-127.7) in the 'walking' group and by 43.8 (95% CI 19.6-68.0) in the 'exercise' group. The trunk flexor endurance test showed significant improvement in both groups, increasing by 0.6 (95% CI 0.0-1.1) in the 'walking' group and by 1.1 (95% CI 0.3-1.8) in the 'exercise' group.Conclusions:A six-week walk training programme was as effective as six weeks of specific strengthening exercises programme for the low back. yields a "sorta" relevancy of 0.065 ___________________________________________________________________________ Hits for each search word in TITLE get 2 points, hits in DESCRIPTION get 1 point. This number is then divided by the total number of words in TITLE + DESCRIPTION.