facet mashing, a tragedy in 0.987 acts

Update, March 21, 2012: I'm at DrupalCon 2012 [http://denver2012.drupal.org/] and after going to a session on node.js - which I've had in the back of my head as a potential replacement for Python for some metadata harvesting software I'm working on - I was reminded of OpenCalais [http://www.opencalais.com/], which I haven't looked at in forever, probably because I wouldn't have understood it before. Anyway, maybe that's a solution to the issues I'm describing below in terms of generating some sort of browsable facets. This is definitely something to look into.

...

Home sick again, so that means another meaningless contribution to the "blogosphere" ...

So, I've been working with some folks on a project to make a single site search for digital collections across the state I work in. We're using Solr for the index and OAI feeds for now, even though the metadata harvesting software is agnostic of OAI and can support other feed types, etc. But that's not the point here ...

The point is that metadata coming in from different places makes for a mess if you want to expose facets ... and we might veer toward not showing them because no one wants to get into the murky waters of trying to control for that across multiple places. I think subject facets are still useful though because I like to "play around", to stumble in the dark, and just have fun. But, of course, there's still the fact of the matter that across multiple institutions you might see subjects from one place written as "Asheville, NC" and another as "Asheville, (N.C.)". Well, that stinks. They are essentially the same thing, but would get exposed as two separate facets.

So, in the spirit of stumbling in the dark, last Saturday morning I worked on a preliminary little function in Python to try and merge strings like the Asheville example above. The idea is that the function should present to the user the version that has more "votes", i.e. the one that has more matches in the current search results.
So, if "Asheville, NC" appeared 10 times and "Asheville, (N.C.)" appeared 15 times in the user's search results, the function would display "Asheville, (N.C.)" to the user and say it has 25 matches. When the user clicks "Asheville, (N.C.)" a search would be launched for either "Asheville, (N.C.)" or "Asheville, NC".

Essentially, the idea is to beautify the facets at the last possible moment (i.e. through a function in the user interface) so the user doesn't have to see the ugly reality of metadata from all over the place; it's also about rectifying things based on text similarity, not on semantic similarity - which is another ballgame altogether.

The function uses two known string similarity methods from the python-Levenshtein library: the Jaro score and the Levenshtein distance. Two strings get merged when the Jaro score is above .95, or above .75 with a distance under 10, and in either case only when the distance is greater than 1. It's promising but there's still lots of work to do if I really decide to pursue this. And by "lots of work" I really mean seeing if someone with the proper computer science and linguistic background has already written a library for this kind of thing. And (adding this the day after I originally wrote this), I also need to play with s-match [http://semanticmatching.org/].

Anyway, the test code is below and the results are below that, but I need to stop writing because I'm dropping out and need to take a nap. :/

#####
import Levenshtein  # Windows32/Python 2.7 installer: http://sourceforge.net/projects/translate/files/python-Levenshtein/


def facetMasher(x, y):
    # x and y are (facet string, number of matches) tuples.
    info = "Comparing \"%s\" with %s facets, against \"%s\" with %s facets." % (x[0], x[1], y[0], y[1])
    print(info)
    output = ""

    # Note: Levenshtein.jaro is the plain Jaro similarity, even though the
    # label printed below calls it "Jaro-Winkler".
    myJaro = Levenshtein.jaro(x[0], y[0])
    myDist = Levenshtein.distance(x[0], y[0])
    print("Jaro-Winkler score: %s" % myJaro)
    print("Levenshtein distance: %s" % myDist)

    # Similar enough to be candidates for merging?
    if myJaro > .95 or (myJaro > .75 and myDist < 10):
        # A distance of 1 is too risky ("World War 1" vs. "World War 2").
        if myDist > 1:
            totalFacets = x[1] + y[1]
            # The string with more "votes" wins; ties go to x.
            if x[1] >= y[1]:
                mergedString = x[0]
            else:
                mergedString = y[0]
            output = "Merging to \"%s\" with %s facets." % (mergedString, totalFacets)
    if output == "":
        output = "Keeping \"%s\" with %s facets, and \"%s\" with %s facets." % (x[0], x[1], y[0], y[1])
    print(output)
    print("--\n")

##### tests ...
facetMasher(("Bibles", 3), ("bible", 2))  # interesting ...
facetMasher(("Fibles", 3), ("fible", 2))
facetMasher(("World War 1", 3), ("World War 2", 2))
facetMasher(("Images", 4), ("image", 3))
facetMasher(("Images", 2), ("movies", 3))
facetMasher(("Asheville, NC", 3), ("Asheville (N.C.)", 2))
facetMasher(("Asheville, (NC)", 3), ("Asheville (N.C.)", 2))
facetMasher(("Granville County (N.C.)", 120), ("Granville County, N.C.", 2))
facetMasher(("foo & bar", 3), ("foo and bar", 2))
facetMasher(("United States--History--Civil War, 1861-1865", 3), ("United States--History--Civil War, 1861-1865--Correspondence", 2))
facetMasher(("United States--History--World War II", 3), ("United States--History--World War I", 2))
facetMasher(("United States--History--World War Two", 3), ("United States--History--World War 2", 2))
facetMasher(("United States--History--World War Two", 3), ("United States--History--World War 1", 2))
facetMasher(("United States--History--World War 1", 3), ("United States--History--World War 2", 2))

And here are the results, below. It's interesting how "Bibles" vs. "bible" doesn't merge, yet "Fibles" and "fible" do. (As far as I can tell the culprit is case: the comparison is case-sensitive, so the capital "B" matches nothing, and the lowercase "b" in the middle of "Bibles" pairs up with the leading "b" of "bible", which counts as a transposition and drags the score just under .75.) Also, there are some undesired results, such as merging "United States--History--World War Two" with "United States--History--World War 1", because the algorithm still sucks.

Comparing "Bibles" with 3 facets, against "bible" with 2 facets.
Jaro-Winkler score: 0.738888888889
Levenshtein distance: 2
Keeping "Bibles" with 3 facets, and "bible" with 2 facets.
--

Comparing "Fibles" with 3 facets, against "fible" with 2 facets.
Jaro-Winkler score: 0.822222222222
Levenshtein distance: 2
Merging to "Fibles" with 5 facets.
--

Comparing "World War 1" with 3 facets, against "World War 2" with 2 facets.
Jaro-Winkler score: 0.939393939394
Levenshtein distance: 1
Keeping "World War 1" with 3 facets, and "World War 2" with 2 facets.
--

Comparing "Images" with 4 facets, against "image" with 3 facets.
Jaro-Winkler score: 0.822222222222
Levenshtein distance: 2
Merging to "Images" with 7 facets.
--

Comparing "Images" with 2 facets, against "movies" with 3 facets.
Jaro-Winkler score: 0.666666666667
Levenshtein distance: 4
Keeping "Images" with 2 facets, and "movies" with 3 facets.
--

Comparing "Asheville, NC" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score: 0.891025641026
Levenshtein distance: 5
Merging to "Asheville, NC" with 5 facets.
--

Comparing "Asheville, (NC)" with 3 facets, against "Asheville (N.C.)" with 2 facets.
Jaro-Winkler score: 0.936111111111
Levenshtein distance: 3
Merging to "Asheville, (NC)" with 5 facets.
--

Comparing "Granville County (N.C.)" with 120 facets, against "Granville County, N.C." with 2 facets.
Jaro-Winkler score: 0.955862977602
Levenshtein distance: 3
Merging to "Granville County (N.C.)" with 122 facets.
--

Comparing "foo & bar" with 3 facets, against "foo and bar" with 2 facets.
Jaro-Winkler score: 0.809553872054
Levenshtein distance: 3
Merging to "foo & bar" with 5 facets.
--

Comparing "United States--History--Civil War, 1861-1865" with 3 facets, against "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
Jaro-Winkler score: 0.911111111111
Levenshtein distance: 16
Keeping "United States--History--Civil War, 1861-1865" with 3 facets, and "United States--History--Civil War, 1861-1865--Correspondence" with 2 facets.
--

Comparing "United States--History--World War II" with 3 facets, against "United States--History--World War I" with 2 facets.
Jaro-Winkler score: 0.990740740741
Levenshtein distance: 1
Keeping "United States--History--World War II" with 3 facets, and "United States--History--World War I" with 2 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score: 0.963449163449
Levenshtein distance: 3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War Two" with 3 facets, against "United States--History--World War 1" with 2 facets.
Jaro-Winkler score: 0.963449163449
Levenshtein distance: 3
Merging to "United States--History--World War Two" with 5 facets.
--

Comparing "United States--History--World War 1" with 3 facets, against "United States--History--World War 2" with 2 facets.
Jaro-Winkler score: 0.980952380952
Levenshtein distance: 1
Keeping "United States--History--World War 1" with 3 facets, and "United States--History--World War 2" with 2 facets.
--
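
For what it's worth, here's a rough sketch of the other half of the idea: taking a whole set of facet counts, showing the version with the most "votes" along with the combined total, and building the "either/or" search for when the user clicks it. To keep it self-contained it does NOT use the Jaro/Levenshtein test above; it groups strings with a much cruder stand-in (lowercase everything and throw away punctuation), which is an assumption on my part and will miss plenty, e.g. "Bibles" vs. "bible". The function names, the "subject" field name, and the Solr-style query string are all made up for illustration.

```python
import re
from collections import defaultdict


def normalize(label):
    """Crude grouping key (a stand-in, not the similarity test above):
    lowercase, then keep only letters, digits and single spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", label.lower())
    return " ".join(cleaned.split())


def mash_facets(counts):
    """counts maps a facet string to its number of matches.

    Returns (display_label, total, variants) tuples, biggest total first.
    """
    groups = defaultdict(list)
    for label, n in counts.items():
        groups[normalize(label)].append((label, n))

    merged = []
    for variants in groups.values():
        # The variant with the most "votes" is the one the user sees.
        display = max(variants, key=lambda v: v[1])[0]
        total = sum(n for _, n in variants)
        merged.append((display, total, sorted(label for label, _ in variants)))
    return sorted(merged, key=lambda m: -m[1])


def facet_query(field, variants):
    """Build a Solr-style OR query over every variant of a merged facet.
    The field name is hypothetical."""
    quoted = ['"%s"' % v.replace('"', '\\"') for v in variants]
    return "%s:(%s)" % (field, " OR ".join(quoted))


# The example from the post: 10 votes vs. 15 votes.
counts = {"Asheville, NC": 10, "Asheville, (N.C.)": 15}
for display, total, variants in mash_facets(counts):
    print("%s (%d)" % (display, total))      # Asheville, (N.C.) (25)
    print(facet_query("subject", variants))  # subject:("Asheville, (N.C.)" OR "Asheville, NC")
```

Swapping the pairwise test from facetMasher in place of normalize() would mean comparing every facet against every other one, which is probably fine for the handful of facets shown on one results page but is something to watch.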