a quackish approach to sorting federerated search results

[Tue, 25 Nov 2014 16:45:05 +0000]
Over two years ago I posted this post [] with regard to sorting results in a federated search. That's to say I've been wondering how to logically sort the results from more than one search source (local database, external search API, etc.) into one set of results for some time. The approach I outlined in that post still makes sense to me, but what I've always been thinking about was how to sort two or more sets where the final sort respects the original sort order of each set. Let me explain using the same interview candidate situation I used two years ago. "Look, I've asked these different sources to send me their best stuff and now I'll have a way to rank them with my own criteria … because they're mine now." It's like using your own criteria to rank job candidates after asking a few of your industry friends to each send in their five best employees for the job you're hiring for. You're not necessarily going to agree with how they rank their own employees but you do trust that they've sent you five top notch folks. Say I'm interviewing people and have 3 candidates each from two referring employers, Hansel and Gretel. Hansel Gretel Hansel_1 Gretel_1 Hansel_2 Gretel_2 Hansel_3 Gretel_3 In the table above, Hansel and Gretel have already ranked their candidates in order of their preference, but after I've got their "best", I'll assign my own "interview score" after I talk to them, etc. candidate interview_score referring_employer Hansel_1 3 Hansel Hansel_2 0 Hansel Hansel_3 1 Hansel Gretel_1 4 Gretel Gretel_2 2 Gretel Gretel_3 0 Gretel In terms of search results, this score could be relevance calculated by in-memory SQLite (as I mentioned in the earlier post), edit distance between the user query versus the title of the item, or whatever. For the purpose of explanation it doesn't matter how the score is calculated, it only matters that there is a score and that it's assigned by me, the interviewer, and not by Hansel and Gretel. With a spreadsheet application, I could simply rank the final candidates by my interview score (descending) as in Sort 1, below. SORT 1 candidate interview_score referring_employer Gretel_1 4 Gretel Hansel_1 3 Hansel Gretel_2 2 Gretel Hansel_3 1 Hansel Hansel_2 0 Hansel Gretel_3 0 Gretel That can certainly work. I'm the interviewer and I have the final say. I can also sort by my interview score (descending) AND referring employer (ascending) if I think Gretel is a better employer than Hansel. This puts Gretel_3 above Hansel_2 even though they both have an interview score of "0". SORT 2 candidate interview_score referring_employer Gretel_1 4 Gretel Hansel_1 3 Hansel Gretel_2 2 Gretel Hansel_3 1 Hansel Gretel_3 0 Gretel Hansel_2 0 Hansel Both sorts are fine in the interview example, but in terms of search results I wanted to do something similar but while still respecting the original sort order. In other words, I wanted to sort a set of all candidates without overriding Hansel and Gretel's original sort order. That sort would be as in Sort 3, below. SORT 3 candidate interview_score referring_employer Gretel_1 4 Gretel Hansel_1 3 Hansel Gretel_2 2 Gretel Gretel_3 0 Gretel Hansel_2 0 Hansel Hansel_3 1 Hansel If I pull Sort 3 apart, then it's easy to see that the original sort order per set is preserved. candidate candidate interview_score referring_employer Gretel_1 4 Gretel Hansel_1 3 Hansel Gretel_2 2 Gretel Gretel_3 0 Gretel Hansel_2 0 Hansel Hansel_3 1 Hansel If these were search results, that would allow results to be mixed together while preserving the original sort order returned by the Hansel and Gretel databases. So, instead of a candidate name like "Gretel_1" that data point would not be a person's name but rather descriptive metadata (title, description, etc.) about the most relevant item in the Gretel database for a given search. It took me a long time to figure out an algorithm that could sort two arrays in the manner above. I've been thinking about it here and there since I wrote that post two years ago. I came close a couple of months ago but the code was hard to read, over-written, and buggy. It didn't work because the algorithm didn't work either. It wasn't until last week that I think I got there in terms of a working solution ... or at least I'm much closer. What changed last week was that I was reading about sorting algorithms in Discrete Mathematics And Its Applications [] (7th Edition). While reading about a "bubble sort" I came up with a much, much simpler solution to my problem. It seems similar to a merge sort [] but with the stipulation as above where I have 2 sets and need to sort the results while still respecting the original sort order of each set. The algorithm basically works this way to sort from two ordered sets (we'll call them X and Y): 1. create an empty set, Output. 2. while both the ordered sets X and Y are not empty do the following: 1. compare the current first item in set Y with the current first item in set X. 1. if the score of the Y item is greater than or equal to the score of the X item, append the Y item to Output and then remove the Y item from set Y. 2. otherwise append the X item to Output and remove the X item from set X. 3. append all remaining items in set Y to Output. 4. append all remaining items in set X to Output. 5. return the Output set. Note that the "equal to" condition in step 2.1.1 favors the second set over the first. In other words, if the "equal to" condition were removed, Sort 3 would place Gretel_3 in last place because Hansel_2 still MUST be ranked higher than Hansel_3 because of Hansel's ordering, but Gretel_3's score of "0" is less than Hansel_3's score of 1. I'm suspicious that I missed something and that this has a fatal flaw, but for now here's a Python implementation below that uses the sample table data above to output the results as I wanted in Sort 3. ps: until I get embarrassed and learn this algorithm already exists and has a name, I'll call this "duck sort" because the interview analogy always reminds me of this classic Tracey Ullman Show [] skit with Dan Castellaneta, the voice of Homer Simpson. IFRAME: // Part 2 of the clip is here: []. Python function: def duck_sort(x,y): output = [] while len(y) > 0 and len(x) > 0: first_x = x[0] first_y = y[0] if first_y[1] >= first_x[1]: output.append(first_y) y.remove(first_y) else: output.append(first_x) x.remove(first_x) output = output + y output = output + x return output using the function in the IDLE Python interpreter: >>> Hansel = [("Hansel_1", 3), ("Hansel_2", 0), ("Hansel_3", 1)] >>> Gretel = [("Gretel_1", 4), ("Gretel_2", 2), ("Gretel_3", 0)] >>> # to preserve lists Hansel and Gretel, do: >>> # results = duck_sort(list(Hansel), list(Gretel)) >>> results = duck_sort(Hansel, Gretel) >>> for result in results: print result ('Gretel_1', 4) ('Hansel_1', 3) ('Gretel_2', 2) ('Gretel_3', 0) ('Hansel_2', 0) ('Hansel_3', 1) ... Update: Clearly I meant the title of this post to have the phrase "federated search" instead of "federerated search", but it must have been a Freudian slip. After all, Roger Federer did just help the Swiss win the Davis Cup!