blog.humaneguitarist.org

a term extraction humbug

[Fri, 25 Dec 2015 01:48:01 +0000]
In a previous post [http://blog.humaneguitarist.org/2015/07/02/i-kea-you-maui-some-term-extraction-distraction/], I'd written a little bit about term extraction and included a link to a tutorial for Maui and RAKE. Since then, it seems one now needs to log in through GitHub in order to see all the code samples ... Anyway, I'd also said that I think it would be nice to have a term extractor for Library of Congress Subject Headings (LCSH). That would be somewhat of a "holy grail" in the library world, I'd suspect. I don't think it can replace human cataloging anytime soon, but it might be useful. Since that post, I downloaded the LCSH headings from the Library of Congress site, cut out all the complex subjects so that only simple headings remained, and made a SKOS file of the results. That took my laptop about 5 hours. I made the SKOS file so I could point Maui at it and have it return LCSH recommendations for a given text. But Maui seemed to reload the SKOS file every time I extracted LCSH terms for a text. I didn't like that because it was slow, and while I messed around a little looking through old Google Groups posts for more information, I've moved on to something else - for now at least. As for Maui, it would certainly be worth investigating the Maui web service [http://entopix.com/maui/] in conjunction with the SKOS file I made, but here's what I've been playing around with ... Unlike Maui, RAKE is fast and has Python implementations (I seemed to get better results with this one [https://github.com/aneesha/RAKE] than with this one [https://github.com/tomaspinho/python-rake]). So I thought maybe something could be done with Python and a SQLite database. To do this, I needed a set of training documents and corresponding subjects. That's to say, for each document in a set, I needed human-assigned LCSH terms alongside the document text itself.
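For anyone who hasn't seen RAKE before, its core trick can be sketched in a few lines of Python. This is a simplified stand-in I wrote for illustration - not either of the linked implementations - and the tiny stopword list is just for the demo (real implementations ship much larger lists):

```python
import re

# Tiny illustrative stopword list; real RAKE setups use a big one
# (e.g. the SMART stoplist).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is",
             "for", "with", "on", "over"}

def rake_keywords(text):
    """Split text into candidate phrases on stopwords/punctuation,
    then score each phrase by summing its words' degree/frequency ratios
    (the basic RAKE scoring scheme)."""
    words = re.split(r"[^a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if not w or w in STOPWORDS:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))

    # Word scores: degree (sum of phrase lengths a word appears in)
    # divided by frequency, which favors words in longer phrases.
    freq, degree = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}

    # A phrase's score is the sum of its member words' scores.
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])
```

Longer multi-word phrases naturally bubble to the top, which is part of why RAKE output pairs nicely with subject-heading-style phrases.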
The idea was that I'd take the RAKE output, correlate that data with the LCSH terms, and write some code so that if one passed in a text string, the code would get the RAKE output for that text, compare it to stored RAKE outputs from the training data, and return possible LCSH terms for the inputted text. But I ran into some problems. Namely, the only free document set with LCSH terms I could find was stuff on Project Gutenberg. Using the Gutenberg plain text version of some documents and the LCSH terms in the RDF metadata [https://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog], I figured I could get my hands on some training data and do the following:

1. Make a list of Gutenberg identifiers for the first 1000 English documents (I got this from the RDF files I downloaded).
2. Extract the LCSH subjects from the RDF files, keeping only the major part of the subject heading. For example, I'd convert "Cats--Diseases" to just "Cats" because the SKOS file I made had no complex subjects.
3. Download a copy of the plain text version of each document.
4. Remove all the text from each document that was about Project Gutenberg.
5. Run the Python RAKE function on the text and store the data I needed in SQLite.

Steps 1 and 2 were easy enough - just busy work. But I was having trouble with Step 3 because I never got anything back after filling in an online form linked to here [http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project#Creating_a_Custom_CD_or_DVD] to get a custom DVD of Gutenberg docs based on a list of identifiers. I then figured I could download the docs from a mirror site, but I would have wanted to ask as a courtesy first, and I also thought that might take a ton of processing time. Step 4 was annoying, but I wrote a little function to more or less remove all the extraneous Gutenberg-specific text. Step 5 was going to be a problem, though, because I found that running RAKE on a few sample Gutenberg texts was a very slow process.
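Steps 2 and 5 above could be sketched roughly as follows. The table names and function signatures here are hypothetical - just one plausible shape for the SQLite side of things, not my actual schema:

```python
import sqlite3

def main_heading(lcsh):
    """Step 2: keep only the major part of a complex heading,
    e.g. 'Cats--Diseases' -> 'Cats'."""
    return lcsh.split("--")[0].strip()

# Step 5 (hypothetical schema): one row per (document, phrase, score),
# plus the human-assigned headings per document.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rake_terms   (doc_id TEXT, phrase TEXT, score REAL);
CREATE TABLE doc_subjects (doc_id TEXT, subject TEXT);
""")

def store_training_doc(doc_id, rake_output, subjects):
    """Store one training document's RAKE phrases and its
    (simplified) LCSH headings."""
    conn.executemany("INSERT INTO rake_terms VALUES (?, ?, ?)",
                     [(doc_id, p, s) for p, s in rake_output])
    conn.executemany("INSERT INTO doc_subjects VALUES (?, ?)",
                     [(doc_id, main_heading(s)) for s in subjects])
    conn.commit()
```

With something like this in place, suggesting terms for a new text becomes a matter of joining its RAKE phrases against `rake_terms` and pulling the associated subjects.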
I was starting to feel like I needed a cloud machine to do this work. I was also concerned that many of the Gutenberg documents were fiction, and I didn't think that was going to be good for training the tool. But then I had a revelation: I didn't need to use LCSH, because getting the sample data was a separate problem from just trying to make the tool, test it, and learn a little more about all this stuff - which was my goal. I already had a document set using a non-LCSH controlled vocabulary ... my blog. Sure, the tags are often silly and nonsensical, but they are controlled in the sense that they all have unique identifiers no matter how many times a tag is used throughout the blog. So I exported my blog to XML with the WordPress Dashboard interface and parsed the data, removing certain HTML elements like "table" and "pre" in order to strip out tabular data, code snippets, etc. I placed the data in the database and I'm getting some interesting results. For example, consider one of my first posts [http://blog.humaneguitarist.org/2009/08/09/xslt-transformations-more-than-meets-the-eye/]. I'd assigned it the following tags:

* JHOVE
* MusicXML
* TEI
* XML
* XSLT
* Zotero

I figured the best early test - if I was even somewhat sort of in the relatively nearby ballpark for a similar sport on an adjacent planet - was whether I could train the database with this document, re-feed the post text to the tool, and get these terms back.
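The element-stripping part of that parsing step might look something like this quick-and-dirty regex sketch (the function name is mine, and a regex is only fine here because I'm cleaning my own posts, not arbitrary HTML):

```python
import re

def strip_elements(html, tags=("table", "pre")):
    """Remove whole elements (opening tag through closing tag) for the
    given tag names - enough to drop tabular data and code snippets
    from a post's HTML before running term extraction on it."""
    for tag in tags:
        html = re.sub(r"<%s\b.*?</%s>" % (tag, tag), " ", html,
                      flags=re.DOTALL | re.IGNORECASE)
    return html
```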
And so I did:

[{ "score" : 1.0, "label" : "JHOVE", "id" : "36", "uri" : "http://blog.humaneguitarist.org/tag/jhove/" },
{ "score" : 1.0, "label" : "MusicXML", "id" : "53", "uri" : "http://blog.humaneguitarist.org/tag/musicxml/" },
{ "score" : 1.0, "label" : "TEI", "id" : "76", "uri" : "http://blog.humaneguitarist.org/tag/tei/" },
{ "score" : 1.0, "label" : "XSLT", "id" : "89", "uri" : "http://blog.humaneguitarist.org/tag/xslt/" },
{ "score" : 1.0, "label" : "XML", "id" : "9", "uri" : "http://blog.humaneguitarist.org/tag/xml/" },
{ "score" : 1.0, "label" : "Zotero", "id" : "91", "uri" : "http://blog.humaneguitarist.org/tag/zotero/" } ]

I then altered the text by removing all six tag phrases from it. That's to say, for example, I removed "Zotero" from the text because that same string appeared in the list of tags; I did the same for the remaining five tags. I then fed the edited text to the tool. The result was that all the score values, which had been "1.0", dropped to "0.716". Fair enough - it wasn't the exact same text, just a similar one. When I proceeded to train the tool with 10 blog posts instead of just one, I got the same results for both the original and altered text. But when I trained it with all of the nearly 200 blog posts I've written, I started seeing more interesting results. Now, when I send the original text to the tool, my original six tags appear at the top, though the score for "MusicXML" dropped from "1.0" to "0.525". There are also some new tags, although with weaker scores.
[{ "score" : 1.0, "label" : "JHOVE", "id" : "36", "uri" : "http://blog.humaneguitarist.org/tag/jhove/" },
{ "score" : 1.0, "label" : "TEI", "id" : "76", "uri" : "http://blog.humaneguitarist.org/tag/tei/" },
{ "score" : 1.0, "label" : "XSLT", "id" : "89", "uri" : "http://blog.humaneguitarist.org/tag/xslt/" },
{ "score" : 1.0, "label" : "XML", "id" : "9", "uri" : "http://blog.humaneguitarist.org/tag/xml/" },
{ "score" : 1.0, "label" : "Zotero", "id" : "91", "uri" : "http://blog.humaneguitarist.org/tag/zotero/" },
{ "score" : 0.525, "label" : "MusicXML", "id" : "53", "uri" : "http://blog.humaneguitarist.org/tag/musicxml/" },
{ "score" : 0.16, "label" : "slashes", "id" : "144", "uri" : "http://blog.humaneguitarist.org/tag/slashes/" },
{ "score" : 0.16, "label" : "command line", "id" : "20", "uri" : "http://blog.humaneguitarist.org/tag/command-line/" },
{ "score" : 0.114, "label" : "Summon", "id" : "465", "uri" : "http://blog.humaneguitarist.org/tag/summon/" },
{ "score" : 0.094, "label" : "JavaScript", "id" : "132", "uri" : "http://blog.humaneguitarist.org/tag/javascript/" },
{ "score" : 0.052, "label" : "APIs", "id" : "122", "uri" : "http://blog.humaneguitarist.org/tag/apis/" },
{ "score" : 0.047, "label" : "syntax highlighting", "id" : "131", "uri" : "http://blog.humaneguitarist.org/tag/syntax-highlighting/" },
{ "score" : 0.047, "label" : "Indian food", "id" : "133", "uri" : "http://blog.humaneguitarist.org/tag/indian-food/" },
{ "score" : 0.047, "label" : "blogging", "id" : "19", "uri" : "http://blog.humaneguitarist.org/tag/blogging/" },
{ "score" : 0.047, "label" : "WordPress", "id" : "94", "uri" : "http://blog.humaneguitarist.org/tag/wordpress/" },
{ "score" : 0.044, "label" : "visualization", "id" : "145", "uri" : "http://blog.humaneguitarist.org/tag/visualization/" },
{ "score" : 0.044, "label" : "Graphviz", "id" : "146", "uri" : "http://blog.humaneguitarist.org/tag/graphviz/" },
{ "score" : 0.044, "label" : "DOT", "id" : "177", "uri" : "http://blog.humaneguitarist.org/tag/dot/" },
{ "score" : 0.044, "label" : "sinuses", "id" : "179", "uri" : "http://blog.humaneguitarist.org/tag/sinuses/" },
{ "score" : 0.044, "label" : "PHP", "id" : "59", "uri" : "http://blog.humaneguitarist.org/tag/php/" },
{ "score" : 0.04, "label" : "Netflix", "id" : "121", "uri" : "http://blog.humaneguitarist.org/tag/netflix/" },
{ "score" : 0.032, "label" : "LilyPond", "id" : "42", "uri" : "http://blog.humaneguitarist.org/tag/lilypond/" },
{ "score" : 0.032, "label" : "MusicSQL", "id" : "52", "uri" : "http://blog.humaneguitarist.org/tag/musicsql/" },
{ "score" : 0.032, "label" : "MySQL", "id" : "56", "uri" : "http://blog.humaneguitarist.org/tag/mysql/" },
{ "score" : 0.032, "label" : "Python", "id" : "64", "uri" : "http://blog.humaneguitarist.org/tag/python/" },
{ "score" : 0.029, "label" : "Hammer Films", "id" : "123", "uri" : "http://blog.humaneguitarist.org/tag/hammer-films/" },
{ "score" : 0.029, "label" : "Sean Connery", "id" : "190", "uri" : "http://blog.humaneguitarist.org/tag/sean-connery/" },
{ "score" : 0.029, "label" : "HammerFlix", "id" : "198", "uri" : "http://blog.humaneguitarist.org/tag/hammerflix/" },
{ "score" : 0.026, "label" : "Flash", "id" : "195", "uri" : "http://blog.humaneguitarist.org/tag/flash/" },
{ "score" : 0.026, "label" : "ActionScript", "id" : "286", "uri" : "http://blog.humaneguitarist.org/tag/actionscript/" },
{ "score" : 0.02, "label" : "business", "id" : "205", "uri" : "http://blog.humaneguitarist.org/tag/business/" },
{ "score" : 0.02, "label" : "Qwikster", "id" : "207", "uri" : "http://blog.humaneguitarist.org/tag/qwikster/" },
{ "score" : 0.014, "label" : "automation", "id" : "14", "uri" : "http://blog.humaneguitarist.org/tag/automation/" },
{ "score" : 0.014, "label" : "Excel", "id" : "26", "uri" : "http://blog.humaneguitarist.org/tag/excel/" },
{ "score" : 0.014, "label" : "OpenOffice", "id" : "58", "uri" : "http://blog.humaneguitarist.org/tag/openoffice/" },
{ "score" : 0.014, "label" : "VBA", "id" : "78", "uri" : "http://blog.humaneguitarist.org/tag/vba/" },
{ "score" : 0.014, "label" : "Visual Basic", "id" : "81", "uri" : "http://blog.humaneguitarist.org/tag/visual-basic/" },
{ "score" : 0.013, "label" : "term extraction", "id" : "289", "uri" : "http://blog.humaneguitarist.org/tag/term-extraction/" },
{ "score" : 0.013, "label" : "dbpedia", "id" : "295", "uri" : "http://blog.humaneguitarist.org/tag/dbpedia/" },
{ "score" : 0.013, "label" : "LCSH", "id" : "498", "uri" : "http://blog.humaneguitarist.org/tag/lcsh/" },
{ "score" : 0.002, "label" : "metasearch", "id" : "487", "uri" : "http://blog.humaneguitarist.org/tag/metasearch/" },
{ "score" : 0.002, "label" : "TLDR", "id" : "508", "uri" : "http://blog.humaneguitarist.org/tag/tldr/" } ]

That's kind of interesting. Kind of. Just to look a little deeper, I wanted to know how often the six tags in question appear across my whole blog. The table below lists each tag and the number of times it appears in my blog as a tag.

tag       count
JHOVE     1
TEI       1
Zotero    1
XML       2
XSLT      7
MusicXML  9

Since "MusicXML" appears more often in my blog, I might want to alter the SQL query to increase its score, rather than punish it, for appearing more often. I'll have to look into my queries and the database structure more, but for now I just threw enough mud at the wall to get something interesting to happen. I'm sure I'll work more on this. Or not. For now, I'm going to assign tags to this post (how ironic), work out, and then go out. Merry Christmas. And here [https://archive.org/details/otr_achristmascarol]'s a radio play version of A Christmas Carol with Ralph Richardson [https://en.wikipedia.org/wiki/Ralph_Richardson] if anyone's interested.

...

Update, December 25, 2015: I forgot to mention something.
As I mentioned, I couldn't get my hands on a document set with assigned LCSH terms, so one of my earlier thoughts was to create a web interface where one could paste in a text document and assign it LCSH terms (from the set I extracted into a SKOS file). When the document and terms were submitted through the interface, the data would be stored in a database. The idea was that a few people could casually paste in documents here and there and assign LCSH terms from a controlled list. The terms were entered via an auto-complete mechanism, so there was no way to manually create subjects. After enough documents were entered, the training process could be executed. That was the idea, and I still think there might be something to crowd-sourcing this kind of thing with a small group of librarians. But for now, like I said, the real need is just to test the logic with any controlled vocabulary. I've pasted screenshots below of the text input screen, the subject assignment screen, and the submission page.

Text Input:
IMAGE: "text input screenshot"[http://blog.humaneguitarist.org/uploads/nessie_input.png]

Subject Assignment:
IMAGE: "subject assignment screenshot"[http://blog.humaneguitarist.org/uploads/nessie_subjects.png]

Document Submission:
IMAGE: "document submission screenshot"[http://blog.humaneguitarist.org/uploads/nessie_submit.png]

...

Another update, December 25, 2015: OK, so I altered the query a bit and now "MusicXML" is rewarded for being more prominent in the blog (i.e. repeatedly correlated with certain RAKE terms).
[{ "score" : 1.032, "label" : "MusicXML", "id" : "53", "uri" : "http://blog.humaneguitarist.org/tag/musicxml/" },
{ "score" : 1.0, "label" : "JHOVE", "id" : "36", "uri" : "http://blog.humaneguitarist.org/tag/jhove/" },
{ "score" : 1.0, "label" : "TEI", "id" : "76", "uri" : "http://blog.humaneguitarist.org/tag/tei/" },
{ "score" : 1.0, "label" : "XSLT", "id" : "89", "uri" : "http://blog.humaneguitarist.org/tag/xslt/" },
{ "score" : 1.0, "label" : "XML", "id" : "9", "uri" : "http://blog.humaneguitarist.org/tag/xml/" },
{ "score" : 1.0, "label" : "Zotero", "id" : "91", "uri" : "http://blog.humaneguitarist.org/tag/zotero/" },
{ "score" : 0.16, "label" : "slashes", "id" : "144", "uri" : "http://blog.humaneguitarist.org/tag/slashes/" },
{ "score" : 0.16, "label" : "command line", "id" : "20", "uri" : "http://blog.humaneguitarist.org/tag/command-line/" },
{ "score" : 0.114, "label" : "Summon", "id" : "465", "uri" : "http://blog.humaneguitarist.org/tag/summon/" },
{ "score" : 0.073, "label" : "JavaScript", "id" : "132", "uri" : "http://blog.humaneguitarist.org/tag/javascript/" },
{ "score" : 0.055, "label" : "APIs", "id" : "122", "uri" : "http://blog.humaneguitarist.org/tag/apis/" },
{ "score" : 0.049, "label" : "Netflix", "id" : "121", "uri" : "http://blog.humaneguitarist.org/tag/netflix/" },
{ "score" : 0.047, "label" : "syntax highlighting", "id" : "131", "uri" : "http://blog.humaneguitarist.org/tag/syntax-highlighting/" },
{ "score" : 0.047, "label" : "Indian food", "id" : "133", "uri" : "http://blog.humaneguitarist.org/tag/indian-food/" },
{ "score" : 0.047, "label" : "blogging", "id" : "19", "uri" : "http://blog.humaneguitarist.org/tag/blogging/" },
{ "score" : 0.047, "label" : "WordPress", "id" : "94", "uri" : "http://blog.humaneguitarist.org/tag/wordpress/" },
{ "score" : 0.044, "label" : "visualization", "id" : "145", "uri" : "http://blog.humaneguitarist.org/tag/visualization/" },
{ "score" : 0.044, "label" : "Graphviz", "id" : "146", "uri" : "http://blog.humaneguitarist.org/tag/graphviz/" },
{ "score" : 0.044, "label" : "DOT", "id" : "177", "uri" : "http://blog.humaneguitarist.org/tag/dot/" },
{ "score" : 0.044, "label" : "sinuses", "id" : "179", "uri" : "http://blog.humaneguitarist.org/tag/sinuses/" },
{ "score" : 0.044, "label" : "PHP", "id" : "59", "uri" : "http://blog.humaneguitarist.org/tag/php/" },
{ "score" : 0.032, "label" : "LilyPond", "id" : "42", "uri" : "http://blog.humaneguitarist.org/tag/lilypond/" },
{ "score" : 0.032, "label" : "MusicSQL", "id" : "52", "uri" : "http://blog.humaneguitarist.org/tag/musicsql/" },
{ "score" : 0.032, "label" : "MySQL", "id" : "56", "uri" : "http://blog.humaneguitarist.org/tag/mysql/" },
{ "score" : 0.032, "label" : "Python", "id" : "64", "uri" : "http://blog.humaneguitarist.org/tag/python/" },
{ "score" : 0.029, "label" : "Hammer Films", "id" : "123", "uri" : "http://blog.humaneguitarist.org/tag/hammer-films/" },
{ "score" : 0.029, "label" : "Sean Connery", "id" : "190", "uri" : "http://blog.humaneguitarist.org/tag/sean-connery/" },
{ "score" : 0.029, "label" : "HammerFlix", "id" : "198", "uri" : "http://blog.humaneguitarist.org/tag/hammerflix/" },
{ "score" : 0.026, "label" : "Flash", "id" : "195", "uri" : "http://blog.humaneguitarist.org/tag/flash/" },
{ "score" : 0.026, "label" : "ActionScript", "id" : "286", "uri" : "http://blog.humaneguitarist.org/tag/actionscript/" },
{ "score" : 0.02, "label" : "business", "id" : "205", "uri" : "http://blog.humaneguitarist.org/tag/business/" },
{ "score" : 0.02, "label" : "Qwikster", "id" : "207", "uri" : "http://blog.humaneguitarist.org/tag/qwikster/" },
{ "score" : 0.014, "label" : "automation", "id" : "14", "uri" : "http://blog.humaneguitarist.org/tag/automation/" },
{ "score" : 0.014, "label" : "Excel", "id" : "26", "uri" : "http://blog.humaneguitarist.org/tag/excel/" },
{ "score" : 0.014, "label" : "OpenOffice", "id" : "58", "uri" : "http://blog.humaneguitarist.org/tag/openoffice/" },
{ "score" : 0.014, "label" : "VBA", "id" : "78", "uri" : "http://blog.humaneguitarist.org/tag/vba/" },
{ "score" : 0.014, "label" : "Visual Basic", "id" : "81", "uri" : "http://blog.humaneguitarist.org/tag/visual-basic/" },
{ "score" : 0.013, "label" : "term extraction", "id" : "289", "uri" : "http://blog.humaneguitarist.org/tag/term-extraction/" },
{ "score" : 0.013, "label" : "dbpedia", "id" : "295", "uri" : "http://blog.humaneguitarist.org/tag/dbpedia/" },
{ "score" : 0.013, "label" : "LCSH", "id" : "498", "uri" : "http://blog.humaneguitarist.org/tag/lcsh/" },
{ "score" : 0.002, "label" : "metasearch", "id" : "487", "uri" : "http://blog.humaneguitarist.org/tag/metasearch/" },
{ "score" : 0.002, "label" : "TLDR", "id" : "508", "uri" : "http://blog.humaneguitarist.org/tag/tldr/" } ]

...

Update, April 12, 2016: I should mention that I've yet again changed the query so that the scores always fall within the range of zero to one, as they should. I also think I'm now doing a better job of suggesting terms that align with RAKE terms across the entire body of training documents. Now, I'm getting results like these ...

{ "score": 1.0, "label": "JHOVE", "id": 36, "uri": "http://blog.humaneguitarist.org/tag/jhove/" },
{ "score": 1.0, "label": "TEI", "id": 76, "uri": "http://blog.humaneguitarist.org/tag/tei/" },
{ "score": 1.0, "label": "Zotero", "id": 91, "uri": "http://blog.humaneguitarist.org/tag/zotero/" },
{ "score": 0.927, "label": "XML", "id": 9, "uri": "http://blog.humaneguitarist.org/tag/xml/" },
{ "score": 0.83, "label": "XSLT", "id": 89, "uri": "http://blog.humaneguitarist.org/tag/xslt/" },
{ "score": 0.753, "label": "MusicXML", "id": 53, "uri": "http://blog.humaneguitarist.org/tag/musicxml/" }
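One way to guarantee scores in the zero-to-one range is to score each tag by phrase overlap between the input text and the training documents carrying that tag. This is just an illustrative sketch of that idea - the function and the in-memory `training` structure are hypothetical, not my actual query or schema:

```python
def suggest_tags(input_phrases, training):
    """Suggest tags for a text given its RAKE phrases.

    `training` maps doc_id -> (set_of_rake_phrases, set_of_tags).
    Each tag's score is the best phrase-overlap ratio over the training
    docs carrying that tag, so it always lands in [0, 1]."""
    input_phrases = set(input_phrases)
    scores = {}
    for phrases, tags in training.values():
        # Fraction of the input's phrases also seen in this doc.
        overlap = len(input_phrases & phrases) / max(len(input_phrases), 1)
        for tag in tags:
            scores[tag] = max(scores.get(tag, 0.0), overlap)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Re-feeding a training document's own phrases would score its tags at exactly 1.0, and similar-but-edited text would fall somewhere below that, which matches the behavior described above.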