I KEA, you Maui: some term extraction distraction

I was at the ALA annual conference [http://alaac15.ala.org/] last week. I left San Francisco (SF) at 9pm their time and got back Tuesday at 9am Charleston time. At my age, I've attained just enough street smarts to have taken the rest of the week off.

The highlight of my trip had nothing to do with work. I was able to catch up with my friend Jan, an actress and singer who moved to SF from Charleston over 5 years ago. If you're in the SF area, check out Killing My Lobster [http://www.killingmylobster.com]. After the show, I threw a sketch idea out to the writers regarding librarians and "shh"-ing, so hopefully that idea takes hold at some point.

In terms of work, I had a lot of good conversations with potential customers as well as other vendors. I think I even convinced one vendor to improve one of their products in a way I'd like to see. I can't say more than that. What I will say is that if you want to see change, you've got to have an angle, a way of showing people that what you're advocating will benefit them and not just you. Anyway, enough of that.

The other interesting conversation was with someone who works for a company that creates metadata for books, aligning them to controlled learning curriculum vocabularies, age appropriateness, reading levels, etc. Actually, she doesn't just work for them. She's the owner of the company, so she works for herself. I asked if there was any automated extraction of the controlled vocabulary terms prior to the manual editing. She said there was and that the technology was developed in-house.

All the while I was thinking of something I've been meaning to investigate more ever since I learned about KEA [http://www.nzdl.org/Kea/index_old.html] and started playing around with term extraction services like Open Calais and Alchemy API. I'd even suggested a while back that the DPLA (which really should be called the Public Digital Library of America but I digress ...) provide a service like this. And, of course, it needs a silly name. I suggested DPLAgänger and I was kinda/pretty serious.

    DPLA should leverage its metadata and accompanying Subject terms to have a web service that can return Subject terms based on the text submitted (a la Open Calais or Alchemy API or others). Should return terms using different library-land taxonomies. Maybe look into using it with Kea: http://www.nzdl.org/Kea/

    DPLAgänger? (you wanted pithy)

    I'm thinking about doing it. This would be a tool for lazy people (like me) who want OK but not great automated subject terms to be generated.

    Nitin Arora
    nitin@nclive.org
    NC LIVE

from: http://dp.la/info/developers/ideas-and-projects/current-ideas/

I never got anywhere with KEA, nor did I try, but a few months ago I saw this page: NLP keyword extraction tutorial with RAKE and Maui [https://www.airpair.com/nlp/keyword-extraction-tutorial]. The page links to a Python term extractor and a Java one called Maui, which is the second coming of KEA. Unlike the Python one (RAKE), Maui can extract terms from a controlled vocabulary as notated in things like SKOS files. Obviously, this has some serious implications for automating the assignment of Library of Congress topical, geographic, temporal terms, etc. to a given text. These things aren't perfect and I wouldn't implicitly trust them to, say, auto-create LCSH headings in MARC records, but it's something to think about. It could certainly be used to link one's digital collections to other stuff on the web without the API usage restrictions that are imposed by 3rd party services like Open Calais (which didn't provide LCSH headings last time I checked).

The tutorial [https://www.airpair.com/nlp/keyword-extraction-tutorial] is really good, so I don't really need to add to it.
I'll just add that you can get several SKOS vocabularies at http://www.w3.org/2001/sw/wiki/SKOS/Datasets. I was able to have some fun with the New York Times subject descriptors [http://data.nytimes.com/descriptors.rdf], but the LC subject headings SKOS file [http://id.loc.gov/static/data/authoritiessubjects.rdfxml.skos.zip] is really large (consider yourself warned), so I'm not going to play with it just yet. It's got some encoding things I need to look into and my computer just freezes up on that file due to the size.

Anyway, if I train Maui per the tutorial to use the NY Times SKOS file like so:

    java -Xmx1024m -jar maui-standalone-1.1-SNAPSHOT.jar train -l data/docs/fao_train/ -m data/models/term_assignment_model -v ../descriptors.rdf -f skos

I get some warnings like so:

    02 Jul 2015 12:01:28 WARN MauiFilter - Warning! This documents does not contain valid keyphrases
    02 Jul 2015 12:01:28 WARN MauiFilter - Agricultural development Development policies Rural development Socioeconomic development Structural change
    02 Jul 2015 12:01:28 WARN MauiFilter - Warning! This documents does not contain valid keyphrases
    ...

I haven't investigated what those warnings mean. At the end of the day, it's not as important as music or reading to me and I'd ideally like to get someone else to do it (see earlier comment about convincing people to do what you want).

But warnings aside, I ran Maui on one of the test documents from the tutorial like so:

    java -Xmx1024m -jar maui-standalone-1.1-SNAPSHOT.jar run data/docs/fao_test/w2167e.txt -m data/models/term_assignment_model -v ../descriptors.rdf -f skos

and here was the output:

    02 Jul 2015 11:59:18 INFO Vocabulary - --- Loading RDF model from the SKOS file...
    02 Jul 2015 11:59:19 INFO Vocabulary - --- Building the Vocabulary index from the RDF model...
    02 Jul 2015 11:59:19 INFO Vocabulary - --- Statistics about the vocabulary:
    02 Jul 2015 11:59:19 INFO Vocabulary - 498 terms in total
    02 Jul 2015 11:59:19 INFO Vocabulary - 0 non-descriptive terms
    02 Jul 2015 11:59:19 INFO Vocabulary - 0 terms have related terms
    Keyword: Food 0.011338100102145046
    Keyword: Theater 0.0022471910112359544
    Keyword: Education and Schools 0.0022471910112359544
    Keyword: Child Care 0.0022471910112359544
    Keyword: Trees and Shrubs 0.0022471910112359544
    Keyword: Sociology 0.0022471910112359544
    Keyword: Wines 0.0022471910112359544
    Keyword: Science and Technology 0.0022471910112359544
    Keyword: Heart 0.0022471910112359544
    Keyword: Evolution 0.0022471910112359544

Now, things get really interesting if I look up the element in the NY Times SKOS file for the SKOS label of "Food", which was the highest ranking term that was extracted.

    <rdf:Description rdf:about="http://data.nytimes.com/66708909499644641280">
      <skos:scopeNote xml:lang="en">Used for articles that discuss the topic of food in general rather than focusing on any specific kind of food, or any of the more specific food topics listed under Related Terms.<br></skos:scopeNote>
      <owl:sameAs rdf:resource="http://data.nytimes.com/food_des"/>
      <skos:definition xml:lang="en">News and reviews of street-food vendors in New York.</skos:definition>
      <owl:sameAs rdf:resource="http://rdf.freebase.com/ns/en.food_industry"/>
      <nyt:latest_use rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2010-06-14</nyt:latest_use>
      <nyt:number_of_variants rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</nyt:number_of_variants>
      <owl:sameAs rdf:resource="http://dbpedia.org/resource/Food"/>
      <nyt:associated_article_count rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1949</nyt:associated_article_count>
      <nyt:search_api_query rdf:datatype="http://www.w3.org/2001/XMLSchema#string">http://api.nytimes.com/svc/search/v1/article?query=+nytd_des_facet%3A%5BFood%5D&rank=newest&fields=abstract,author,body,byline,classifiers_facet,column_facet,date,day_of_week_facet,des_facet,desk_facet,fee,geo_facet,lead_paragraph,material_type_facet,multimedia,nytd_byline,nytd_des_facet,nytd_geo_facet,nytd_lead_paragraph,nytd_org_facet,nytd_per_facet,nytd_section_facet,nytd_title,nytd_works_mentioned_facet,org_facet,page_facet,per_facet,publication_day,publication_month,publication_year,section_page_facet,small_image_height,small_image_url,small_image_width,source_facet,title,url,word_count,works_mentioned_facet</nyt:search_api_query>
      <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/subjects/s/street_food_in_new_york/index.html"/>
      <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
      <skos:inScheme rdf:resource="http://data.nytimes.com/elements/nytd_des"/>
      <nyt:first_use rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2004-09-03</nyt:first_use>
      <skos:prefLabel xml:lang="en">Food</skos:prefLabel>
    </rdf:Description>

Because the NY Times has mapped other vocabularies to theirs, now there's the possibility of, say, going to the DBpedia page notated in this element:

    <owl:sameAs rdf:resource="http://dbpedia.org/resource/Food"/>

From there, I've got more data to think about playing with, including a thumbnail image for DBpedia's concept of "Food" ... all through automation, of course.

IMAGE: "DBpedia image for food" [https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Good_Food_Display_-_NCI_Visuals_Online.jpg/300px-Good_Food_Display_-_NCI_Visuals_Online.jpg]

But there are more important things to do, like listening to the radio feed for Wimbledon (Go Federer [http://dbpedia.org/page/Roger_Federer]).
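
P.S. For anyone who'd rather tinker than listen to tennis, here's a rough Python sketch of the "all through automation" hop described above: take Maui's top-ranked term, find the concept with that skos:prefLabel, and pull out its owl:sameAs links. Big caveats: the function names (parse_keywords, same_as_links) and the two SAMPLE_ strings are mine and purely illustrative. The sample XML is a trimmed stand-in for descriptors.rdf, and I'm assuming the "Keyword: <term> <score>" line format shown in the output above holds in general. It only uses Python's standard library.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF, SKOS and OWL.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Hypothetical, trimmed-down stand-in for the NY Times descriptors.rdf file.
SAMPLE_SKOS = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://data.nytimes.com/66708909499644641280">
    <owl:sameAs rdf:resource="http://data.nytimes.com/food_des"/>
    <owl:sameAs rdf:resource="http://rdf.freebase.com/ns/en.food_industry"/>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/Food"/>
    <skos:prefLabel xml:lang="en">Food</skos:prefLabel>
  </rdf:Description>
</rdf:RDF>"""

# A few of the keyword lines from the run above (assumed output format).
SAMPLE_OUTPUT = """Keyword: Food 0.011338100102145046
Keyword: Theater 0.0022471910112359544
Keyword: Education and Schools 0.0022471910112359544"""


def parse_keywords(maui_output):
    """Return (term, score) pairs from 'Keyword: <term> <score>' lines."""
    prefix = "Keyword: "
    pairs = []
    for line in maui_output.splitlines():
        if line.startswith(prefix):
            # Terms can contain spaces, so split off only the trailing score.
            term, score = line[len(prefix):].rsplit(" ", 1)
            pairs.append((term, float(score)))
    return pairs


def same_as_links(skos_xml, label):
    """Return the owl:sameAs URIs of every concept whose prefLabel is `label`."""
    root = ET.fromstring(skos_xml)
    links = []
    for concept in root.findall("rdf:Description", NS):
        labels = [el.text for el in concept.findall("skos:prefLabel", NS)]
        if label in labels:
            links.extend(
                el.get(RDF_RESOURCE) for el in concept.findall("owl:sameAs", NS)
            )
    return links


# Highest-scoring term -> its linked-data neighbors -> just the DBpedia one.
top_term, top_score = max(parse_keywords(SAMPLE_OUTPUT), key=lambda p: p[1])
links = same_as_links(SAMPLE_SKOS, top_term)
dbpedia = [uri for uri in links if uri.startswith("http://dbpedia.org/")]
print(top_term, dbpedia)
# prints: Food ['http://dbpedia.org/resource/Food']
```

To try this on the real file, you'd presumably read descriptors.rdf from disk and pass its contents in, but I haven't checked the sketch against the full file, so treat it as a starting point and not a finished tool.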