blog.humaneguitarist.org

lost in spaCy: adventures in package installation

[Sun, 20 Nov 2016 17:13:17 +0000]
I recently started a new gig for which I'll mostly be doing development work. I get to work from home (or more correctly, home and the coffee shop) and perhaps I'll soon write about the pros and cons of working from home - it's not the paradise that I think a lot of people might imagine it to be. But as a solo guitarist, I'm more than used to the demands of working alone, intensely and for long periods of time.

Anyway, one of the things I need to do for the job involves using some Natural Language Processing (NLP) tools. We're also required to make things work with Python 2.7. One of the packages I need to look at is spaCy [https://spacy.io/]. I also want to look at textacy [http://textacy.readthedocs.io/en/latest/], which is largely built on top of spaCy.

I had some serious installation problems with textacy and had earlier experienced memory issues with spaCy - as in it was taking up about a gig of RAM and grinding my laptop to a near halt.

I still have an old warhorse of a laptop: a ThinkPad that's about 7 years old. It's heavy and has DisplayPort output instead of HDMI, etc. but it works great with Windows 10 and I really don't mind the extra effort required to walk a few miles to the coffee shop with it each day. It is a 32-bit machine so I can't increase the memory past the 4 gigs that it already has.

The spaCy memory issue wasn't just frustrating in and of itself, but also because it was making me feel as if it was time to buy a new laptop. I don't want to do that, especially to try and resolve just one problem - not that I'd mind having a lighter machine.

Back to the textacy installation issues ...

Running $ pip install textacy kept resulting in errors while installing some of the required modules [https://github.com/chartbeat-labs/textacy/blob/master/requirements.txt], so I decided to install them manually with pip. There were still a few for which I had to do something extra, so I'll just talk about those here.
For "backports.csv" it seemed I needed to explicitly install version 1.0.1 ( $ pip install backports.csv==1.0.1 ), otherwise I got a dependency import error when trying to import textacy.

For numpy-mkl, I went to a UC Irvine page [http://www.lfd.uci.edu/~gohlke/pythonlibs/] and downloaded the numpy-mkl wheel [http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy] for Python 2.7, Win 32. I then installed the downloaded wheel via pip.

For scipy, I also used the UC Irvine scipy wheel [http://www.lfd.uci.edu/~gohlke/pythonlibs/#scipy] and installed the downloaded wheel via pip.

I still couldn't get "cld2-cffi" installed, but someone on GitHub reported the same problem [https://github.com/chartbeat-labs/textacy/issues/5] and it looked like the package wasn't required provided I explicitly specified the language of my source document. So, I made sure textacy was uninstalled, downloaded the source from GitHub, removed the import of "cld2-cffi" as recommended in the issue report, deleted the "requirements.txt" file, and installed textacy ( $ [path to textacy source code]/python setup.py install ).

After I did that, importing textacy seems to be working. Even importing spaCy now performs better and doesn't hog my memory. I suspect that the numpy and/or scipy re-installation has something to do with that.

Below is some sample code I used to test if textacy was even working. I took the snippets from the documentation [http://textacy.readthedocs.io/en/latest/api_reference.html#], but needed to make some minor changes because, I guess, I'm using Python 2.7 and I had to explicitly declare strings and even the language parameter as unicode. My printed results are a little different from those in the documentation, but right now I was more focused on not getting errors.
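A side note before the test script itself: when dependencies are installed one at a time like this, it can help to confirm that each one at least imports before trying textacy. The following is only a minimal sketch and not part of the original post; the helper name check_imports is hypothetical, and the import names passed to it are assumptions (for example, that "backports.csv" imports under that same dotted name).

```python
from __future__ import print_function

import importlib


def check_imports(module_names):
    """Try to import each module by name.

    Returns a dict mapping each name to None on success, or to a short
    error description if the import raised an exception.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or an error raised while the package loads
            results[name] = "{}: {}".format(type(exc).__name__, exc)
    return results


def format_report(results):
    """Turn the results dict into printable lines, sorted by module name."""
    lines = []
    for name in sorted(results):
        status = "OK" if results[name] is None else results[name]
        lines.append("{:<15} {}".format(name, status))
    return lines
```

Something like print("\n".join(format_report(check_imports(["numpy", "scipy", "backports.csv", "spacy", "textacy"])))) would then list which of the packages discussed above import cleanly and which do not. It only checks that an import succeeds, not that the package behaves correctly.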
    import textacy

    ### code snippets modified from: http://textacy.readthedocs.io/en/latest/api_reference.html# (retrieved November 19, 2016)

    # added unicode "u" prefix to meet spacy requirements
    content = (u'''The apparent symmetry between the quark and lepton families of the Standard Model (SM) are, at the very least, suggestive of a more fundamental relationship between them. In some Beyond the Standard Model theories, such interactions are mediated by leptoquarks (LQs): hypothetical color-triplet bosons with both lepton and baryon number and fractional electric charge.''')

    # lost the special character in the "title" during all the cutting and pasting
    metadata = {
        'title': 'A Search for 2nd-generation Leptoquarks at vs = 7 TeV',
        'author': 'Burton DeWilde',
        'pub_date': '2012-08-01'}

    # spacy requires unicode language code, i.e. not a string: en = ('en')
    en = (u'en')
    doc = textacy.Doc(content, metadata=metadata, lang=en)

    # print(...) with a single argument behaves the same as the print statement in Python 2.7
    print(doc)
    print(doc.to_bag_of_words(lemmatize=False, as_strings=False))
    print(doc.to_bag_of_terms(ngrams=2, named_entities=True, lemmatize=True, as_strings=True))
    print(doc[49])
    print(doc[:3])

...

Update, November 24, 2016: OK, I was totally wrong. There were still memory issues with spaCy. Luckily, I found out that my laptop actually was 64-bit compatible. So, I was able to use this tutorial [http://www.windowscentral.com/how-upgrade-32-bit-64-bit-version-windows-10] and upgrade to the 64-bit version of Windows Pro for free. It was an all-day affair because I wiped my hard drives and had to re-install everything (and remember to use 64-bit versions of LibreOffice and Firefox, etc.). From there I was able to upgrade to 8 gigs of RAM, the max my processor [http://processors.specout.com/l/986/Intel-Core-i5-520M] supports. Other than the time and labor, it only cost $50.00 to upgrade. That's certainly cheaper than buying an entirely new laptop.
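For anyone unsure which flavor of Python they are running when picking between 32-bit and 64-bit wheels, a quick check from Python itself might look like the sketch below. It is not from the original post, and the function names are made up for illustration. Note that it reports the build of the running interpreter, not what the CPU is capable of: a 32-bit Python on a 64-bit capable machine still reports 32.

```python
from __future__ import print_function

import platform
import struct
import sys


def interpreter_bits():
    """Return the pointer size of the running Python interpreter, in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def describe_environment():
    """Collect a few facts that matter when choosing a prebuilt wheel."""
    return {
        "python_version": "{}.{}".format(sys.version_info[0], sys.version_info[1]),
        "interpreter_bits": interpreter_bits(),
        # Architecture string as reported by the operating system.
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("{}: {}".format(key, value))
```

A 32-bit interpreter is limited in how much memory a single process can address no matter how much RAM is installed, so after a move to a 64-bit operating system the extra memory only helps once a 64-bit Python (and matching wheels) is installed as well.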
I would have hated to let go of this Lenovo T510 even though it's almost 7 years old [http://blog.humaneguitarist.org/2010/04/25/using-expression-encoder-3-to-create-wmv-flash-and-ogg-theora-screencasts/]. It just works.

ps: Happy Thanksgiving to those in the States who celebrate.