blog.humaneguitarist.org

soup and sandwich: HTML to plain text with Lynx

[Sat, 13 May 2017 16:49:16 +0000]
The other day, I wrote a post about signature detection [http://blog.humaneguitarist.org/2017/05/06/spotting-the-john-hancock-in-emails/]. In that post, I mentioned that I'd soon write about how we are converting HTML emails to plain text.

Even though the natural language processing stuff we're experimenting with doesn't care as much about accurately portrayed whitespace in the plain text output - for example, a line break for a "br" tag, or spacing created via CSS, etc. - the signature detection stuff we're trying is very dependent on accurate whitespace depiction. For example, say this is a signature block in an HTML email:

John Smith
Co-Founder and CEO
Xxxxxxxxx

Now, using run-of-the-mill things to convert this to plain text (BeautifulSoup [https://www.crummy.com/software/BeautifulSoup/], Bleach [http://bleach.readthedocs.io/en/latest/index.html], etc.) might work for this. But, really, I've seen emails where line breaks rely on CSS, and these tools don't consider style code. I'm not saying these tools are designed for our needs. They're both great. They just weren't meant to do what we need done.

And it's not just signature detection concerns. It's a matter of aesthetics, too. Who wants to read or preserve an email that has awkward li ne spacing created by post - processing ? Gross.

I then tried some things specifically meant to provide plain text representations of HTML documents. I looked at html2text [https://pypi.python.org/pypi/html2text] for Python. That didn't do a satisfactory job for the test government emails we have. These emails are full of non-standard weirdness that tools like this weren't able to address.

I then tried a custom PhantomJS script. For a while I was really pushing the PhantomJS thing [http://blog.humaneguitarist.org/2017/01/28/using-phantomjs-to-convert-html-email-with-boo-boos-to-plain-text/]. The output looked good. But it was damn slow, and there would have been a lot of coding and debugging involved to use it as a fixed solution, speed issues aside. This is where it helped to have a co-worker push me. While I was focusing on a beautiful output, he was concerned about performance. And vice versa. So we agreed to look for more options, hoping to find something that addressed both of our concerns.

I played a little with the node.js package node-html-to-text [https://github.com/werk85/node-html-to-text]. This, like html2text, outputs Markdown. It's been a while since I toyed with these, but I remember liking this one better than html2text. Problem is that Markdown isn't what we wanted. It's still markup; it's not real plain text. Interestingly, the documentation specifically mentions email conversion as an intended use. There's also textversionjs [https://github.com/EDMdesigner/textversionjs], which I haven't tried yet. And I may not, because for now we have something that seems to be working well and rather fast. And that's the text browser Lynx [http://lynx.browser.org/].

Now, we still have requirements like the following:

1. move the links' URLs into the text: i.e. <a href="foo">bar</a> to <a href="foo">bar [foo]</a>
2. remove all image tags. Lynx appears to actually output the image tag's alt text into the text representation, so we might want to check for an "alt" attribute and move it into a text node before blowing the image tag away. Otherwise, we'll lose it.

Anyway, this is where BeautifulSoup does come in - to do some alteration per those requirements. For now, we've got a Python module that's doing the alteration and the Lynx conversion.
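To make that alteration step a bit more concrete, here's a rough sketch of what the two requirements could look like with plain BeautifulSoup. The standalone functions here are just stand-ins for the module's ModifyHTML methods, not its actual code - that's linked below.

    from bs4 import BeautifulSoup

    def shift_links(soup):
        # Requirement 1: copy each link's URL into its visible text, e.g. "bar" -> "bar [foo]".
        for a in soup.find_all("a"):
            href = a.get("href")
            if href:
                a.append(" [" + href + "]")
        return soup

    def remove_images(soup):
        # Requirement 2: keep the "alt" text as a text node (if any), then drop the image tag.
        for img in soup.find_all("img"):
            alt = img.get("alt")
            if alt:
                img.replace_with(alt)
            else:
                img.decompose()
        return soup

    html = '<p>See <a href="http://example.com">our site</a>.<img src="logo.png" alt="Xxxxxxxxx logo"></p>'
    soup = BeautifulSoup(html, "html5lib")
    soup = remove_images(shift_links(soup))
    print(str(soup))  # the altered HTML string that would then get handed to Lynx.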
Here's [https://github.com/StateArchivesOfNorthCarolina/tomes_tool/blob/a21f25d0db0b6f1c7e676e6060d0432f08af652b/lib/html_to_text.py] the latest version. If you have a file, "foo.html", that you want to convert, it could go like this in Python:

    # import the module's classes (assuming html_to_text.py is importable from the working path).
    from html_to_text import ModifyHTML, HTMLToText

    # modify HTML.
    html = open("foo.html").read()
    html = ModifyHTML(html, "html5lib")  # BeautifulSoup.
    html.shift_links()    # rewrite links as in requirement 1, above.
    html.remove_images()  # remove images as in requirement 2, above.
    html = html.raw()     # back to string.

    # convert HTML to text.
    h2t = HTMLToText()
    plain = h2t.text(html, is_raw=True)
    print(plain)

At least that's the idea for now. The idea for later might be to look into making this a full-on Python module one can just import via pip. But Lynx is written in C, so I'd have a lot to learn about how to do that. That's out-of-scope for the project I'm getting paid to do though, so it would just be something on the side.
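In the meantime, shelling out to the lynx binary from Python is simple enough. Here's a rough sketch of that kind of call - my own approximation, not the module's code - and the flags (-dump to write the rendered page to stdout, -stdin to read the HTML from standard input, -nolist to suppress the trailing list of references) should be checked against whatever Lynx version you have installed:

    import subprocess

    def lynx_to_text(html):
        # Pipe an HTML string through Lynx and return its plain text rendering.
        result = subprocess.run(
            ["lynx", "-dump", "-stdin", "-nolist"],
            input=html.encode("utf-8"),
            stdout=subprocess.PIPE)
        return result.stdout.decode("utf-8")

    print(lynx_to_text("<p>John Smith<br>Co-Founder and CEO<br>Xxxxxxxxx</p>"))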