blog.humaneguitarist.org
okra pie: the actual code I forgot to post
[Sun, 01 Sep 2013 15:08:37 +0000]
A while ago I ran a simple test here [http://blog.humaneguitarist.org/2012/07/14/okra-pie-some-simple-ocrhocr-tests/] that used Tesseract's hOCR output and ImageMagick to overlay invisible OCR text, positioned with CSS, on top of an image in an HTML file.
The result is a page you can search with your browser's "find" function to locate words within the image, or at least those words that the OCR process captured accurately.
I can't remember why I didn't post the Python code, so here it is below. It could easily be modified to write the words and their coordinates to an SQL database. With SQL and PHP on a simple site, that would make a small digital library: one could search for snippets of text, click on a search result and then, via some JavaScript, be taken to the specific portion of the page/image that contains the text.
Maybe I'll actually do something like that one day, but for now here's the code.
'''
@title: Okra Pie
@author: Nitin Arora
- this script will take an argument (e.g. "foo") and assume the existence
  of "foo.tif".
- it will then use Tesseract and ImageMagick to create, in the "output"
  folder, the following:
    - "foo.html" - the Tesseract HOCR/XHTML output,
    - "foo.png" - a PNG version of the TIFF file,
    - "foo.okra.html" - an HTML file with the OCR text overlaid on top of the
      PNG file.
example usage:
    $ python ./okra.py foo
'''
#### import modules
import codecs
import os
import sys
from PIL import Image #http://www.pythonware.com/products/pil/
from lxml.html import fromstring #http://lxml.de/
#### make hocr file with tesseract and PNG with ImageMagick.
try:
    fp = sys.argv[1]
except IndexError:
    print("You must pass the filename prefix for your .tif file.")
    sys.exit(1)
#run the two commands one after the other; neither reads the other's output,
#so "&&" is used here instead of a pipe.
run = "tesseract %s.tif output/%s hocr && convert %s.tif output/%s.png" % ((fp,) * 4)
print(run)
os.system(run)
#### get image size (read from the TIFF; the PNG has the same dimensions).
im = Image.open(fp + ".tif")
im_width, im_height = im.size
#### parse hocr file.
fo = codecs.open("output/" + fp + ".html", "r", "utf-8")
fo_r = fo.read()
root = fromstring(fo_r)
ocrWords = root.findall('.//span[@class="ocr_word"]')
#### place each word and its coordinates into a list as a dictionary.
wordList = []
for ocrWord in ocrWords:
    inner = ocrWord.find('.//span[@class="ocrx_word"]')
    if inner is not None: #skip words that have no inner "ocrx_word" span.
        word = {}
        word["text"] = inner.text_content()
        coordinates = ocrWord.get("title")
        coordinates = coordinates.split(" ")
        coordinates.pop(0) #remove word "bbox" from attribute value.
        #convert the right/bottom edges into a width and a height.
        coordinates[2] = int(coordinates[2]) - int(coordinates[0])
        coordinates[3] = int(coordinates[3]) - int(coordinates[1])
        word["left"] = coordinates[0]
        word["top"] = coordinates[1]
        word["width"] = coordinates[2]
        word["height"] = coordinates[3]
        #keep only words whose top-left corner lies within the image.
        if (int(word["left"]) <= int(im_width)) and (int(word["top"]) <= int(im_height)):
            wordList.append(word)
fo.close()
#### create output HTML file with image and words (overlaid).
fo = codecs.open("output/" + fp + ".okra.html","w","utf-8")
header = """<!DOCTYPE html>
<html>
<head>
<title>Okra Pie</title>
<meta charset="UTF-8" />
<script type="text/javascript">
function hideImage(){
var im = document.getElementById("image");
var ocr = document.getElementById("ocr");
im.style.display = "none";
ocr.style.color = "black";
}
function showImage(){
var im = document.getElementById("image");
var ocr = document.getElementById("ocr");
im.style.display = "block";
ocr.style.color = "transparent";
}
</script>
</head>
<body>
<div id= "image" style="position:absolute;z-index:-1">
<img src="%s.png" />
</div>
""" %fp
fo.write(header)
fo.write('\
<div id="ocr" style="color:transparent;opacity:0.5;background-color:transparent;">\n')
for word in wordList:
    wordSpan = (word["left"], word["top"], word["width"], word["height"], word["height"], word["text"])
    #tag = '\t<span data-X="%s" data-Y="%s" data-W="%s" data-H="%s">%s</span>\n' %wordSpan
    tag = '\
<span style="left:%spx;top:%spx;width:%spx;height:%spx;font-size:%spx;position:absolute;">%s </span>\n' %wordSpan
    #note the whitespace at the end so browsers can search for two or more words with a space in between.
    fo.write(tag)
fo.write("""\
</div>
</body>
</html>""")
fo.close()
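
As for the database idea mentioned at the top, here is a minimal sketch of what storing and searching the words might look like. It is not part of the original script. It assumes SQLite (via Python's built-in sqlite3 module) rather than a separate SQL server with PHP, and the table name "words", its column names, and the functions save_words and find_word are all hypothetical names made up for illustration.

```python
import sqlite3

def save_words(conn, page, words):
    """Store one row per OCR word.

    `words` is assumed to be a list of dictionaries shaped like the
    script's wordList (keys: text, left, top, width, height).
    The table and column names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "page TEXT, text TEXT, x INTEGER, y INTEGER, w INTEGER, h INTEGER)"
    )
    rows = [
        (page, word["text"], int(word["left"]), int(word["top"]),
         int(word["width"]), int(word["height"]))
        for word in words
    ]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

def find_word(conn, term):
    """Return (page, x, y, w, h) for every stored word containing `term`.

    SQLite's LIKE is case-insensitive for ASCII letters by default.
    """
    cursor = conn.execute(
        "SELECT page, x, y, w, h FROM words "
        "WHERE text LIKE ? ORDER BY page, y, x",
        ("%" + term + "%",),
    )
    return cursor.fetchall()
```

In the script above, something like save_words(conn, fp, wordList) after the parsing loop would record the page's words. The coordinates that find_word returns are what a search page would need in order to scroll to, or highlight, the matching region of the image.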