This is a quick progress report on my webservice for optical character recognition using the open source Tesseract engine. It builds on my post from a month back, ‘Tesseract OCR to read plaques’.
The immediate goal is to give the OpenPlaques folk an automatic service which machine-reads English Heritage plaques (blue plaques, very common at historic sites in the UK) from their Flickr photos and then squirts out the English text. Currently volunteers are transcribing the text by hand.
Below you’ll see a quick demo. I’ve used the bottle.py microframework to run my webservice: it takes a URL to an image, converts it to a TIFF image, passes it into Tesseract and presents the recognised text as plain text output.
This isn’t live on the web yet (it needs a bit more work) but shortly it’ll be up for public use.
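The URL-to-text pipeline described above can be sketched roughly as follows. This is not the service's actual code, just a minimal illustration under some assumptions: the TIFF conversion is done with ImageMagick's `convert` command (the post doesn't say which tool is used), the `tesseract` binary is on the PATH, and the `/ocr` route and the `url` query parameter are hypothetical names chosen for the example.

```python
import os
import subprocess
import tempfile
import urllib.request


def convert_command(src_path, tiff_path):
    """ImageMagick call that rewrites an image as a TIFF (assumed tool)."""
    return ["convert", src_path, tiff_path]


def tesseract_command(tiff_path, output_base, lang="eng"):
    """Tesseract writes the recognised text to <output_base>.txt."""
    return ["tesseract", tiff_path, output_base, "-l", lang]


def ocr_image_url(url):
    """Download an image, convert it to TIFF and return the recognised text."""
    with tempfile.TemporaryDirectory() as workdir:
        src_path = os.path.join(workdir, "source_image")
        tiff_path = os.path.join(workdir, "page.tif")
        output_base = os.path.join(workdir, "result")

        urllib.request.urlretrieve(url, src_path)
        subprocess.run(convert_command(src_path, tiff_path), check=True)
        subprocess.run(tesseract_command(tiff_path, output_base), check=True)

        with open(output_base + ".txt", encoding="utf-8") as f:
            return f.read()


def make_app():
    """Wrap the pipeline in a bottle.py app (hypothetical /ocr route)."""
    # Imported here so the helpers above can be used without bottle installed.
    from bottle import Bottle, abort, request, response

    app = Bottle()

    @app.get("/ocr")
    def ocr():
        url = request.query.get("url")
        if not url:
            abort(400, "Missing 'url' query parameter")
        response.content_type = "text/plain; charset=utf-8"
        return ocr_image_url(url)

    return app
```

Calling `make_app().run(host="localhost", port=8080)` would start a local development server, after which a request such as `/ocr?url=<image address>` returns the recognised text. A public version would also need to validate the incoming URL and limit download sizes.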
Update: following this Tesseract image clean-up advice (isolate the text region, threshold, convert to b&w) I can extract very clean text. Contrast the results below with what you see in the video.
IN
THIS HOUSE
LIVED
RALPH ELLIS
1885 – 1963
ARTIST
PAINTER & DESIGNER
OF
INN SIGNS (Note – I extracted the inner circle so Sussex isn’t shown)
THIS WALKWAY
WAS DONATED BY
JEAN & BRIAN CROSSLEY
OP BROCKHAM (Note – 1 typo here with OP)
TO CELEBRATE
JEAN’S 80th BIRTHDAY
DECEMBER 27
2007
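The clean-up steps mentioned in the update (isolate the text region, threshold, convert to b&w) can be illustrated on a plain grid of greyscale values. This is only a sketch of the idea, not the procedure actually used for the results above: the function names, the fixed cutoff of 128 and the `invert` option are all assumptions made for the example. The `invert` option is there because blue plaques carry light lettering on a darker background, while Tesseract generally does better with dark text on a light background. In practice an image library or editor would do this work on the real photo.

```python
def crop(pixels, left, top, right, bottom):
    """Keep only the rows and columns inside the given box (the text region)."""
    return [row[left:right] for row in pixels[top:bottom]]


def threshold(pixels, cutoff=128, invert=False):
    """Map every greyscale value (0-255) to pure black (0) or white (255)."""
    light, dark = (0, 255) if invert else (255, 0)
    return [[light if p >= cutoff else dark for p in row] for row in pixels]


def clean_for_ocr(pixels, box, cutoff=128, invert=False):
    """Isolate the text region, then reduce it to black and white."""
    left, top, right, bottom = box
    return threshold(crop(pixels, left, top, right, bottom), cutoff, invert)


# A tiny 3x3 "image" of greyscale values, purely illustrative.
sample = [
    [10, 200, 30],
    [220, 40, 250],
    [5, 5, 5],
]
cleaned = clean_for_ocr(sample, box=(1, 0, 3, 2), invert=True)
```

Here the box keeps the top-right 2x2 corner of the sample, and `invert=True` turns the bright values black so that light lettering ends up as dark text.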
Hi,
I developed a webservice for OCR, it’s not based on Tesseract, but it’s free to use.
See http://www.free-ocr.co.uk/ocr.asmx
Dan