Optical Character Recognition webservice work-in-progress

This is a quick progress report on my webservice for optical character recognition using the open source Tesseract engine. It builds on last month’s post, ‘Tesseract OCR to read plaques’.

The immediate goal is to give the OpenPlaques folk an automatic service that machine-reads English Heritage plaques (blue plaques – very common at historic sites in the UK) from their Flickr photos and squirts out the English text. Currently volunteers transcribe the text by hand.

Below you’ll see a quick demo. I’ve used the bottle.py microframework to run my webservice: it takes a URL to an image, converts it to a TIFF, passes it into Tesseract and presents the recognised text as plain-text output.

This isn’t live on the web yet (it needs a bit more work) but shortly it’ll be up for public use.

Update – following this Tesseract image clean-up advice (isolate the text region, threshold, convert to black and white) I can extract very clean text – contrast these results with what you see in the video.
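The clean-up steps can be sketched in a few lines with Pillow. This isn’t the exact code I ran – the crop box and threshold value are placeholders you’d tune per photo:

```python
from PIL import Image


def clean_for_ocr(img, box=None, threshold=128):
    """Isolate the text region, threshold, and convert to black & white."""
    if box is not None:
        # box is (left, upper, right, lower) – the plaque's text region
        img = img.crop(box)
    grey = img.convert('L')  # greyscale first, so thresholding is per-pixel
    # Everything at or above the threshold becomes white, the rest black
    bw = grey.point(lambda p: 255 if p >= threshold else 0).convert('1')
    return bw
```

The result is saved as a TIFF (`bw.save('plaque.tif')`) before handing it to Tesseract, which copes far better with clean two-tone input than with a raw photo.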

IN
THIS HOUSE
LIVED
RALPH ELLIS
1885 – 1963
ARTIST
PAINTER & DESIGNER
OF
INN SIGNS (Note – I extracted the inner circle so Sussex isn’t shown)

THIS WALKWAY
WAS DONATED BY
JEAN & BRIAN CROSSLEY
OP BROCKHAM (Note – 1 typo here with OP)
TO CELEBRATE
JEAN’S 80th BIRTHDAY
DECEMBER 27
2007
