A month back I posted about my progress with the automatic English Heritage plaque transcription project for OpenPlaques, using optical character recognition (via tesseract) and Python. The post mentions a monthly cash prize for progress towards a solution…
A few days back Jonathan Street announced his entry in the challenge’s thread – he’d beaten my initial average error of 709 (it was in part designed to be easy to beat!) and quickly brought it down to 33.4. Jonathan becomes the winner of the A.I.Cookbook’s first challenge; the challenge now rolls on into this month with the same prize on offer.
I’ll be presenting the results at the open day for the Open Plaques project (sponsored by the Royal Society of the Arts) on 25th September and I hope to be able to demonstrate that, for a few plaques at least, we can automatically get a good transcription.
In Jonathan’s write-up he describes the main steps and includes full Python source:
- image pre-processing to find the blue regions
- restricting tesseract’s character set
- spell checking
- word clean-up (to fix things like dates)
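To make the first and third steps concrete, here’s a minimal sketch in plain Python. The colour thresholds, the toy word list and the function names are my own illustrative guesses, not Jonathan’s actual values (his write-up has the full source); the character-set restriction itself is done in tesseract via its `tessedit_char_whitelist` config variable rather than in Python.

```python
import difflib

def is_plaque_blue(r, g, b):
    # Crude colour test: keep pixels where blue clearly dominates red
    # and green. The 1.5 ratio and the 100 floor are guesses, not
    # values from Jonathan's entry.
    return b > 100 and b > 1.5 * r and b > 1.5 * g

def mask_blue(pixels):
    # pixels is a list of rows of (r, g, b) tuples; returns a 0/1 mask
    # marking plaque-coloured regions to crop before running tesseract.
    return [[1 if is_plaque_blue(*px) else 0 for px in row]
            for row in pixels]

# Tiny word list standing in for a real dictionary / spell checker.
WORDS = ['lived', 'here', 'physicist', 'born']

def spell_fix(token, words=WORDS):
    # Replace an OCR'd token with its closest dictionary word, if any
    # is similar enough; otherwise leave it untouched.
    matches = difflib.get_close_matches(token.lower(), words,
                                        n=1, cutoff=0.7)
    return matches[0] if matches else token
```

For example, `mask_blue` keeps a strongly blue pixel and rejects a grey one, and `spell_fix` repairs a one-character OCR slip like “phvsicist”.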
He’s taken some of the ideas I listed in the wiki and pushed them further – I’m particularly happy with the blue region detection as that felt like an obvious first step that I hadn’t attempted.
David Rawlinson also posted in the thread about some ideas taken from his experience with automatic number plate recognition (ANPR) so we can correct mis-recognised characters (e.g. 0/O and 1/l/L/i/I are easily confused by OCR!).
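One simple way to apply that idea is to decide from context whether a token should be numeric (a date) or alphabetic, then map each easily-confused character to its counterpart in that set. The confusion pairs and the “half the characters are digits” heuristic below are my own sketch, not David’s actual method:

```python
# Map confusable letters to digits (for tokens we believe are numbers)
# and confusable digits back to letters (for tokens we believe are words).
TO_DIGIT = str.maketrans('OoIlSB', '001158')
TO_ALPHA = str.maketrans('015', 'OlS')

def fix_confusions(token):
    # Heuristic: if at least half the characters are already digits,
    # assume the token is a number (e.g. a year on a plaque).
    digit_count = sum(ch.isdigit() for ch in token)
    if digit_count >= len(token) / 2:
        return token.translate(TO_DIGIT)
    return token.translate(TO_ALPHA)
```

So a mis-read year like “l9O5” becomes “1905”, while “B0RN” in a word context becomes “BORN”.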
The competition runs on: the new deadline is Thursday September 23rd, so I can present our progress on the 25th at the OpenPlaques event. If nobody beats Jonathan’s result by then, he becomes the winner by default. I’ll be adding more ideas for improving the result to the main wiki page. Join the Google Group if you’d like to offer ideas and get involved.