I’m working with the OpenPlaques folk to create a system that automatically ‘reads’ images of English Heritage plaques and extracts a transcript of the plaque’s text. This is a classic optical character recognition project. Here’s a simple example (thanks Fiery Fred):
The text is very easy for a human to read but very hard for a computer to extract. Thankfully there’s a great open-source OCR system (tesseract) that was originally developed at HP and open-sourced several years back.
The goal of this project is to automatically transcribe the text from several thousand plaques (taken using different cameras and phones in varying lighting conditions at various angles and distances) so that the human sysops don’t have to do the transcription work by hand!
Currently there’s a Python demo file which retrieves three example plaques from Flickr, passes them through tesseract, and then uses the Levenshtein distance between the OCR output and a manually transcribed string as an error metric.
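The core of that pipeline can be sketched as below. The function names (`transcribe`, `ocr_error`) are my own illustrative choices, not the demo file’s; the sketch assumes the `tesseract` command-line binary is installed, and it uses a plain-Python edit-distance implementation rather than any particular Levenshtein library:

```python
import subprocess
from pathlib import Path


def transcribe(image_path: str, output_base: str = "plaque") -> str:
    # The tesseract CLI writes its transcript to <output_base>.txt.
    # This wrapper assumes `tesseract` is on the PATH.
    subprocess.run(["tesseract", image_path, output_base], check=True)
    return Path(output_base + ".txt").read_text()


def levenshtein(a: str, b: str) -> int:
    # Classic two-row dynamic-programming edit distance: the minimum
    # number of single-character insertions, deletions and
    # substitutions needed to turn a into b.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ocr_error(ocr_text: str, ground_truth: str) -> float:
    # Normalised error metric: 0.0 is a perfect transcription,
    # 1.0 means roughly every character is wrong.
    if not ground_truth:
        return 0.0
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

A scoring harness would then loop over the downloaded images, call `transcribe` on each, and average `ocr_error` against the hand-made transcripts.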
Once the OpenPlaques team puts together a larger test and validation set, I’ll set up a monthly challenge. The challenge will carry a cash prize; the goal is to encourage entrants to write better recognition systems each month, until we can run an algorithm automatically against the entire OpenPlaques corpus.
If you’re interested in getting involved please join the A.I. Cookbook Google Group.
The following text will give you more detail on OCR techniques.