I’m working with the OpenPlaques folk to create a system that automatically ‘reads’ images of English Heritage plaques and extracts a transcript of the plaque’s text. This is a classic optical character recognition project. Here’s a simple example (thanks Fiery Fred):
The text is very easy for a human to read but very hard for a computer to extract. Thankfully there’s a great open-source OCR system (tesseract) that was released by HP several years back.
The goal of this project is to automatically transcribe the text from several thousand plaques (taken using different cameras and phones in varying lighting conditions at various angles and distances) so that the human sysops don’t have to do the transcription work by hand!
I’ve previously posted about my work in progress with a manual process. Now I’m building towards an automated solution, and progress is outlined in the wiki as Automatic Plaque Recognition.
Currently there’s a Python demo file which retrieves three example plaques from flickr, passes them through tesseract and then uses the Levenshtein distance as an error metric against a manually transcribed string.
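To make the error metric concrete, here is a minimal sketch of a Levenshtein distance in plain Python. It is not the code from the demo file, which isn't reproduced here. The function names `levenshtein` and `error_rate`, the sample strings, and the idea of dividing by the length of the reference transcript are all my own assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string as the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; row 0 is the cost of inserting j chars.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # deleting the first i chars of a to reach ""
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(ocr_text: str, truth: str) -> float:
    """Hypothetical helper: edit distance scaled by the length of the
    manually transcribed string (an assumption, not the demo's metric)."""
    if not truth:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, truth) / len(truth)


if __name__ == "__main__":
    # Made-up example: an OCR pass that misreads "I" as "1".
    truth = "CHARLES DICKENS"
    ocr_text = "CHARLES D1CKENS"
    print(levenshtein(ocr_text, truth), error_rate(ocr_text, truth))
```

A distance of 0 means the OCR output matches the manual transcript exactly. Scaling by the transcript length is one way to compare plaques with different amounts of text, though the raw distance may be all the demo reports.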
Once the OpenPlaques team puts together a larger test and validation set I’ll set up a monthly challenge. The challenge will have a cash prize, and the goal is to encourage entrants to write better recognition systems each month until we can run an automatic algorithm against the entire OpenPlaques corpus.
If you’re interested in getting involved please join the A.I. Cookbook Google Group.
Next step:
The following text will give you more detail on OCR techniques.
Really?
Rion Snow et al “Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations” show that you can get 1000 labels per $1 for simple tasks, using Amazon Mechanical Turk.
For several thousand plaques, I am confident you could get highly accurate labels from crowdsourcing it, and that would cost $100.
I confess that my underlying motivation is to understand how far OCR can go on a constrained (if very real-world) problem.
Currently humans are 100% involved; I’d like to see them 1% involved.
That 1% would probably be plaque-loving crowdsourced members, but sure, Turkers would do fine if somebody put up some cash.
Thanks for the paper link, I’ll take a look. I do suspect that transcribing 20-100 words * 1000 with verification would cost more than $1000 though?
Have you come across http://kaggle.com/ before? It might be a useful venue for the competition you plan.
I’m not affiliated with them, I’ve just made a submission to a competition hosted there previously.
[...] “Currently I have a manual process which gives a human-like result (99% accuracy including spaces and punctuation errors). I’m working on an automated process: http://blog.aicookbook.com/2010/07/automatic-plaque-transcription-using-python-work-in-progress/” [...]