Optical Character Recognition and eBooks

Problems with Optical Character Recognition and eBooks

Optical Character Recognition (OCR) systems are extremely good - error rates of less than one in a hundred characters are often quoted, and some - Tesseract and Abbyy are good examples - can be much better.

Nonetheless, errors remain, and as a result proofreading is required.

It turns out that conventional writing tools - MS Word, OpenOffice/LibreOffice, LyX etc - are not particularly good at this. Firstly, they fail to recognise large numbers of words which are not in their dictionaries but which are correct in the context of the document - often names, sometimes neologisms, and sometimes variant spellings - and so class them all as errors. Each one must be considered by the proofreader - a serious workload. Secondly, having identified a spelling mistake, they are terrible at presenting a choice to correct it. This is because the noise model used to suggest corrections makes the logical assumption that there's a human at the keyboard and that the errors made will be basically in one of two classes: the human can type but can't spell (so look for common spelling errors), or the human can spell but can't type (so look for mishit keys). An OCR system makes a different kind of error: it generates words which, in some way, look like the required word, but where, for example, 'h' has been transcribed as 'li'.
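To illustrate the difference, a suggestion routine aimed at OCR output might undo likely character confusions rather than guess at mishit keys. The following is a minimal Python sketch, not the method used by any particular tool. Only the 'h' to 'li' confusion comes from the text above; the other table entries, the tiny lexicon, and the name `ocr_candidates` are assumptions made for illustration.

```python
# Hypothetical confusion table: fragment seen in the OCR output -> what the
# printed page may really have said. Only "li" -> "h" is taken from the text;
# the other entries are illustrative assumptions.
OCR_CONFUSIONS = {
    "li": ["h"],
    "rn": ["m"],
    "cl": ["d"],
    "vv": ["w"],
}


def ocr_candidates(word, lexicon):
    """Return lexicon words reachable by undoing one OCR confusion in `word`."""
    found = set()
    for wrong, rights in OCR_CONFUSIONS.items():
        start = word.find(wrong)
        while start != -1:
            for right in rights:
                # Replace this one occurrence of the misread fragment.
                candidate = word[:start] + right + word[start + len(wrong):]
                if candidate in lexicon:
                    found.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(found)


lexicon = {"the", "modern", "children"}
print(ocr_candidates("tlie", lexicon))       # ['the']
print(ocr_candidates("rnodern", lexicon))    # ['modern']
print(ocr_candidates("cliildren", lexicon))  # ['children']
```

A keyboard-based noise model would rank none of these corrections highly, since 'l' and 'i' are nowhere near 'h' on the keyboard.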

The Text Contains its Own Lexicon: Extracting a Spelling Reference in the Presence of OCR Errors.

The dissertation describes a robust method for dealing with a number of common OCR faults automatically, and shows how - given a sufficiently large text - non-dictionary words which are correct in context can be identified.
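The dissertation itself gives the actual method. As a rough illustration of why a large text helps, here is a minimal Python sketch of one plausible approach: a non-dictionary word that recurs often is probably a name or neologism, while one that appears only once or twice is more likely an OCR error. The function name `extract_lexicon`, the threshold `min_count`, and the sample text are all assumptions, and this is not claimed to be the dissertation's algorithm.

```python
import re
from collections import Counter


def extract_lexicon(text, dictionary, min_count=3):
    """Split non-dictionary words into (accepted, suspects) by frequency.

    Words seen at least `min_count` times are accepted as part of the
    text's own lexicon; rarer ones are left for the proofreader.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    unknown = {w: n for w, n in counts.items() if w not in dictionary}
    accepted = {w for w, n in unknown.items() if n >= min_count}
    suspects = {w for w, n in unknown.items() if n < min_count}
    return accepted, suspects


dictionary = {"landed", "spoke", "left", "then", "waved"}
text = "Zorg landed. Zorg spoke. Zorg left. Then Zorq waved."
accepted, suspects = extract_lexicon(text, dictionary)
print(accepted)  # {'zorg'}
print(suspects)  # {'zorq'}
```

With a short text the threshold is meaningless, which is why a sufficiently large text matters: a character name in a novel turns up hundreds of times, while a given misreading of it turns up rarely.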

In the process of writing the dissertation, a basic proofreading application was developed. This implements many of the techniques described in the dissertation, but is as yet too basic and unstable for general release. I'm in the process of tidying this up - it's a GTK+ application developed on a Linux system, but I am led to believe that it could easily be ported to Windows. I'm not a Windows expert, so the source and a Linux binary are what you will get - open sourced, naturally.

It will appear here, eventually. Meanwhile, the dissertation (pdf) is available to help you to sleep...

A note on formatting

This research arose because I needed to scan decades' worth of old science fiction magazines, grab the fiction sections, and convert them into a form suitable for eBooks - in my case, the ePub format.

ePub is quite capable of simulating the original page layout of the scanned image, but that only works if the display device is large enough to display the whole page. To be cross platform compatible, a streamed text is required. (Consider, for example, what happens if you try to display a two-column layout on a small screen.)

OCR programs go from one extreme to the other in their layout analysis. For example, at the time of this work the open source Tesseract and Cuneiform for Linux could only deal with single columns, and produced no markers for italic or bold text, nor for titles. On the other hand, Cuneiform 6.0 for Windows (no longer available) and the excellent Abbyy 8.0 for Linux can produce - in RTF format - an almost exact reproduction of the layout, fonts, and columns of the original scan. Indeed, Abbyy can produce an XML output which includes the exact position on the scan of each physical line of text.

What all lack is the ability to output a simple stream including italic and bold markers - yet this is exactly what I require.

This seems largely to be because they deal with individual lines, and don't identify a paragraph as a single entity. Equally, it's tricky to persuade Abbyy that a line might continue on the next page... However, there is a reasonably simple way: an entire magazine is scanned, and the images converted to a single multi-image TIFF file - this allows Abbyy to deal with the whole document in one hit. Tell Abbyy to output an RTF file, which can be opened with OpenOffice. Tell OpenOffice to save this as an HTML file and bingo - we suddenly have paragraphs.

My proofreader application opens this HTML and saves in the same format - with markup limited to the desired italic, bold, and three levels of headers. OpenOffice reads this and converts it to its internal ODT format; the excellent Calibre program knows how to convert that to the eBook format of your choice.
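The markup-limiting step can be sketched with nothing more than the Python standard library. This is a minimal illustration, not the code of the proofreader application (which is a GTK+ program). The whitelist is an assumption: it maps italic, bold, and three header levels to `i`, `b`, `h1`, `h2`, and `h3`, and also keeps `p` so that paragraphs survive. The names `KEEP`, `SKIP`, `MarkupFilter`, and `simplify` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

KEEP = {"p", "i", "b", "h1", "h2", "h3"}  # assumed whitelist of tags
SKIP = {"style", "script", "title"}       # drop these tags and their text


class MarkupFilter(HTMLParser):
    """Copy text through, keeping only whitelisted tags without attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif tag in KEEP:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skipping = max(0, self.skipping - 1)
        elif tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            # Entities were decoded by the parser, so re-escape the text.
            self.out.append(escape(data, quote=False))


def simplify(html_text):
    parser = MarkupFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


messy = '<p class="western"><font face="Times"><i>Amazing</i> Stories</font></p>'
print(simplify(messy))  # <p><i>Amazing</i> Stories</p>
```

A word processor's HTML export may well mark italic and bold with other tags or with style attributes, so a real filter would need to map those onto `i` and `b` rather than simply dropping them.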

It's a bit round-the-houses, but it works...

Copyright © 1995-2011 Neil Barnes - 23 June 2011