How to Measure OCR Accuracy

To improve your OCR accuracy, you first need to know how to measure it, so that you can establish a baseline against which improvements can be tested.

OCR software calculates a confidence level for each character it detects. Word and page confidence levels can be calculated from the character confidences, using algorithms either inside the OCR software or in an external customized process. The OCR software does not know whether any character is correct; it can only be more or less confident that it is, and it expresses this by assigning each character a confidence level from 0 to 9. True accuracy, i.e., whether a character is actually correct, can only be determined by an independent arbiter: a human. This can be done by proofreading articles or pages, or by manually re-keying the entire article or page and comparing the re-keyed text to the OCR output. Both methods are very time-consuming.
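As a rough illustration of the two measures described above, the following Python sketch derives word and page confidence from character confidences, and computes true character accuracy by comparing re-keyed text with OCR output. The function names are hypothetical, and the aggregation rules (minimum per word, mean per page) and the edit-distance accuracy formula are assumptions chosen for illustration, not the method of any particular OCR product.

```python
from statistics import mean


def word_confidence(char_confidences):
    """Aggregate per-character confidences (0-9) into one word score.

    Assumption: the weakest character decides the word score.
    Taking the mean instead is an equally plausible choice.
    """
    return min(char_confidences)


def page_confidence(words):
    """Average the word scores over a page or article.

    `words` is a list of words, each a list of character confidences.
    """
    return mean(word_confidence(w) for w in words)


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(rekeyed_text, ocr_text):
    """True accuracy against human re-keyed text, between 0.0 and 1.0.

    Computed as 1 - (edit distance / length of the re-keyed text),
    floored at zero for very poor OCR output.
    """
    if not rekeyed_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(rekeyed_text, ocr_text)
    return max(0.0, 1 - errors / len(rekeyed_text))
```

Note that the two measures need different inputs: confidence needs only the OCR output, whereas accuracy needs the human re-keyed text as well, which is what makes it expensive.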

OCR contractors often talk about OCR confidence levels and OCR accuracy as if they were the same thing, and in practice confidence levels are often used as a substitute for accuracy, because determining true accuracy is not feasible for large volumes of text. Only one contractor to whom we spoke suggested a good way of obtaining an accuracy figure (rather than a character confidence figure) for libraries with large-scale text projects. We thought the contractor's idea was potentially viable, and could be both resource effective and accurate, but as far as we know no one is actually using it at present. The idea is as follows:

1. Determine the actual accuracy of a sample of the OCR output by a manual method.
2. Gather the OCR confidence levels for the same sample.
3. Write an algorithm that correlates the two and provides a 'proxy' accuracy level based on both.
4. Use the algorithm to supply proxy accuracy levels at article/page level.
5. Re-check the algorithm at regular intervals, since the relationship between confidence and accuracy may differ between newspapers.
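
One possible reading of the steps above is sketched below in Python. It assumes the simplest form of correlation, a least-squares straight line fitted from mean article confidence to manually measured accuracy; the contractor did not specify an algorithm, so the linear model, the function names and the 0.05 tolerance are all hypothetical.

```python
def fit_proxy(confidences, accuracies):
    """Fit accuracy ~ slope * confidence + intercept by least squares.

    `confidences`: mean OCR confidence of each sampled article.
    `accuracies`: manually measured accuracy (0.0-1.0) of the same articles.
    """
    n = len(confidences)
    if n < 2 or n != len(accuracies):
        raise ValueError("need two or more paired observations")
    mean_c = sum(confidences) / n
    mean_a = sum(accuracies) / n
    var_c = sum((c - mean_c) ** 2 for c in confidences)
    if var_c == 0:
        raise ValueError("confidence values must not all be identical")
    cov = sum((c - mean_c) * (a - mean_a)
              for c, a in zip(confidences, accuracies))
    slope = cov / var_c
    intercept = mean_a - slope * mean_c
    return slope, intercept


def proxy_accuracy(confidence, slope, intercept):
    """Estimate accuracy for an unsampled article, clamped to 0.0-1.0."""
    return min(1.0, max(0.0, slope * confidence + intercept))


def needs_refit(confidences, accuracies, slope, intercept, tolerance=0.05):
    """Re-check the fit against a fresh manually checked sample.

    Returns True when the mean absolute error of the proxy exceeds
    `tolerance` (an assumed threshold), e.g. for a different newspaper.
    """
    errors = [abs(proxy_accuracy(c, slope, intercept) - a)
              for c, a in zip(confidences, accuracies)]
    return sum(errors) / len(errors) > tolerance
```

A straight line is only a starting point: if the manual sample shows that accuracy falls away sharply below a certain confidence level, a different model would be needed, and the sample must be large enough to show that.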

We concluded that being able to identify real or proxy accuracy at article level would be more useful to us than having a figure for the entire newspaper corpus (which was too broad), the individual title (since most titles changed in layout and quality over time), or the entire page. Being able to identify accuracy at article level would allow us to measure increases in accuracy, now or in the future.