Over the years I’ve scanned and OCR’ed many printed books into electronic form for Gutenberg Australia (most of the Edgar Wallace collection there is my work, for instance), and during that time it’s become clear that not all typos are equal. After a while, in fact, it became possible for me to divide typos into categories, as follows:

Category 1: Typos due to English orthography

Some letter sequences in English serif text happen to resemble others. The sequence ‘of her’, for instance, looks very much like ‘other’, and ‘thing’ looks very much like ‘tiling’. Every second or third book I scanned had these mistakes in it somewhere.

Category 2: Typos due to the publishers’ choice of font

These used to arise when I was scanning a set of books in the same series. Some font/size combinations happened to trick the OCR software in consistent ways. Many series had very narrow risers on the ‘h’, for instance, making it easy to mistake it for an ‘n’. Extra spaces were also very common.

Category 3: Typos in recurring proper names

Many of the books I scanned belong to series featuring ongoing characters and ongoing locations. If the name ‘Winstanley’ was misread ‘Wihstanleg’ in the first of these, I could be reasonably sure that the OCR would get that name wrong in the same manner in the rest of the book—and the other books, too.
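
A rough way to hunt for these is to list capitalised words that closely resemble, but do not equal, a name you already know is right. The sketch below is only an illustration: the function name `name_variants` and the 0.75 similarity threshold are my own assumptions, and the standard-library `difflib` comparison is just one possible similarity measure.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def name_variants(text, known_names, threshold=0.75):
    """Map each suspicious capitalised word to (probable name, occurrence count)."""
    # Count every capitalised word once, so each is compared only one time.
    words = Counter(re.findall(r"\b[A-Z][a-z]+\b", text))
    variants = {}
    for word, count in words.items():
        for name in known_names:
            if word == name:
                continue
            # ratio() is 1.0 for identical strings and falls towards 0.0.
            if SequenceMatcher(None, word, name).ratio() >= threshold:
                variants[word] = (name, count)
    return variants


sample = "Wihstanleg nodded. Winstanley smiled. Then Wihstanleg left."
print(name_variants(sample, ["Winstanley"]))
```

Anything it reports still needs a human decision, since two genuinely different names can be close in spelling.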

Category 4: Words that just don’t belong

If I come across the word ‘modem’ in the OCR’ed scan of a book from 1935, for instance, I know I can change it to ‘modern’ without a second thought. The same applies to apparent U.S. spellings in books published in the UK, and vice versa.
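
As a minimal sketch of that kind of fix, assuming the text is available as a plain Python string: the rule table `ANACHRONISMS` and the function `fix_anachronisms` are hypothetical names, and the table would need to be built for the period and country of each batch of books.

```python
import re

# Hypothetical rule table: words that cannot belong in a 1935 novel.
ANACHRONISMS = {"modem": "modern"}


def fix_anachronisms(text, rules=ANACHRONISMS):
    """Replace whole-word matches, keeping the capitalisation of the original."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in rules) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        fixed = rules[word.lower()]
        if word.isupper():
            return fixed.upper()
        if word[0].isupper():
            return fixed.capitalize()
        return fixed

    return pattern.sub(swap, text)


print(fix_anachronisms("Modem furniture filled the modem flat."))
```

The `\b` word boundaries keep the rule from touching longer words that merely contain ‘modem’.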

Category 5: Typos that a spelling check will pick up

Good OCR programs have a plausibility tester built in to block words that don’t match English spelling, but this is often overridden. My final pass through the book is always done with a spelling checker, and I usually pick up a half-dozen errors.

Category 6: Someone else’s typos

There are always a few of these—errors by the author, the editor or the typesetter, which have crept into the original book.

* * *

What does this imply for efficient proofreading? It means that mindless page-by-page comparison of the original text with an OCR copy is just about the least efficient way to do it, because it requires predictable errors to be corrected over and over again, rather than through global changes. What’s more, it retains—sometimes even cherishes—the typos in Category 6, just because they happen to be typos in the original book. And yet this is the method currently used by most proofreading projects, including Distributed Proofreaders.

Here, by contrast, is what an informed proofreading approach looks like:

Step 1: OCR a cluster of books by the same author, in the same series, featuring the same characters, as far as possible. Gather these into a single word processing file. Use a master document if your software allows it.

Step 2: Run a macro to find and correct Category 1 and 4 errors. That includes changing straight quotes to smart quotes; removing characters like ‘>’ and ‘=’ that don’t occur in novels (at least in those written before the copyright period); highlighting potential error points like ‘of her’ and ‘other’; changing ‘modem’ to ‘modern’; and so on. Yes, this sometimes makes mistakes, but it fixes far more errors than it introduces, and if you highlight the changes as you make them, it’s easy to spot those rare points at which you’ve introduced an error rather than removing one.

Step 3: Start proofreading. When you find an error that doesn’t match English orthography (‘dosn’t’ for ‘doesn’t’, for instance), do a global search and replace, and highlight the replacements. It’s extremely likely that you’ll find the same error several times in the same file, though the frequency drops as the more common errors get corrected. The same applies to names of prominent places and ongoing characters like ‘Wihstanleg’. Common errors should be added to the macro from Step 2, so they’ll be found and fixed in the next batch.

Step 4: When you find an error that could be a real word—‘hell’ for ‘hello’—correct it in that one location only, but highlight the word throughout the document to make it easy to pick up similar errors where they occur.

Step 5: When you finally finish proofing the whole set of books, do a spelling check, remove the highlighting, then break them back up into their component files.
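
I do all this in a word processor, but the five steps can be sketched on plain text as well. What follows is an illustration, not my actual macro: the boundary marker `=== BOOK: ... ===`, the `[[` and `]]` stand-ins for highlighting, and all the function names are assumptions made for the example.

```python
import re

MARK = "=== BOOK: {} ==="        # hypothetical boundary line between books
HI_OPEN, HI_CLOSE = "[[", "]]"   # hypothetical plain-text "highlighting"


def combine(books):
    """Step 1: join {title: text} into one working text."""
    return "\n".join(MARK.format(t) + "\n" + body for t, body in books.items())


def replace_all(text, wrong, right):
    """Steps 2 and 3: global whole-word replacement, highlighted for review."""
    pattern = re.compile(r"\b" + re.escape(wrong) + r"\b")
    return pattern.sub(lambda m: HI_OPEN + right + HI_CLOSE, text)


def flag(text, word):
    """Step 4: highlight a real word everywhere without changing it."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return pattern.sub(lambda m: HI_OPEN + m.group(0) + HI_CLOSE, text)


def strip_highlights(text):
    """Step 5a: remove the highlighting once proofing is finished."""
    return text.replace(HI_OPEN, "").replace(HI_CLOSE, "")


def split(text):
    """Step 5b: break the working text back into {title: text}."""
    books, title, lines = {}, None, []
    for line in text.split("\n"):
        m = re.fullmatch(r"=== BOOK: (.*) ===", line)
        if m:
            if title is not None:
                books[title] = "\n".join(lines)
            title, lines = m.group(1), []
        else:
            lines.append(line)
    if title is not None:
        books[title] = "\n".join(lines)
    return books


books = {"Book A": "Wihstanleg said hell.", "Book B": "Wihstanleg again."}
work = combine(books)
work = replace_all(work, "Wihstanleg", "Winstanley")  # safe to fix everywhere
work = flag(work, "hell")                             # needs a human decision
print(work)
print(split(strip_highlights(work)))
```

The point of keeping `replace_all` and `flag` separate is the distinction between Steps 3 and 4: a non-word can be fixed everywhere at once, while a real word is only marked so that each occurrence can be judged in context.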

Seems like a lot of work? Yes, but it pays off. I estimate this approach saves me an hour of proofreading on a normal-sized novel—more on books with strange fonts or poor printing. And it’s much less stressful to fire off a whole volley of corrections at once, knowing you won’t have to deal with them again, than to meet your old friend ‘Wihstanleg’ for the eighth time in the same chapter.

This method also has the advantage that you very seldom need to refer to the source material. Sometimes, where whole lines have been omitted or garbled, you won’t have any choice, but I find I can correct 95 percent of a reasonably well-printed book without needing to refer to the original text at all.

Distributed Proofreaders regard this as anathema. I call it intelligent proofreading. You decide.

 

3 COMMENTS

  1. I’m not a professional proofreader, but I’ve done the electronic galleys of all my books before they were published. Here are my usual recommendations to other writers. They can use only one method or do several proofs using different methods each time.

    Use text to speech (all computers come with it) to have your computer read the text aloud. In the preferences, set the talking speed a bit faster than usual so you won’t lose focus.

    Change the font and text size. Make it much bigger than normal so those misplaced commas really stand out. If you begin to skim, change the size again.

    If you have an ereader, transfer your book to it.

  2. I am also just a happy amateur, but I have found great use for regular expressions when proofreading. They can test for nonsense like a capital letter immediately after a lower case one, a numeral immediately after a letter, a punctuation mark immediately before a letter, all isolated letters except ‘a’ etc. etc.

    When proofreading, I like to create a document with the scan as an image on one side and the OCR’ed text on the other, so it is straightforward to compare the OCR with the source. The problem with proofreading without the source immediately available is that errors are not always readily apparent from the OCR text; even complete lines can go missing sometimes and the text will still make sense.
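
    The regular-expression tests mentioned in this comment might look something like the following in Python. This is a sketch under assumptions: the check names and the `suspicious` function are invented for the example, and the isolated-letter test also lets ‘A’ and ‘I’ through, which goes slightly beyond the "all isolated letters except ‘a’" rule as stated.

```python
import re

CHECKS = {
    "capital after lowercase": re.compile(r"[a-z][A-Z]"),
    "digit after letter": re.compile(r"[A-Za-z][0-9]"),
    "punctuation before letter": re.compile(r"[.,;:!?][A-Za-z]"),
    # A single letter other than a, A or I, not attached to an apostrophe.
    "isolated letter": re.compile(r"(?<![\w'’])[b-zB-HJ-Z](?![\w'’])"),
}


def suspicious(text):
    """Yield (check name, line number, matched text) for every hit."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                yield name, lineno, match.group(0)


sample = "He saidHello to me.\nRoom4 was empty,but I saw a c at.\nDon't worry."
for hit in suspicious(sample):
    print(hit)
```

    Expect false alarms from abbreviations such as ‘e.g.’ and from names like ‘McDonald’; the output is a list of places to look, not a list of errors.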

  3. I’ll second Marilyn’s remark about the value of text-to-speech. By using another sense, it gets around the problem of seeing what we think is there rather than what is really there. The best way is to have the text read to you while you follow along. The most dangerous sorts of typos are those that make grammatical sense. I once found one where “now” was substituted for “not.” With practice, you can even develop an ear that picks up the difference in pause, in some text-to-speech software, between a period and a comma.

    SBT is also right about regular expression searches. I once had to clean up an OCR’ed document for Microsoft that was filled with the letter l substituted for the number 1. The two didn’t look that different, so finding them by eye would have been a pain. Instead, I did massive search-and-replace operations. That was much quicker and error-free.

    I’d add another suggestion for those who’ve got an epaper reader or a tablet. Export what you’re proofing to it. The change in appearance from your computer screen will make many typos pop out just as effectively as printing to paper and without the expense of paper.
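
    For the l-for-1 cleanup described above, one possible pattern is sketched here. It is not the search that was actually used on that document; it assumes that a lowercase ‘l’ touching a digit should have been a ‘1’, and the names `DIGIT_L` and `fix_l_for_one` are made up for the example.

```python
import re

# An 'l' directly after a digit, or an 'l' that starts a token and is
# directly followed by a digit.
DIGIT_L = re.compile(r"(?<=[0-9])l|(?<![A-Za-z])l(?=[0-9])")


def fix_l_for_one(text):
    """Replace digit-adjacent 'l' with '1', repeating until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        text = DIGIT_L.sub("1", text)
    return text


print(fix_l_for_one("In l935 he paid 2l0 pounds for 1ll items, all legal."))
```

    The loop handles runs such as ‘1ll’, where the second ‘l’ only becomes digit-adjacent after the first has been fixed. A text that writes ‘4l’ for four litres would be changed wrongly, so review the replacements.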
