Google Drive on-the-fly OCR takes old ebook scanning to new highs

May 11, 2015


For all those pesky public domain PDFs that you may have lying around that can’t be shoehorned into more flexible ebook formats without considerable inconvenience and inaccuracy, Google may have the ideal solution. Google Drive, the default cloud archive and storage service for Google fans, now offers full Optical Character Recognition (OCR) for converting digital images or PDF documents into text, and this service has now been extended to more than 200 languages worldwide. According to Google, “images can be processed individually (.jpg, .png, and .gif files) or in multi-page PDF documents (.pdf).” And as well as flatbed scans, “photos taken with digital cameras or mobile phones” can be used.

I don’t need to enlarge on the possibilities of this, but if you have a trove of old PDFs that you’ve been itching to convert into something more usable, now could be your time. Authors, this could also be the time to start turning those old handwritten notes and jottings into something that other people can read. I complained about the lack of such a feature in Evernote a while back: now it seems that Google may have got there instead. Indeed, with the release of the Google Handwriting Input app, it looks like Google may be looking to corner this area of technology. Well worth trying out.

1 COMMENT

Steve May 11, 2015 at 12:35 pm

This could be really terrible advice.

Even the best OCR will introduce lots of character recognition errors. Expect at least a few per page! You don’t want to incur those errors unless it’s absolutely necessary.

PDF is a flexible format that incorporates everything from a series of jpg pictures of entire book pages to detailed instructions like “put the characters “the” at coordinates (127,2048) on the page”. If your PDF document is nothing but a string of page images, then OCR is your only alternative. But if the PDF already contains the actual individual characters, you should run away screaming from anyone suggesting that you do OCR.

How can you tell the difference? Using most PDF readers, try selecting some of the text for copy-and-paste. If you can’t select anything smaller than the whole page, you need OCR. If you can select individual words and characters, you probably don’t want to do OCR. Then try copying and pasting the selected text into your favorite word processor. If that yields anything remotely correct, don’t do OCR. You may get an odd character here or there because of ligatures or other special glyphs that don’t work in simple copy-and-paste, but a more sophisticated PDF converter will handle those, without introducing the inevitable errors that OCR would introduce.
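The same check can be roughly automated for a large pile of PDFs. What follows is only a minimal sketch, not a definitive test: it assumes you have already pulled whatever text each page yields (one string per page, empty when nothing is selectable), and the function name `needs_ocr` and both thresholds are hypothetical choices made for illustration.

```python
def needs_ocr(page_texts, min_chars_per_page=50, min_text_pages_ratio=0.5):
    """Guess whether a PDF is image-only, given the text extracted per page.

    page_texts: one string per page ("" or None when nothing was extractable).
    Both thresholds are arbitrary assumptions for illustration only.
    """
    if not page_texts:
        # No pages, or no extractable text at all: treat as image-only.
        return True

    # Count pages that carry a meaningful amount of real text.
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )

    # If too few pages contain text, the file is probably a stack of scans.
    return pages_with_text / len(page_texts) < min_text_pages_ratio


if __name__ == "__main__":
    scanned = ["", "", ""]                       # page images only
    born_digital = ["Chapter One. " * 20] * 3    # real characters on every page
    print(needs_ocr(scanned))
    print(needs_ocr(born_digital))
```

The per-page strings could come from any PDF library; with pypdf, for instance, something like `[page.extract_text() for page in PdfReader(path).pages]` should produce them. A result of `False` only means extractable text exists. You would still want to eyeball a sample for ligature or glyph problems before trusting it.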
