[Image: book scanner]

The official Google blog contains an announcement of a new strategy for building a more comprehensive index of the text on the web.

A substantial fraction of web documents are PDF (Portable Document Format) files that consist of a series of page images. These files do not contain text directly. Instead, they contain pictures of text, and any search engine that wants to include these documents in its search results must first perform additional processing to extract the text.

Many of the PDF files that use images for each page were not “born digital.” Often a paper book, manual, or report was converted into an electronic document by scanning each individual page and combining the results to yield a PDF file.

Product Manager Evin Levey discusses the technique Google is now using to improve its search capability:

In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document—so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found.

In essence, Google has decided that computational power is cheap enough and OCR accurate enough to justify expanding its index. One enterprising blogger is already suggesting that Google's spiders could be enlisted to perform OCR on your documents automatically.

Finally, note that some PDF files do contain text, and those documents have been included in the Google index for some time. Also, the Google Book Search project has used OCR from the beginning on the books that Google has processed itself.
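
For readers who like to see the idea in code, the difference between text-bearing and image-only PDF pages can be sketched in a few lines of Python. This is only an illustration, not Google's actual pipeline: the `Page` class, the `extract_text` function, and the `fake_ocr` stand-in are all hypothetical names invented for this example, and a real system would plug in a genuine PDF parser and OCR engine.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical model of one PDF page (an assumption for this sketch)."""
    text: str = ""      # embedded text layer; empty for a scanned page
    image: bytes = b""  # raster image of the page, if there is one


def extract_text(pages: List[Page], ocr: Callable[[bytes], str]) -> str:
    """Collect indexable text, falling back to OCR for image-only pages."""
    parts = []
    for page in pages:
        if page.text.strip():
            # The page already carries text, so no OCR is needed.
            parts.append(page.text)
        elif page.image:
            # The page is only a picture of text, so run it through OCR.
            parts.append(ocr(page.image))
    return "\n".join(parts)


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR engine: treats the image bytes as ASCII text."""
    return image.decode("ascii")


pages = [Page(text="Born-digital page"), Page(image=b"Scanned page")]
result = extract_text(pages, fake_ocr)
print(result)
```

Passing the OCR step in as a function keeps the sketch independent of any particular OCR library; the expensive step only runs for pages that have no text of their own.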

The image above shows a book scanner photographed by Ben Woosley, from the Creative Commons collection at Flickr.
