
Most of the press and commentary we’ve seen about Google’s new Ngram Viewer has been extremely positive (here’s our post from last week with links to several articles). Today, however, we came across a very interesting and well-documented blog post by Natalie Binder, a librarian and information science student at Florida State University.

“Google’s word engine isn’t ready for prime time” (by Natalie Binder, The Binder Blog)

Here are two brief paragraphs from the blog post:

The whole idea of Ngrams is built on a shaky foundation: the accuracy of Google’s optical character recognition (OCR) software. OCR is how a computer “reads.” When a paper document is scanned, it’s essentially a “dumb document.” The text is not searchable because a computer doesn’t know the difference between a printed word and an image.

[Clip]

Accurately digitizing a book requires hundreds of hours of hard work, because a human being has to go through and hand-correct these errors (see my previous article on OCR, “A breadful book“).
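To make Binder’s OCR point concrete: a scanned page is just an image of pixels until recognition software converts it into machine-readable text, and every misread character becomes a miscounted word downstream. The sketch below is our own minimal illustration of that step, assuming the open-source Tesseract engine via the pytesseract library; the file name page_scan.png is hypothetical, and none of this reflects Google’s actual pipeline.

    # Minimal OCR sketch: turn a scanned page image into searchable text.
    # Assumes Tesseract is installed and the pytesseract/Pillow packages are available.
    from PIL import Image
    import pytesseract

    # The scanned page is just pixels; nothing in it is searchable yet.
    page = Image.open("page_scan.png")  # hypothetical file name

    # OCR produces plain text, but with no guarantee of accuracy:
    # "dreadful" may come out as "breadful", and that error flows
    # straight into any word counts built on top of it.
    text = pytesseract.image_to_string(page)
    print(text)

Hand-correcting the output of that last call, book after book, is the “hundreds of hours of hard work” Binder is describing.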

She also points to the challenge of getting quality metadata when scanning documents, and links to Geoffrey Nunberg’s widely read 2009 blog post, “Google Books: A Metadata Train Wreck.”

Binder concludes:

This [metadata] is not a small problem. These types of issues are rampant in Google Books. Moreover, Google may not be able to count on crowd wisdom to fix these problems. Codicology (the study of the physical properties of books) and cataloging are highly specialized professional fields, both of which require Master’s degrees. These types of errors undermine trust in Google’s entire cataloging system. Until these issues are resolved, serious scholars of the humanities cannot approach Google Books as a trustworthy scholarly source. Unless it can somehow prove its accuracy, ngrams sinks in the same boat.

We think it’s also important to mention that in the comments section of the blog post, Binder is very clear that the issues she’s writing about are not really a problem with the Ngram Viewer itself but with the scanned material it draws on.

Don’t get me wrong; I think this is a great first step. These technical problems have to do with Google Books, not ngrams itself. If Google fixes Books, ngrams will be much better.

In a second blog post, “The problem with Google’s thin description,” Binder makes the point that even when the underlying data is correct, words with multiple meanings can still cause challenges for frequency-based analysis.
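To see why that matters for a frequency chart, consider a raw word count with no sense disambiguation: every occurrence of an ambiguous word is counted the same way regardless of what it means. The toy example below is our own illustration (the sentences are invented), not anything from Binder’s post or Google’s data.

    # Toy illustration: a raw frequency count cannot tell word senses apart.
    from collections import Counter

    sentences = [
        "She deposited the check at the bank",  # financial institution
        "They had a picnic on the river bank",  # edge of a river
        "The bank approved the loan",           # financial institution
    ]

    counts = Counter(word.lower() for s in sentences for word in s.split())
    print(counts["bank"])  # 3: every occurrence, regardless of sense, collapses into one number

An Ngram chart for a word like this mixes all of its meanings together in exactly the same way.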

See Also:  “When OCR Goes Bad: Google’s Ngram Viewer & The F-Word” (by Danny Sullivan, Search Engine Land)

Via Resource Shelf

2 COMMENTS

  1. People can find fault with anything. The whole point behind Google scanning all the books is that it didn’t intend to spend hundreds of hours per book hand-correcting the OCR. Instead, this experiment is about quantity, not quality.

    Let your common sense prevail. Does the Google search engine get everything right? Of course not, but it gets such good results, so much of the time, that it owns the search market. The same thing applies here. For every several hundred things it gets right, there is one thing wrong. In my mind, the people pointing out the flaws are nothing short of stupid. Not just ignorant, but stupid.
