That’s the title of an Ars Technica article today. The article discusses, at length, the problem with Google’s metadata and says:

Google’s counting method relies entirely on its enormous metadata collection—almost one billion records—which it winnows down by throwing out duplicates and non-book items like CDs. The result is a book count that’s arrived at by a kind of process of elimination. It’s not so much that Google starts with a fixed definition of “book” and then combs its records to identify objects with those characteristics; rather, the GBS algorithm seeks to identify everything that is clearly not a book, and to reject all those entries. It also looks for collections of records that all identify the same edition of the same book, but that are, for whatever reason (often a data entry error), listed differently in the different metadata collections that Google subscribes to.
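
In other words, the count works as a filter-then-deduplicate pipeline. Here is a minimal Python sketch of that logic; the record fields, the non-book reject list, and the normalization key are all hypothetical, since Google has not published its actual schema or matching rules:

```python
# Minimal sketch of the "count by elimination" approach described above.
# Every field name and filter rule here is hypothetical: Google has not
# published its actual metadata schema or matching algorithm.

NON_BOOK_TYPES = {"cd", "dvd", "map", "microform"}  # assumed reject list

def count_books(records):
    """Count distinct books: reject clear non-books, then collapse
    duplicate records that describe the same edition."""
    editions = set()
    for rec in records:
        # Step 1: elimination. Drop anything that is clearly not a book.
        if rec.get("type", "").lower() in NON_BOOK_TYPES:
            continue
        # Step 2: deduplication. Records from different metadata suppliers
        # may describe the same edition with small data-entry differences,
        # so collapse them onto a normalized (title, author, year) key.
        key = (
            rec.get("title", "").strip().lower(),
            rec.get("author", "").strip().lower(),
            rec.get("year"),
        )
        editions.add(key)
    return len(editions)

# Toy input: two records for the same edition plus one non-book item.
records = [
    {"type": "book", "title": "Moby-Dick", "author": "Melville", "year": 1851},
    {"type": "book", "title": "Moby-Dick ", "author": "melville", "year": 1851},
    {"type": "CD", "title": "Moby-Dick (audio)", "author": "Melville", "year": 1999},
]
print(count_books(records))  # -> 1
```

Note that step 2 is where metadata quality matters: any error in a field used for the deduplication key can split one edition into several “distinct” books, or merge genuinely different ones, which is how bad records can distort the final count.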

But the problem with Google’s count, as is clear from the GBS count post itself, is that GBS’s metadata collection is riddled with errors of every sort. Or, as linguist and GBS critic Geoff Nunberg put it last year in a blog post, Google’s metadata is “a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.”

3 COMMENTS

  1. The problem with this kind of easy critique by Jon Stokes … is that he offers no evidence whatsoever of a significant miscount – and specifically no evidence of an alternative figure. So … is it out by a million books, do we think? By ten million? How can erroneous metadata negatively influence a simple count? It certainly isn’t clear to me.
    I don’t mean to be too harsh … after all, what does it matter, give or take ten or twenty million? It’s a heck of a lot of books. What it means to me is yet another journalistic article that fails to inform on all fronts (that’s Stokes, not you, Paul).

  2. The internet is so full of actual data and facts, and then you get to commentary like this. And it is only commentary, since it provides nothing: no facts, no formula and no data.

    It’s just like how Amazon recently claimed they had 80% of the eBook market and that everyone else is lying, and again provided no data to back up what they were saying.

    You would think someone would tell them: we won’t post that claim unless you back it up with some facts and numbers.

    This is the internet, not some hard-copy rag, and we have room for that stuff, you know.

  3. A sad reflection on the quality, or rather the lack thereof, of journalism across the spectrum these days. In other articles here we have seen widespread lazy pieces loudly espousing ‘The death of …’, where you can fill in the blank, along with the wider use of blatantly provocative titles that bear no resemblance to the substance of the articles themselves. Publishing entities and web sites are clearly placing more emphasis on provoking negative responses, even outrage, and therefore web site hits from readers, rather than on good, informative and thought-provoking writing.
