While digital content is great, trying to draw conclusions based on searching that content is fraught with peril. That’s the point Sarah Zhang makes in a feature for Wired about using the Google Ngram Viewer search tool to generate statistics based on Google Books digital content.

Ngram lets you track the popularity of words or phrases from decade to decade in the content Google Books indexes. While this can be fascinating, it can also be misleading due to idiosyncrasies in the way the content is scanned and indexed. For example, the long “s” (ſ) used in books from the 18th and early 19th centuries is often mistaken for an “f” by optical character recognition (OCR) software. That leads to the amusing result of the “f-word” apparently seeing common use in literature until 1820 or so, and then disappearing until 1960.
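To see how a single misread character can skew word counts, here is a minimal Python sketch. The sample sentence and the `simulate_ocr` and `normalize` helpers are made up for illustration; this is not how Google's pipeline actually works, just a toy model of an OCR engine that reads every long “s” as an “f”.

```python
from collections import Counter

# Hypothetical sample text, typeset with the long s (ſ) the way an
# 18th-century printer might have set it. Not taken from any real book.
printed = "the ſun ſet and the ſailors ſang a ſad ſong for fun"


def simulate_ocr(text: str) -> str:
    """Toy model of an OCR engine that reads every long s as the letter f."""
    return text.replace("ſ", "f")


def normalize(text: str) -> str:
    """What a correct transcription would produce: long s becomes a plain s."""
    return text.replace("ſ", "s")


# Count whole words in each version of the text.
ocr_counts = Counter(simulate_ocr(printed).split())
true_counts = Counter(normalize(printed).split())

# Compare how often "fun" and "sun" appear under each reading.
for word in ("fun", "sun"):
    print(word, "OCR:", ocr_counts[word], "correct:", true_counts[word])
```

In this toy text the misreading doubles the count for “fun” and erases “sun” entirely, which is the same kind of distortion that inflates certain words in pre-1820 results.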

Another issue is that Google Books content isn’t necessarily a representative sample of the literature from each period in question. For example, the index holds considerably more sermons from before the 20th century and considerably more scientific papers from recent decades. And perhaps more critically, any given book only appears in the Google index once, whether it’s been read once or millions of times. The Lord of the Rings, any given edition of the Bible, and some random research paper on genetics or quantum mechanics all have equal weight in the Google Books database.

That’s not to say that Ngram isn’t useful in research—there’s a right way and many wrong ways to use any given tool. But it’s important to know your tool’s limitations before you start your research.

