The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts frequencies of any set of comma-delimited search strings using a yearly count of n-grams found in sources printed between 1500 and 2008. When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how those phrases have occurred in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years. There are around 450 million words that are readily accessible at the click of a button. Let’s look at a sample graph:
While this is all well and good, relying on Google Ngram to measure and track words over long periods of time has some snags. One expert has even described Ngram as “so beguiling, so powerful.” Here are some of the problems:
OCR stands for optical character recognition, the process by which computers take the pixels of a scanned book and convert them into text. It’s not fully accurate, and it proves especially difficult when computers are tasked with deciphering text that’s 200 years old. One example is the confusion of “s” and “f”: the lowercase “s” in older literature looks very similar to an “f”, which has resulted in mix-ups such as case versus cafe, funk versus sunk, and fame versus same. You have to be aware of these discrepancies when reading the results.
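To make the s/f confusion concrete, here is a minimal sketch that flags word pairs in a vocabulary that differ only by an “s” read as an “f”. The function names and the tiny vocabulary are hypothetical, and the rule that only a non-final “s” can be misread is a simplifying assumption about how older typefaces printed the letter; it is not a description of how Google’s OCR works.

```python
def long_s_variants(word):
    """Return every spelling obtained by replacing one non-final 's' with 'f'.

    Assumption: only a non-final 's' was printed in the f-like form,
    so the last letter of the word is never swapped.
    """
    variants = set()
    for i, ch in enumerate(word):
        if ch == "s" and i < len(word) - 1:
            variants.add(word[:i] + "f" + word[i + 1:])
    return variants


def ambiguous_pairs(vocabulary):
    """List (word, variant) pairs where both spellings are real words."""
    vocab = set(vocabulary)
    pairs = []
    for word in sorted(vocab):
        for variant in sorted(long_s_variants(word)):
            if variant in vocab:
                pairs.append((word, variant))
    return pairs


# Hypothetical vocabulary built from the examples in the text.
VOCAB = ["case", "cafe", "sunk", "funk", "same", "fame", "book"]
pairs = ambiguous_pairs(VOCAB)
for word, variant in pairs:
    print(f"{word!r} could be misread as {variant!r}")
```

Pairs like these are the ones worth double-checking, since a spike in one word may really be a misreading of the other.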
Invasion of Science Literature
The misreading of letters is minor in comparison; sometimes the text corpus gets warped in less obvious ways. Google Books’ English-language corpus is a mishmash of fiction, nonfiction, reports, and proceedings, as well as lots of scientific literature.
The changing composition of the corpus over time isn’t a new criticism; quite a few people have noticed that the pre-20th-century corpus is saturated with sermons. Psychologist Jean Twenge, who has used Google Ngram to study narcissism, notes that the growth of scientific literature is itself indicative of a societal shift. If scientific publications make up an increasingly large share of the corpus, this may cause an apparent decline in the relative frequency of non-scientific terms.
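A small worked example may help show why a shift in corpus composition alone can move the curve. The sketch below uses invented numbers: the genre names, rates, and shares are all hypothetical, and the word’s usage within each genre is deliberately held constant between the two periods.

```python
def overall_rate(rate_by_genre, share_by_genre):
    """Corpus-wide rate: sum over genres of (genre share * rate within genre)."""
    return sum(share_by_genre[g] * rate_by_genre[g] for g in share_by_genre)


# Hypothetical occurrences per million words, identical in both periods.
RATE = {"general": 40.0, "scientific": 5.0}

# Hypothetical share of the corpus taken up by each genre.
SHARES_EARLY = {"general": 0.9, "scientific": 0.1}
SHARES_LATE = {"general": 0.6, "scientific": 0.4}

early = overall_rate(RATE, SHARES_EARLY)
late = overall_rate(RATE, SHARES_LATE)

print(f"early period: {early:.1f} per million")
print(f"late period:  {late:.1f} per million")
```

Under these assumptions the word appears to decline even though nobody uses it any less within either genre; only the mix of the corpus has changed.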
Mixed Up Metadata
When scanning books, Google puts together the metadata (author, title, publication date, etc.). This is an automated process which, like OCR, means it is subject to mistakes. Examples of this were noted by University of California linguist Geoff Nunberg, who states that a search for Barack Obama restricted to years before his birth turns up 29 results. However, most of these errors have since been fixed by Google.
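The kind of sanity check behind that observation can be sketched in a few lines: flag any record that mentions a term but is dated before the term could plausibly have appeared. The record structure, titles, and function name are hypothetical; the only real-world fact used is that Barack Obama was born in 1961.

```python
def suspicious_records(records, term, earliest_year):
    """Return records that mention `term` but are dated before `earliest_year`."""
    return [
        r for r in records
        if term in r["text"] and r["year"] < earliest_year
    ]


# Hypothetical scanned-book records with possibly wrong publication dates.
RECORDS = [
    {"title": "Hypothetical report A", "year": 1899, "text": "a speech by Barack Obama"},
    {"title": "Hypothetical report B", "year": 2007, "text": "Senator Barack Obama said"},
    {"title": "Hypothetical report C", "year": 1920, "text": "nothing relevant here"},
]

flagged = suspicious_records(RECORDS, "Barack Obama", 1961)
for record in flagged:
    print(f"{record['title']} is dated {record['year']}: check its metadata")
```

A hit does not prove the date is wrong, but it is a cheap way to find records worth inspecting by hand.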
One of the catches of using Ngrams is that a book only appears once, whether it has been read once or millions of times. For instance, a mechanics paper only appears once, as does The Lord of the Rings. The two texts therefore carry equal weight, so the counts reflect what is being published rather than what people are actually reading and talking about.
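The effect of counting each book once can be illustrated by comparing two ways of tallying the same words. Everything in this sketch is hypothetical: the titles, the reader estimates, and the `word_counts` helper are invented for illustration, and readership weighting is not something the Ngram Viewer offers.

```python
from collections import Counter

# Hypothetical mini-corpus: (title, estimated readers, text).
BOOKS = [
    ("Obscure mechanics paper", 50, "torque torque friction"),
    ("Popular fantasy novel", 1_000_000, "ring ring wizard"),
]


def word_counts(books, weight_by_readers=False):
    """Count words, weighting each book by 1 or by its estimated readers."""
    counts = Counter()
    for _title, readers, text in books:
        weight = readers if weight_by_readers else 1
        for word in text.split():
            counts[word] += weight
    return counts


published = word_counts(BOOKS)                       # one vote per book
read = word_counts(BOOKS, weight_by_readers=True)    # one vote per reader

print("per book:  ", published["torque"], "vs", published["ring"])
print("per reader:", read["torque"], "vs", read["ring"])
```

Counted per book, the two words look equally common; counted per reader, one dwarfs the other, which is the gap the paragraph above describes.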
Overall, Google Ngram is an extremely powerful tool that 10 years ago seemed to belong to the very distant future. Now it is so simple to use that it often leads to overuse and misuse. The field has arrived at a backlash. Now we just have to wait for the backlash to the backlash.