In my last post I discussed Google’s Ngram Viewer and the related paper in Science, “Quantitative Analysis of Culture Using Millions of Digitized Books.” As I said, the authors of the paper estimated the number of words in English to be 1,022,000 in 2000, but the Oxford English Dictionary (OED), which covers obsolete words as well as current ones, lists only something over 600,000. How could the OED miss so much?
Ah, but what is a word? As Supreme Court Justice Potter Stewart wrote of hard-core pornography, “I know it when I see it.” The concept of a word is just as intuitive, but probably more difficult to define. Try it.
The reason for the discrepancy becomes clear when you compare what the authors of the Science article used to count as a “word” with what we instinctively understand to be a word. There’s a reason it’s called “Ngram Viewer” and not “Word Viewer.”
As the authors explain, their “corpus” (collection of books) was much too large to be read by humans, so they used machines to scan and analyze it. Rather than words, the machines counted “1-grams,” or strings of characters uninterrupted by a space. The authors took random samples from different periods to estimate how many “1-grams” consisted of numbers rather than letters and how many were typos. They then assumed the remaining strings were words.
No wonder they found so many more words than the OED. They count “cat” and “cats” as two different words; “go,” “goes,” “going,” “gone” and “went” as five. No fair.
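To make the difference concrete, here is a rough sketch in Python of the two ways of counting. It is not the authors' actual procedure: the letters-only filter is my own crude stand-in for their sampling of numbers and typos, and the `LEMMAS` table is a tiny hand-made lookup invented for this example.

```python
def one_grams(text):
    """Split on whitespace: each unbroken run of characters is a 1-gram."""
    return text.split()


def looks_like_word(token):
    """Crude stand-in filter (an assumption): keep letters-only tokens."""
    return token.isalpha()


# Hypothetical, hand-made table mapping inflected forms to a headword.
# A real dictionary does this with editorial judgment, not a lookup.
LEMMAS = {
    "cats": "cat",
    "goes": "go",
    "going": "go",
    "gone": "go",
    "went": "go",
}

sample = "cat cats go goes going gone went 1984 2000"
tokens = [t.lower() for t in one_grams(sample)]

# Counting the machine's way: every distinct letter string is a "word".
distinct_strings = {t for t in tokens if looks_like_word(t)}

# Counting the dictionary's way: inflected forms fold into one headword.
distinct_lemmas = {LEMMAS.get(t, t) for t in distinct_strings}

print(len(distinct_strings))  # 7
print(len(distinct_lemmas))   # 2
```

Seven “words” by the machine's count, two by the dictionary's: the same text, counted under two definitions of a word.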