Ngram Viewer and “lexical dark matter”

Google’s new Ngram Viewer, http://ngrams.googlelabs.com/, lets you search for words, phrases and proper names in millions of books published between the 1500s and 2000, and instantly charts their frequency of occurrence over the decades.

Historians can see, for instance, that Washington surpassed Lincoln in number of appearances around 1928 and has held onto the lead.

PCMag.com reports:

Google has launched an online search tool for its Google Books database, essentially a Google Trends tool for poring through most of the world’s written works.

“The Ngram Viewer allows users to search “ngrams,” or word combinations, for a given corpus, or collection of books. Users can select from the entire corpus, or subsets, including the English language, Chinese, Russian, French, Spanish, English fiction, American English, or others.

Google said it includes over 500 billion words that are contained in books published between 1800 and 2000 – 361 billion of those in English. [Actually the database includes books from the 1500s on, with the number of books per decade growing as the publication dates approach the present. –Lexie]

According to a paper, published on the Science Web site and available for free [with a subscription], the corpus represents 4 percent of all the books ever printed, or 5,195,769 books culled from about 40 university libraries around the world. In total, Google has scanned 15 million books, or about 12 percent of all of mankind’s written works…

[T]he paper is an intellectual playground of sorts for a study of the English language. The authors found, for example, that dictionaries are unable to keep up with advances in the language, often failing to add a new word until it is already on the decline. And the majority of words in the English lexicon do not appear in a dictionary at all.

“[W]e estimated that 52% of the English lexicon – the majority of the words used in English books – consists of lexical ‘dark matter’ undocumented in standard references,” the authors found.

http://www.pcmag.com/article2/0,2817,2374453,00.asp
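
The two percentages in the PCMag excerpt can be cross-checked against each other. Here is a rough sketch in Python that uses only the figures quoted above; the variable names are my own, and because "4 percent" is a rounded number the result is approximate.

```python
# Figures quoted in the PCMag excerpt.
corpus_books = 5_195_769     # books in the Ngram corpus
corpus_share = 0.04          # "4 percent of all the books ever printed" (rounded)
scanned_books = 15_000_000   # books Google has scanned in total

# If the corpus is about 4% of all books, the implied total is roughly:
implied_total = corpus_books / corpus_share

# Share of that implied total that Google has scanned.
scanned_share = scanned_books / implied_total

print(f"Implied total books: about {implied_total:,.0f}")
print(f"Scanned share: about {scanned_share:.1%}")
```

The implied total comes to roughly 130 million books, and 15 million scanned books is between 11 and 12 percent of that, so the excerpt's "about 12 percent" is consistent with its other numbers.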

The Science paper says, “[W]e estimated the number of words in the English lexicon as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000. The lexicon is enjoying a period of enormous growth: the addition of ~8500 words/year has increased the size of the language by over 70% during the last fifty years.”
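
The growth claims in that passage are easy to check with a little arithmetic. Here is a minimal sketch in Python that uses only the three estimates quoted above; the function and variable names are my own, not anything from the paper.

```python
# Lexicon size estimates quoted from the Science paper.
lexicon = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}

def growth(start_year, end_year, sizes=lexicon):
    """Return (words added per year, percent growth) between two years."""
    added = sizes[end_year] - sizes[start_year]
    per_year = added / (end_year - start_year)
    percent = 100 * added / sizes[start_year]
    return per_year, percent

for start, end in [(1900, 1950), (1950, 2000)]:
    per_year, percent = growth(start, end)
    print(f"{start}-{end}: {per_year:.0f} words/year, {percent:.1f}% growth")
```

For 1950 to 2000 this gives 8,500 words a year and growth of about 71 percent, which matches the paper's "~8500 words/year" and "over 70%". For comparison, the first half of the century works out to only about 1,060 words a year.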

The Oxford English Dictionary includes only a little over 600,000 unique words, citing usages over 1,000 years. The authors of the Science article explain:

‘Part of this gap is because dictionaries often exclude proper nouns and compound words (“whalewatching”). Even accounting for these factors, we found many undocumented words, such as “aridification” (the process by which a geographic region becomes dry), “slenthem” (a musical instrument), and, appropriately, the word “deletable.”’

It’s fun to chart the rise and fall of words, phrases – and people. Take a look at Shakespeare’s popularity (at least as measured by mentions in books) over the centuries.
