Tuesday, 2 May 2006

A schmillion words

How many words are there in English?

Just around a million.
Did you know that by November 2006 there will be one million words in the English language?

That is the estimate from Paul Payack, the President of Global Language Monitor, a non-profit group that documents, analyses and tracks trends in language the world over.

Oh, too late, Paul. Oxford has already clocked a billion words.
A massive language research database responsible for bringing words such as "podcast" and "celebutante" to the pages of the Oxford dictionaries has officially hit a total of 1 billion words, researchers said Wednesday.

I'll see your billion and raise you a brazilian.

Oh, but wait. It appears those aren't separate word types. They're just tokens. In other words, they've assembled a corpus of a billion words, but most words are repeated.

Determining how many words there are in English isn't so hard from a compiling standpoint. It's easy to write a Perl script to plow (or plough) through a corpus and count individual word types. No, the problem is defining what a word is. 'Cat' and 'cats' are two words, but one lexeme, so dictionaries list them under the same heading. What about polysemous words? The mouse on my computer and the mouse my cat runs from are two separate things; should they be one word or two? What about multi-word expressions: is 'ice cream' one word or two? Does it matter if I spell it with a hyphen? Any one of these decisions will change the count and complicate the task. So when someone reports 'English has X words', they really ought to clarify what definition of 'word' they used.
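To make the token/type distinction concrete, here is a minimal sketch in Python (not Perl, and not anything Oxford or Global Language Monitor actually runs). The function names, the sample sentence, and the "strip a final s" rule are all made up for illustration; a real count would need a proper lemmatizer and a much bigger corpus. It only shows how the total moves when the definition of 'word' moves.

```python
import re
from collections import Counter


def tokenize(text, keep_hyphens=True):
    """Split text into lowercase word tokens.

    With keep_hyphens=True, 'ice-cream' is a single token;
    otherwise it is split into 'ice' and 'cream'.
    """
    pattern = r"[a-z]+(?:-[a-z]+)*" if keep_hyphens else r"[a-z]+"
    return re.findall(pattern, text.lower())


def crude_lexeme(token):
    """Hypothetical stand-in for lemmatization: drop a final 's'.

    This is deliberately naive (it would mangle 'this' into 'thi');
    it exists only to show 'cat' and 'cats' collapsing into one entry.
    """
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def count_words(text, keep_hyphens=True, merge_plurals=False):
    """Return (number of tokens, number of distinct types)."""
    tokens = tokenize(text, keep_hyphens)
    if merge_plurals:
        tokens = [crude_lexeme(t) for t in tokens]
    types = Counter(tokens)
    return len(tokens), len(types)


sample = "The cat chased the mouse. Cats like ice-cream; my cat likes ice cream."

print(count_words(sample))                      # (13, 11)
print(count_words(sample, merge_plurals=True))  # (13, 9)
print(count_words(sample, keep_hyphens=False))  # (14, 10)
print(count_words(sample, keep_hyphens=False, merge_plurals=True))  # (14, 8)
```

Same sentence, four different answers for "how many words", and that is before anyone decides whether the computer mouse and the rodent mouse count separately.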

The reason I'm bringing this up is that I know someone's going to come to me and say, "Did you know that there are a billion words in English?"

And then I'll say, "UK billion, or American billion?"


  1. I only have one word for you... Happy Birthday!!! Oops, that's two... Sterling or US standard?

  2. Oh, thank you. Or is it 'thank-you'?


