Re: Corpora: Bigram, Trigram and ???gram if N=4 ???

Mike Lake (jmlake@cogsci.uiuc.edu)
Thu, 24 Jul 1997 17:41:26 -0500 (CDT)

Lluis Padro wrote:
> In addition, I found in WordNet: billion, trillion, and *quadrillion*.

One should view these forms with care, since they were consciously
constructed in response to pressing scientific and commercial needs.
These forms first emerged between 1490 and 1700. Their history
has been fairly well documented and reconstructed, although naturally
some gaps remain. Most of the story is told in K. Menninger's (1969)
"Number Words and Number Symbols: A Cultural History of Numbers"
(reprinted by Dover in 1992) and in the OED. These number names are
probably only known to those familiar with either the Latin counting system
or one or more of its Romance descendants, or those with specific interest
in the names of very large numbers.

Current dictionaries of American and British English provide similar
names up to, e.g., _decillion_ and _vigintillion_. N.B.: these names
are ambiguous. Decillion is 10^33 in the American system but 10^60
in the British, reflecting the mid-19th century (French) change in
the meaning attached to the artificial morpheme -illion. France has
since reverted to the "a power of 10^6" meaning, but American influence
is muddying the waters (again).

The first documented uses appeared in the late 15th century in the
works of Nicolas Chuquet; Locke's (1690) _An Essay Concerning Human
Understanding_ referred to the then-dominant practice exemplified by
"million millions," which named 10^12. Both suggested naming schemes
which effectively created the artificial morpheme -illion with the
meaning 10^6, reducing Latin prefixes and cardinals to fit with the
morphophonology of English. The French arithmetists of the early 19th
century reinterpreted -illion relative to a grouping-by-threes system,
giving it the meaning 10^(3+3i) with the prefix supplying the value of
i: bi- + -illion = 10^(3+3*2) = 10^9.
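
To make the two readings of -illion concrete, here is a rough Python sketch
of the arithmetic described above: 10^(3+3i) for the grouping-by-threes
reading and (10^6)^i = 10^(6i) for the older "a power of 10^6" reading. The
names PREFIX_INDEX and illion_exponents are my own, purely for illustration,
and the prefix table is a small hand-picked subset with the prefixes already
reduced to the stems that appear in the English forms.

```python
# Hypothetical helper table: reduced Latin stems and the index i each supplies.
# Only a small illustrative subset is listed.
PREFIX_INDEX = {
    "m": 1,        # million
    "b": 2,        # billion
    "tr": 3,       # trillion
    "quadr": 4,    # quadrillion
    "dec": 10,     # decillion
    "vigint": 20,  # vigintillion
}


def short_scale_exponent(i):
    """Grouping by threes (American usage): the name denotes 10^(3 + 3i)."""
    return 3 + 3 * i


def long_scale_exponent(i):
    """A power of 10^6 (older British usage): the name denotes 10^(6i)."""
    return 6 * i


def illion_exponents(name):
    """Return (short-scale exponent, long-scale exponent) for an -illion name."""
    stem = name[: -len("illion")]
    i = PREFIX_INDEX[stem]
    return short_scale_exponent(i), long_scale_exponent(i)


for word in ("billion", "quadrillion", "decillion"):
    short, long_ = illion_exponents(word)
    print(f"{word}: 10^{short} (by threes) vs 10^{long_} (by sixes)")
```

Only "million" comes out the same under both readings; every later name
diverges, which is exactly the ambiguity noted for decillion.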

This history has several obvious parallels to the case under discussion:
the use of an artificial (or perhaps simply captured) morpheme, analogical
formation of completed forms using that morpheme, and confusion as to
what the "proper" forms are and/or what they mean. Our first and foremost
goal must be clarity in communicating methodologies. This would favor
an explicit formula which minimizes confusion: 2-gram, 3-gram, 19-gram,
etc. Who wants to type "undevigintigram" more than once, anyway?

Mike

-- 
J. Michael Lake                                       jmlake@cogsci.uiuc.edu