Re: Corpora: Bigram, Trigram and ???gram if N=4 ???

Peter R. Burton (burto009@maroon.tc.umn.edu)
Thu, 24 Jul 97 10:00:00 -0500

> 4-gram is far preferable to invented words. It has the not
> inconsiderable advantage that everyone understands it and uses it already.
>
> The logic of fancy latinate forms in English resides in the depths of
> British class structure as promulgated through the study of classics
> at public (ie private) schools. It suits the ruling classes to baffle
> their underlings by using words which require knowledge of dead
> languages to understand. The translation of the bible into the "vulgar
> tongues" (eg from Latin/Greek into the languages people spoke) was a
> body-blow to feudalism but the work isn't finished yet.
>
> You'll also hit a problem with 5-gram which would presumably be
> 'pentagram' - phonetically fine, but unfortunately it already has a
> meaning: "a five-pointed star used as a magic sign".
>
>
> Adam Kilgarriff
>

Each language has a flow to it that fluent speakers immediately sense without
the need for reflection. Those who would introduce new words to a common
language would therefore do well to choose ones that sound like they fit the
natural flow of the language.

The English language has many words in common use that are derived from other
more ancient languages. Two of these languages are Greek and Latin. The
latter has had several different periods of strong influence on the development
of English vocabulary. Consequently anglicized spellings of Latin or Greek
words are often adopted for English use - they tend to fit the flow of English
language because there are already so many similarly structured and sounding
words in English. The same holds for words from other ancestral influences on
English - e.g. words of the Germanic, Scandinavian and Celtic languages. For
similar reasons words borrowed from the more obvious modern descendants of
Latin (like French, Italian and Spanish) are also adopted easily by English.

In the very recent past many (usually commercial) spelling concoctions have been
engineered that do not simply adapt words from the ancestral traditions of
English. Some of these words will no doubt stay with English for a long time
but most will not. I suspect n-gram could stay because of its generalized
mathematical sense. But, perhaps ironically, I recommend using a combination of
<gram> with a normal sounding English prefix, whether it comes from Greek, Latin
or some other ancestral relative of English that is likely to be easily
understood.

So I suggest that how a word sounds is more significant than addressing social
class problems. Because of the mixed heritage of English it may not matter
whether <gram> is of Greek ancestry and its prefix of Latin. Multiple meanings
for English words do not usually pose problems for those fluent in the language,
though they definitely do for machines, so allow that real people will cope
quite easily with spellings that stand for quite different meanings in
different contexts.

Peter Burton