Re: Corpora: Bigram, Trigram and ???gram if N=4 ???

george@scs.leeds.ac.uk
Wed, 23 Jul 1997 16:31:26 +0100

[...
>I still cannot be sure which
>of the following terms is best for 4-gram (n-gram with n=4):
>
> quadgram
> quadragram
> quadrugram
> ???
>

...
In short, Quadrigram would be the most agreeable form. Another argument:
Bi-, Tri-,... next should be Quadri- (it's not Bu-, Tru-,...). Tetragram is
out, because Tetra- is Greek, Bi- and Tri- are Latin.
...]

The prefix tri- is definitely Greek (as well as Latin?), but anyone who wanted
to use Greek terminology consistently would write 'digram' instead of
'bigram' and 'monogram' instead of 'unigram'. Indeed, in older books and
papers on information theory the term 'digram' is much more common than
'bigram' (see, for example, Shannon's papers). Today, however, using a term
such as 'digram' in a field like speech recognition might suggest that the
writer is old-fashioned or an outsider.

Personally, being Greek, I would very much like to support the idea of using
the Greek prefixes in such terminology (e.g. tetragrams, pentagrams, hexagrams
etc.). But sometimes authors have to think about the wider audience a
paper is addressed to and the possible variations in language and cultural
background. As far as I remember, a number of papers at recent ARPA and ICASSP
conferences have adopted more 'universal' terms such as 4-grams or four-grams,
which are also perfectly acceptable and understandable to most readers
(although perhaps not as elegant as tetragrams or quadgrams).

============================================================================
George C. Demetriou
Centre for Computer Analysis of Language And Speech (CCALAS)
& Artificial Intelligence Division, School of Computer Studies

Leeds University, Leeds LS2 9JT, United Kingdom
phone: +44 1132 336827
FAX: +44 1132 335468
Email: george@scs.leeds.ac.uk
============================================================================