Corpora: Morphology and Word Length (was: Relative text length)

From: Mike Maxwell (maxwell@ldc.upenn.edu)
Date: Fri Apr 26 2002 - 15:36:55 MET DST

    Damlon Davison writes:
    >It may be obvious, but agglutinating languages
    >tend to have longer words

    --or at least the _average_ length of words in agglutinating languages tends
    to be longer, which presumably is what is meant here. Languages like
    English that have substantial derivational morphology can have some long
    words, but a glance at a text in an agglutinating language like Quechua will
    show the difference in average length.
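    A minimal sketch of the measurement being discussed, assuming that a
    "word" is simply a maximal run of letters (an assumption; real
    tokenization is messier). The function name mean_word_length is
    hypothetical, and the sample string is a toy, not corpus data.

```python
import re

def mean_word_length(text):
    """Mean length in characters of the letter-only tokens in text."""
    # [^\W\d_]+ matches runs of Unicode letters (word chars minus digits and "_").
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

# Toy usage: three words of three letters each.
print(mean_word_length("the cat sat"))  # 3.0
```

    Run over comparable texts in two languages, this gives the average (not
    the maximum) word length, which is the quantity at issue.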

    I suspect polysynthetic languages also have long words, but whether
    that's true on the average, or only of some words (verbs with incorporated
    nouns, say), I don't know. I've never looked at an extended text in such a
    language. And of course compounding can create long words (look at a German
    text), and perhaps reduplication in languages that use whole-word
    reduplication.

    I suspect that another influence on word length is the phonology: languages
    with large phoneme inventories tend to have shorter words. Does anyone have
    data on this? E.g. languages with large numbers of consonants (the Caucasus
    region?), or languages with lots of tones (some Chinese languages, in
    Romanized scripts of course, or the Chinantec languages of Mexico), as
    opposed to languages like Hawaiian, which is notorious for a small phoneme
    inventory (around 13, as I recall) and long words.
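    One way to see why inventory size might matter is a back-of-envelope
    count: with N phonemes there are at most N**L distinct strings of length
    L. The sketch below finds the smallest L that can keep a given number of
    forms distinct. It ignores phonotactics and frequency entirely, and the
    figures 40 and 10000 are illustrative assumptions, not data; only the 13
    comes from the paragraph above.

```python
def min_word_length(inventory_size, vocabulary_size):
    """Smallest L such that inventory_size ** L >= vocabulary_size."""
    if inventory_size < 2:
        raise ValueError("need at least two phonemes")
    length, capacity = 0, 1
    while capacity < vocabulary_size:
        capacity *= inventory_size  # one more segment multiplies the options
        length += 1
    return length

# Hypothetical comparison: 13 phonemes vs. an assumed 40, for 10000 forms.
print(min_word_length(13, 10000))  # 4
print(min_word_length(40, 10000))  # 3
```

    This is only an upper bound on what the phonology allows, so it suggests
    a direction for the effect rather than predicting actual word lengths.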

    Since there are at least two factors related to word length (morphology and
    phonology), and several different factors within morphology, I wonder
    whether anyone has experimented with automatic classification of
    morphological type. We're having a workshop at the ACL this summer on
    morphology learning, but it ought to be possible to get a rough idea of how many
    affixes there are without learning the "entire" morphology. Perhaps just
    seeing how compressible a text is would give you some idea, or turning it
    into a minimized FSA.
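    Both ideas can be sketched roughly as follows, under stated assumptions.
    Compressibility is measured with zlib as a stand-in for "a compressor",
    and the minimized FSA is the minimal acyclic automaton for a word list,
    obtained by merging identical subtrees of a trie. The function names and
    the toy word list are hypothetical. Neither number is a validated measure
    of morphological type; many other properties of a text affect both.

```python
import zlib

def compression_ratio(text):
    """Compressed size divided by raw size (lower = more redundancy)."""
    data = text.encode("utf-8")
    if not data:
        return 1.0
    return len(zlib.compress(data, 9)) / len(data)

def build_trie(words):
    """Each node is a list [is_final, {char: child_node}]."""
    root = [False, {}]
    for word in words:
        node = root
        for ch in word:
            node = node[1].setdefault(ch, [False, {}])
        node[0] = True
    return root

def trie_states(words):
    """Number of nodes in the unminimized trie."""
    def count(node):
        return 1 + sum(count(child) for child in node[1].values())
    return count(build_trie(words))

def minimal_dfa_states(words):
    """Number of states after merging trie nodes with identical subtrees."""
    registry = {}  # subtree signature -> state id
    def canon(node):
        edges = tuple(sorted((ch, canon(child)) for ch, child in node[1].items()))
        return registry.setdefault((node[0], edges), len(registry))
    canon(build_trie(words))
    return len(registry)

# Toy word list: two stems sharing the same three endings.
forms = ["walk", "walks", "walked", "talk", "talks", "talked"]
print(trie_states(forms))         # 15
print(minimal_dfa_states(forms))  # 7
print(compression_ratio(" ".join(forms) * 100))
```

    The gap between the trie size and the minimized size reflects how much
    material (here, the shared stem tail and suffixes) recurs across words,
    which is the kind of signal one would hope to exploit.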

    Finally, there is a big caveat: the length of a word depends very much on
    orthographic decisions. Are clitics written solid? Compounds?

    Written German has long 'words' because the compound nouns are written
    solid. If they were written with a space between the nouns, the word length
    would become a lot shorter--not to mention how much easier it would be to
    read. I guess the original observation on this is by Mark Twain :-).
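    The size of the orthographic effect is easy to illustrate. The sketch
    below respells a text using a hand-supplied segmentation table and
    compares mean token length before and after. The table, the function
    names, and the three-word sample are all hypothetical; a real experiment
    would need a compound splitter and a corpus.

```python
def mean_token_length(text):
    """Mean length of whitespace-separated tokens."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

def respell(text, segmentation):
    """Write each listed compound as separate words; leave other tokens alone."""
    return " ".join(
        " ".join(segmentation.get(tok, [tok])) for tok in text.split()
    )

# Hand-made segmentation for one compound (an assumption for illustration).
SEGMENTATION = {"Dampfschifffahrt": ["Dampf", "Schiff", "Fahrt"]}

solid = "die Dampfschifffahrt beginnt"
spaced = respell(solid, SEGMENTATION)

print(spaced)                     # die Dampf Schiff Fahrt beginnt
print(mean_token_length(solid))   # 26 letters over 3 tokens
print(mean_token_length(spaced))  # 26 letters over 5 tokens
```

    The letters are identical in both versions; only the spacing convention
    changes, and the mean drops from 26/3 to 26/5.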

    I have even heard of a language where the linguist who designed the
    orthography decided to write a space between each morpheme, turning an
    agglutinating language into an isolating language in the orthography! (One
    wonders how the written language will look after a generation or two.)

         Mike Maxwell
         Linguistic Data Consortium
         maxwell@ldc.upenn.edu


