Re: Corpora: Diacritics and "deviant" texts in corpora

From: Doug Cooper (doug@th.net)
Date: Tue Apr 24 2001 - 10:27:27 MET DST


    At 11:15 22/4/01 -0500, Marco Antonio Esteves da Rocha <marcor@cce.ufsc.br>
    wrote:
    >I will assume that the questions I ask Doug are of general interest.
    Well, these are general problems for preparing corpora in writing
    systems that derive from old Indian scripts and that are read
    syllable by syllable, rather than in linear, left-to-right progression.

    >Now, could you be a little more precise about the meaning of a "zero-width
    >vowel character" ?
      Many Southeast Asian (and Indian) writing systems put some
    vowels over or under the consonant. A tone mark (or other diacritic)
    can go over that. A computer font gives such characters zero width,
    so they are drawn on the preceding consonant without advancing the
    cursor.

    >I assume this means that there is underlying code to make characters
    >appear on screen and paper correctly, but that this code plays havoc with
    >searches ?
      Not exactly. Because the vowel or diacritic has zero width,
    sequences like:

       consonant vowel tone-mark
       consonant tone-mark vowel
       consonant tone-mark tone-mark tone-mark tone-mark vowel

    can all have the exact same display (the repeated characters just
    overwrite each other). However, they're obviously not identical
    for searching.
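
    A minimal sketch of the search problem, assuming Thai code points as
    the example script (the choice of characters and the variable names
    are illustrative, not from the original message):

```python
# Thai code points, chosen here only as an illustration (an assumption).
CONSONANT = "\u0e01"  # THAI CHARACTER KO KAI
VOWEL = "\u0e34"      # THAI CHARACTER SARA I, written above the consonant
TONE = "\u0e48"       # THAI CHARACTER MAI EK, a tone mark

# The three sequences from the list above.
a = CONSONANT + VOWEL + TONE
b = CONSONANT + TONE + VOWEL
c = CONSONANT + TONE * 4 + VOWEL

# They may look alike on screen, but the stored code points differ,
# so a plain string comparison or substring search does not match.
print([hex(ord(ch)) for ch in a])
print([hex(ord(ch)) for ch in b])
print(a == b)   # False
print(a in c)   # False
print(len(c))   # 6
```

    Whether the three strings really render identically depends on the
    font and display software; the comparison results do not.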

      A language-aware system enforces an interchange standard
    that is usually something like this:

     - a diacritic must be used in conjunction with a consonant,
     - no more than one over- or under-vowel per consonant,
     - no more than one tone mark etc. per consonant,
     - the sequence vowel -> tone-mark is legal, but tone-mark -> vowel
       is not.

    These are the practices people generally follow when they write
    with a pencil.
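
    The four rules above can be sketched as a checker. This is only an
    illustration: the Thai character ranges are a simplified assumption,
    the function name violations() is hypothetical, and a real interchange
    standard covers more character classes than these three.

```python
# Simplified Thai character classes (assumptions for illustration only).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}             # KO KAI .. HO NOKHUK
VOWELS = {"\u0e31"} | {chr(c) for c in range(0x0E34, 0x0E3A)}    # over/under vowels
TONES = {chr(c) for c in range(0x0E48, 0x0E4C)}                  # four tone marks


def violations(text):
    """Return (index, message) pairs for each rule broken in text."""
    problems = []
    has_base = seen_vowel = seen_tone = False
    for i, ch in enumerate(text):
        if ch in CONSONANTS:
            # A new consonant starts a new cluster.
            has_base, seen_vowel, seen_tone = True, False, False
        elif ch in VOWELS:
            if not has_base:
                problems.append((i, "vowel without consonant"))
            elif seen_tone:
                problems.append((i, "vowel after tone mark"))
            elif seen_vowel:
                problems.append((i, "second vowel on one consonant"))
            seen_vowel = True
        elif ch in TONES:
            if not has_base:
                problems.append((i, "tone mark without consonant"))
            elif seen_tone:
                problems.append((i, "second tone mark on one consonant"))
            seen_tone = True
        else:
            # Any other character ends the current cluster.
            has_base = seen_vowel = seen_tone = False
    return problems


print(violations("\u0e01\u0e34\u0e48"))  # consonant vowel tone-mark: no problems
print(violations("\u0e01\u0e48\u0e34"))  # consonant tone-mark vowel: one problem
```
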
       As a rule, the input application - not the display app - enforces
    this part of the interchange standard. The issue in preparing a
    corpus is, in effect, to simulate re-input, and make the saved text
    obey the interchange standard.
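
    One way to picture "simulated re-input" is a pass that rebuilds each
    consonant cluster in the legal order. The sketch below makes several
    assumptions that a real project would have to decide for itself: the
    simplified Thai character classes, the hypothetical name normalize(),
    keeping the first vowel and first tone mark when there are duplicates,
    and silently dropping diacritics that have no consonant to attach to.

```python
# Simplified Thai character classes (assumptions for illustration only).
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}             # KO KAI .. HO NOKHUK
VOWELS = {"\u0e31"} | {chr(c) for c in range(0x0E34, 0x0E3A)}    # over/under vowels
TONES = {chr(c) for c in range(0x0E48, 0x0E4C)}                  # four tone marks


def normalize(text):
    """Rewrite text so each cluster is consonant [vowel] [tone-mark]."""
    out = []
    base = vowel = tone = None

    def flush():
        # Emit the pending cluster in the legal order, then reset.
        nonlocal base, vowel, tone
        if base is not None:
            out.append(base + (vowel or "") + (tone or ""))
        base = vowel = tone = None

    for ch in text:
        if ch in CONSONANTS:
            flush()
            base = ch
        elif ch in VOWELS:
            # Keep only the first vowel; drop it if there is no consonant.
            if base is not None and vowel is None:
                vowel = ch
        elif ch in TONES:
            # Keep only the first tone mark; drop it if there is no consonant.
            if base is not None and tone is None:
                tone = ch
        else:
            # Any other character passes through and ends the cluster.
            flush()
            out.append(ch)
    flush()
    return "".join(out)


k, v, t = "\u0e01", "\u0e34", "\u0e48"
print(normalize(k + t + v) == k + v + t)      # True
print(normalize(k + t * 4 + v) == k + v + t)  # True
```

    After such a pass, the three sequences that display alike are stored
    as one string, so a search for it finds all of them. Dropping
    characters silently is a lossy choice; logging each change would be
    safer for a corpus that others will cite.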

      --Doug


