Re: Corpora: Diacritics and "deviant" texts in corpora

From: Marco Antonio Esteves da Rocha (marcor@cce.ufsc.br)
Date: Sun Apr 22 2001 - 18:15:13 MET DST

    I will assume that the questions I ask Doug are of general interest, in
    the spirit of the corpus-linguistics principle of raw data availability
    that Jem pointed out. I hope the assumption is not misguided. Most of the
    questions probably stem from my ignorance of the languages in question
    or of computing.

    On Sun, 22 Apr 2001, Doug Cooper wrote:

    > At 22:39 21/4/01 +0100, Jem Clear wrote:
    > >I must urge Tadeusz Piotrowski **not** to standardize or
    > >normalize Polish e-mails or news agency feeds when
    > >adding them to his corpus.
    >
    > I agree entirely with Jem's point, with an exception related
    > to zero-width characters (diacritics, vowels, etc.) in
    > Southeast Asian languages like Thai. I don't know if
    > you have this problem in Europe.
    >

    Now, could you be a little more precise about what a "zero-width vowel
    character" is? I think I understand a "zero-width diacritic character"
    because, if I got you right, that is what happens in Portuguese and
    Spanish, but a zero-width vowel? Is it something like a vowel character
    that signals length? Or perhaps some form of composite vowel sound that
    is not a diphthong?

    > Around here, the entry application enforces a local interchange
    > standard on the order of such characters (usually it's 'store
    > as normally hand-written'; eg. vowels before tone-marks, and
    > no more than one of each).
    >

    That means you may have a sequence of characters such as:

    1. first a character that stands for a vowel
    2. then a character that signals vowel length
    3. then a character that marks tone

    Is that the order used in normally hand-written text?
      
    > However, because the characters are zero-width, an input
    > application that isn't aware -- which can result from using
    > a keyboard manager with the standard international version
    > of the OS -- will permit both wrong orders, and multiple,
    > overwritten characters. These appear correct on screen or
    > paper, but can't be searched properly.
    >

    I assume this means that there is underlying code that makes the
    characters appear correctly on screen and on paper, but that the same
    code plays havoc with searches? Does "correct" mean "normally
    hand-written"? I have trouble following your reasoning here, possibly
    out of ignorance, but could you clarify the difference between "correct"
    and "normally hand-written"?
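
    If I try to picture the problem, it might look like the sketch below.
    This is my own guess, not Doug's code: it assumes the text is stored as
    Unicode code points, and the three Thai characters (KO KAI, SARA I,
    MAI EK) are only an example. The two strings differ solely in the order
    of the two marks written above the consonant, so they may look the same
    when displayed, yet a plain substring search finds only one of them.

```python
# Illustration only: assumes Unicode storage; the characters are an arbitrary example.
KO_KAI = "\u0e01"   # THAI CHARACTER KO KAI (a consonant)
SARA_I = "\u0e34"   # THAI CHARACTER SARA I (vowel written above the consonant)
MAI_EK = "\u0e48"   # THAI CHARACTER MAI EK (tone mark written above the vowel)

as_handwritten = KO_KAI + SARA_I + MAI_EK   # vowel before tone mark
misordered = KO_KAI + MAI_EK + SARA_I       # tone mark typed first

query = KO_KAI + SARA_I + MAI_EK            # what a user would type to search

found_in_handwritten = query in as_handwritten
found_in_misordered = query in misordered

print(found_in_handwritten, found_in_misordered)
```

    Both strings contain the same three characters, so the difference lies
    only in the stored order and not in what is drawn.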
       
    > This is less of a headache for Thai, which has had an
    > interchange standard for some time, than for Lao, Khmer,
    > Burmese, etc. In any case, I clean up this kind of stuff
    > (ie. multiple or misordered diacritics) in corpus building,
    > but no more than to the point of making what the user
    > can search match what he or she sees.
    >

    Do you mean you clean up multiple or misordered diacritics that
    nevertheless appear correct on screen and on paper? I'm afraid I'm lost
    here. As this may be of little interest to other members of the list,
    could you point me to a site where this is explained in English or
    French?
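
    My guess at the kind of clean-up you describe is sketched below, to be
    corrected if I have it wrong. The function name clean_marks and the two
    character sets are my own invention for illustration; the sets are a
    small, incomplete selection of Thai marks, and the rule applied (vowel
    marks before tone marks, exact repeats dropped) is only my reading of
    "store as normally hand-written".

```python
# Hypothetical clean-up sketch; the character sets are incomplete and
# chosen only for illustration.
VOWEL_MARKS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")  # vowels above/below
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")                      # MAI EK to MAI CHATTAWA


def clean_marks(text):
    """Put vowel marks before tone marks and drop exact repeats in each run."""
    out = []
    run = []  # consecutive zero-width marks following a base character

    def flush():
        vowels, tones = [], []
        for ch in run:
            bucket = vowels if ch in VOWEL_MARKS else tones
            if ch not in bucket:      # skip a mark already seen in this run
                bucket.append(ch)
        out.extend(vowels + tones)
        run.clear()

    for ch in text:
        if ch in VOWEL_MARKS or ch in TONE_MARKS:
            run.append(ch)
        else:
            flush()                   # a base character ends the current run
            out.append(ch)
    flush()
    return "".join(out)


wanted = "\u0e01\u0e34\u0e48"                       # consonant, vowel, tone
fixed_order = clean_marks("\u0e01\u0e48\u0e34")     # tone typed before vowel
fixed_repeat = clean_marks("\u0e01\u0e34\u0e34\u0e48\u0e48")  # marks typed twice
```

    After such a pass, the stored text and what the user sees and types
    would agree, which I take to be the point of stopping there.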

    Marco Rocha


