Re: Corpora: Diacritics and "deviant" texts in corpora

From: Doug Cooper (doug@th.net)
Date: Sun Apr 22 2001 - 14:07:10 MET DST


    At 22:39 21/4/01 +0100, Jem Clear wrote:
    >I must urge Tadeusz Piotrowski **not** to standardize or
    >normalize Polish e-mails or news agency feeds when
    >adding them to his corpus.

    I agree entirely with Jem's point, with an exception related
    to zero-width characters (diacritics, vowels, etc.) in
    Southeast Asian languages like Thai. I don't know if
    you have this problem in Europe.

      Around here, the entry application enforces a local interchange
    standard on the order of such characters (usually it's 'store
    as normally hand-written'; e.g. vowels before tone marks, and
    no more than one of each).

      However, because the characters are zero-width, an input
    application that isn't aware of the standard -- which can
    happen when using a keyboard manager with the standard
    international version of the OS -- will permit both wrong
    orders and multiple, overwritten characters. These appear
    correct on screen or on paper, but can't be searched properly.
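
      To make the problem concrete, here is a minimal sketch of how
    such sequences might be flagged in Thai text. It is an
    illustration only, not the tool I use: the function name
    find_suspect_runs is hypothetical, and the character ranges are
    an assumed, simplified inventory (above/below vowels U+0E31 and
    U+0E34..U+0E3A, tone marks U+0E48..U+0E4B) that ignores other
    Thai signs and says nothing about Lao, Khmer or Burmese.

```python
import re

# Assumption: a simplified inventory of Thai zero-width marks.
THAI_VOWELS = "\u0e31\u0e34-\u0e3a"   # above/below vowels
THAI_TONES = "\u0e48-\u0e4b"          # tone marks (mai ek .. mai chattawa)

MARK_RUN = re.compile(f"[{THAI_VOWELS}{THAI_TONES}]+")
VOWEL = re.compile(f"[{THAI_VOWELS}]")
TONE = re.compile(f"[{THAI_TONES}]")
TONE_THEN_VOWEL = re.compile(f"[{THAI_TONES}][{THAI_VOWELS}]")


def find_suspect_runs(text):
    """Return (offset, run) pairs for runs of marks that break the
    'vowel before tone mark, at most one of each' convention."""
    suspects = []
    for m in MARK_RUN.finditer(text):
        run = m.group()
        too_many = len(VOWEL.findall(run)) > 1 or len(TONE.findall(run)) > 1
        misordered = TONE_THEN_VOWEL.search(run) is not None
        if too_many or misordered:
            suspects.append((m.start(), run))
    return suspects


# Consonant THO THAHAN + SARA II + MAI EK, stored in the expected order.
well_formed = "\u0e17\u0e35\u0e48"
# Same marks typed tone-first: looks the same on screen, differs in storage.
tone_first = "\u0e17\u0e48\u0e35"

print(find_suspect_runs(well_formed))
print(find_suspect_runs(tone_first))
```

      The two sample strings render alike, but a plain substring
    search for one will not find the other, which is the whole
    difficulty.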
      
      This is less of a headache for Thai, which has had an
    interchange standard for some time, than for Lao, Khmer,
    Burmese, etc. In any case, I clean up this kind of thing
    (i.e. multiple or misordered diacritics) in corpus building,
    but only to the point of making what the user can search
    match what he or she sees.
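
      A cleanup pass of this limited kind could be sketched as
    follows. Again this is an assumption-laden illustration rather
    than my actual procedure: clean_marks is a hypothetical name,
    the Thai ranges are the same simplified ones (vowels U+0E31,
    U+0E34..U+0E3A; tone marks U+0E48..U+0E4B), and I assume that
    an exact repeat of a mark within one run is an invisible
    overstrike that is safe to drop. Runs containing two different
    vowels or two different tone marks are left alone, since
    collapsing them would change what the text says.

```python
import re

# Assumption: simplified Thai ranges; other scripts need their own tables.
VOWELS = "\u0e31\u0e34-\u0e3a"   # above/below vowels
TONES = "\u0e48-\u0e4b"          # tone marks
MARK_RUN = re.compile(f"[{VOWELS}{TONES}]+")
IS_TONE = re.compile(f"[{TONES}]")


def _fix_run(match):
    """Drop exact repeats in one run of marks, then put vowels first."""
    kept = []
    for ch in match.group():
        if ch not in kept:          # exact duplicates only
            kept.append(ch)
    # Stable sort: vowels (key False) move ahead of tone marks (key True);
    # marks of the same class keep their original relative order.
    kept.sort(key=lambda ch: IS_TONE.match(ch) is not None)
    return "".join(kept)


def clean_marks(text):
    """Normalize mark order and remove overstruck duplicates."""
    return MARK_RUN.sub(_fix_run, text)


tone_first = "\u0e17\u0e48\u0e35"              # consonant, tone, vowel
doubled = "\u0e17\u0e35\u0e35\u0e48\u0e48"     # each mark typed twice
expected = "\u0e17\u0e35\u0e48"                # consonant, vowel, tone

print(clean_marks(tone_first) == expected)
print(clean_marks(doubled) == expected)
```

      Text that is already in the expected order passes through
    unchanged, so the pass can be run over a whole corpus without
    touching well-formed material.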

      -- Doug Cooper


