Re: Corpora: Diacritics and "deviant" texts in corpora

From: Doug Cooper (doug@th.net)
Date: Tue Apr 24 2001 - 10:27:27 MET DST

Next message: Chantal Perez Hernandez: "Corpora: Corpus software for Unix/Linux Alpha processors"

Previous message: MOCKBA: "Corpora: Corpora abbr list"
Maybe in reply to: Jem Clear: "Corpora: Diacritics and "deviant" texts in corpora"

At 11:15 22/4/01 -0500, Marco Antonio Esteves da Rocha <marcor@cce.ufsc.br>
wrote:
>I will assume that the questions I ask Doug are of general interest.
Well, these are general problems for preparing corpora in writing
systems that derive from old Indian scripts, and which are read
as syllables, rather than in linear, left-to-right progression.

>Now, could you be a little more precise about the meaning of a "zero-width
>vowel character" ?
Many Southeast Asian (and Indian) writing systems put some
vowels over or under the consonant. A tone mark (or other diacritic)
can go over that. A computer font gives such characters zero width.

>I assume this means that there is underlying code to make characters
>appear on screen and paper correctly, but that this code plays havoc with
>searches ?
Not exactly. Because the vowel or diacritic has zero width,
sequences like:

   consonant vowel tone-mark
   consonant tone-mark vowel
   consonant tone-mark tone-mark tone-mark tone-mark vowel

can all have the exact same display (the repeated characters just
overwrite each other). However, they're obviously not identical
for searching.
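
To make the search problem concrete, here is a minimal sketch in Python. It assumes Thai code points purely for illustration (the message does not name a particular script), and the variable names are hypothetical. Whether the three strings really render alike depends on the font and rendering engine; the point is only that they are different code-point sequences.

```python
# Illustration only: Thai code points are an assumption, not from the message.
KO_KAI = "\u0E01"   # a consonant
SARA_I = "\u0E34"   # a vowel written above the consonant
MAI_EK = "\u0E48"   # a tone mark written above the vowel

well_formed = KO_KAI + SARA_I + MAI_EK        # consonant vowel tone-mark
swapped     = KO_KAI + MAI_EK + SARA_I        # consonant tone-mark vowel
repeated    = KO_KAI + MAI_EK * 4 + SARA_I    # four tone marks, then the vowel


def code_points(s):
    """Return the code points of s as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in s]


for s in (well_formed, swapped, repeated):
    print(code_points(s))

# A plain substring search for the well-formed sequence misses the variants.
corpus = swapped + " " + repeated
print(well_formed in corpus)   # False
```

A search tool that compares code points one by one will treat these as three unrelated strings, even where a reader sees the same syllable on screen.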

A language-aware system enforces an interchange standard
that is usually something like this:

- a diacritic must be used in conjunction with a consonant,
- no more than one over- or under-vowel per consonant,
- no more than one tone mark (or other diacritic) per consonant,
- the sequence vowel -> tone-mark is legal, but tone-mark -> vowel
is not.

These are the practices people generally follow when they write
with a pencil.
As a rule, the input application - not the display app - enforces
this part of the interchange standard. The issue in preparing a
corpus is, in effect, to simulate re-input, and make the saved text
obey the interchange standard.
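
The "simulate re-input" step can be sketched roughly as follows. This is a minimal illustration in Python, not the procedure used by any particular input application. It assumes a small Thai subset (the character sets, `is_consonant`, and `normalize` are all hypothetical names), and it makes two policy choices the rules above do not specify: when several vowels or tone marks follow one consonant, the first of each kind is kept, and diacritics with no preceding consonant are silently dropped. A real corpus tool would probably log such cases for review instead.

```python
# Assumed character classes (a small Thai subset, for illustration only).
ABOVE_BELOW_VOWELS = set("\u0E31\u0E34\u0E35\u0E36\u0E37\u0E38\u0E39")
TONE_MARKS = set("\u0E48\u0E49\u0E4A\u0E4B")


def is_consonant(ch):
    """True for the Thai consonants U+0E01 through U+0E2E."""
    return "\u0E01" <= ch <= "\u0E2E"


def normalize(text):
    """Rewrite text so each consonant carries at most one over/under vowel
    and at most one tone mark, always in the order vowel -> tone mark."""
    out = []
    vowel = None       # vowel pending for the current consonant
    tone = None        # tone mark pending for the current consonant
    have_base = False  # is the most recent base character a consonant?

    def flush():
        # Emit pending marks in the legal order: vowel first, then tone mark.
        nonlocal vowel, tone
        if vowel is not None:
            out.append(vowel)
        if tone is not None:
            out.append(tone)
        vowel = tone = None

    for ch in text:
        if ch in ABOVE_BELOW_VOWELS:
            # Keep only the first vowel; drop it if there is no consonant.
            if have_base and vowel is None:
                vowel = ch
        elif ch in TONE_MARKS:
            # Keep only the first tone mark; drop it if there is no consonant.
            if have_base and tone is None:
                tone = ch
        else:
            flush()
            out.append(ch)
            have_base = is_consonant(ch)
    flush()
    return "".join(out)


k, v, t = "\u0E01", "\u0E34", "\u0E48"
print(normalize(k + t + v) == k + v + t)       # True
print(normalize(k + t * 4 + v) == k + v + t)   # True
```

Running both the corpus and the search query through the same function before matching is one way to make visually identical sequences compare equal.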

--Doug


This archive was generated by hypermail 2b29 : Tue Apr 24 2001 - 10:30:35 MET DST