Re: [Corpora-List] WANTED: Thai word-segmented corpora

From: Doug Cooper (doug@th.net)
Date: Sun Aug 18 2002 - 11:15:41 MET DST

    At 14:53 17/8/02 +0200, Petr Sojka wrote:
    >I am looking for word-segmented corpora of Thai.
    >So far, I've found only Orchid corpus, but it is too small for
    >our machine language research.

    Dear Petr:

    I see (or get) queries like this often enough to motivate the
    following general comment (apologies in advance if I've
    jumped to an incorrect conclusion about your goals). Thai
    appears to attract programming interest because it uses:

      a) a non-segmented writing system that
      b) has lots of text in electronic form available, and
      c) uses nice, straightforward, one-byte encoding, and yet is
      d) so foreign that segmenting problems are not obvious ;-).

      I also see a steady stream of papers titled 'Yet Another
    Segmentor/Hyphenator/Syllabifier for Thai,' all of which
    use data sets like the Orchid Corpus for both training and
    testing, and which usually report 94-97% success.

      Folks getting into this area should be advised that:

      a) beyond the trivial cases (and despite what's taught in Thai
    grammar schools;-), there is no general agreement on how
    written Thai should be segmented into words;

      b) corpora like Orchid tend to be skewed by their developers'
    opinions on the subject, and/or to have been automatically
    generated by systems that use similar corpora for training;

      c) thus, using them as gold standards won't teach very much.
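
      A toy sketch of why points (a) through (c) matter for those
    94-97% figures. Everything in it is an assumption made for
    illustration: the English strings stand in for unspaced text, the
    two "annotators" and the system output are hypothetical, and
    boundary F1 is just one common way to score a segmenter, not
    the metric of any particular paper or corpus.

```python
def boundaries(words):
    """Return the set of character offsets at which a word ends."""
    offsets, pos = set(), 0
    for w in words:
        pos += len(w)
        offsets.add(pos)
    return offsets


def boundary_f1(predicted, gold):
    """F1 over word-end offsets for two segmentations of the same text."""
    if not predicted or not gold:
        raise ValueError("segmentations must be non-empty")
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations cover different text")
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)  # never zero: the final offset is always shared
    precision = hits / len(p)
    recall = hits / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: two annotators who disagree about compounds,
# and a system trained on annotator A's conventions.
annotator_a = ["ice", "cream", "truck", "driver"]
annotator_b = ["icecream", "truckdriver"]
system_out = ["ice", "cream", "truck", "driver"]

score_vs_a = boundary_f1(system_out, annotator_a)
score_vs_b = boundary_f1(system_out, annotator_b)
print(f"scored against A: {score_vs_a:.2f}")
print(f"scored against B: {score_vs_b:.2f}")
```

      The same output is perfect against one annotator and well
    short of that against the other, so the score mostly measures
    agreement with whoever built the corpus.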

      IMHO, linguistic research that requires segmented Thai data
    (and by implication Lao, Burmese, and Khmer) is likely to remain
    in its present rut until the focus shifts to some form of phrase
    bracketing, rather than segmentation.

      Good luck,
      Doug Cooper


