[Corpora-List] Corpus-building for minority languages

From: Kevin Patrick Scannell (scannell@slu.edu)
Date: Fri Mar 19 2004 - 17:01:12 MET


    I've developed some simple web crawling software
    that is designed to build corpora for minority
    languages quickly and inexpensively. See:

    http://borel.slu.edu/crubadan/

    Thus far it has been deployed in earnest only for Welsh
    (now approaching 50 million words) and Irish
    (15 million words). The Welsh corpus is being
    used by the lexicographers at the University of Wales
    Dictionary of the Welsh Language:

    http://www.aber.ac.uk/~gpcwww/

    Of course, the texts harvested in this way are
    not statistically representative in any sense.
    Nevertheless, they are good for lexicography and
    number-crunching for natural language processing.
    And extracting useful subsets shouldn't be hard;
    I've done some of this for the Irish corpus
    already.

    The software has proved to be quite portable
    across languages; it (very roughly) bootstraps
    the language model from some initial "seed" texts
    (or, even better, an initial word list).
    I've done some experimentation with several other
    languages: Catalan, Swahili, Maori, Faroese,
    Scottish Gaelic, Walloon, Breton, Cebuano, and Manx
    Gaelic. You can see some results on the
    status page:

    http://borel.slu.edu/crubadan/stadas.html
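
    To make the bootstrapping idea above concrete, here is a minimal
    Python sketch. It is not the crawler's actual code. It assumes the
    "language model" is just a word-frequency table built from the seed
    texts, and that a crawled page is kept when enough of its tokens
    are already known. The function names (build_model, coverage,
    bootstrap) and the 0.6 threshold are hypothetical, chosen only
    for illustration.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits and underscores are excluded.
TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_model(seed_texts):
    """Build a word-frequency table from the seed texts (or a word list)."""
    model = Counter()
    for text in seed_texts:
        model.update(tokenize(text))
    return model


def coverage(model, text):
    """Fraction of the tokens in text that the model already knows."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in model)
    return known / len(tokens)


def bootstrap(seed_texts, candidates, threshold=0.6):
    """Accept candidate documents that look like the target language.

    Each accepted document is folded back into the model, so the
    vocabulary grows as the crawl proceeds. The threshold is an
    illustrative assumption, not a tuned value.
    """
    model = build_model(seed_texts)
    accepted = []
    for doc in candidates:
        if coverage(model, doc) >= threshold:
            accepted.append(doc)
            model.update(tokenize(doc))
    return model, accepted


if __name__ == "__main__":
    seeds = ["tá an lá go maith", "an bhfuil tú go maith"]
    pages = ["tá tú go maith inniu", "the weather is good today"]
    model, kept = bootstrap(seeds, pages)
    print(kept)
```

    A plain known-word ratio like this is easily fooled by closely
    related languages (Irish and Scottish Gaelic, say), so a real
    system would need a stronger test; the sketch only shows the
    seed-then-grow loop.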

    Please send me an email if you'd be interested
    in helping develop one of these corpora or in
    trying a new language.

    -Kevin


