Re: [Corpora-List] On tools for indexing and searching large corpora

From: Pavel Rychly (pary@textforge.cz)
Date: Fri Nov 22 2002 - 01:41:43 MET

    On Tue, Nov 19, 2002 at 02:03:59PM +0300, Serge Sharoff wrote:
    > What is the technology used in the BNC and other annotated corpora of
    > similar size? Can it be applied in this case (given the need to cope with
    > possible ambiguity)? The corpus uses Win-1251 encoding, but eventually I
    > plan to convert it to Unicode. Any suggestions?

    At the NLPlab of FI MU in Brno, Czech Republic, the Manatee system is
    in regular use. We use it with corpora (including the BNC) in many
    different languages and encodings. Even the largest Czech corpus (more
    than 620 million tokens) has ambiguous lemma and grammatical
    annotation. Manatee handles both ambiguity and large corpora well.
    The system is available from www.textforge.cz

    Best
    Pavel


