[Corpora-List] On tools for indexing and searching large corpora

From: Serge Sharoff (sharoff@aha.ru)
Date: Tue Nov 19 2002 - 12:03:59 MET

    Dear all,

    I'm in the process of compiling a corpus of modern Russian comparable to the
    BNC in size and coverage. The corpus format is based on TEI, for
    example:
    <s id="nashi.535">
    ...
       <w>глава
          <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
          <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
       </w>
       <w>Владивостока
          <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
       </w>
    ...
    </s>
    In the first case, the POS tagger detects an ambiguity it cannot resolve
    between two possible readings (masc., animate, i.e. "the head of", and fem.,
    inanimate, i.e. "the chapter of"), so both analyses are kept.
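
    To make the requirement concrete, here is a minimal Python sketch of an
    inverted index over this markup that keeps every unresolved analysis. It is
    an illustration only, not the tool currently in use: the names
    index_sentence and search are hypothetical, the sample drops the elided
    ("...") parts of the sentence, and it assumes the data is well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Cut-down copy of the sentence above; the elided words are left out.
SAMPLE = """<s id="nashi.535">
   <w>глава
      <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
      <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
   </w>
   <w>Владивостока
      <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
   </w>
</s>"""


def index_sentence(sentence_xml, index):
    """Add one <s> element to a lemma -> postings index (hypothetical helper).

    Every <ana> reading of a word gets its own posting, so an ambiguous
    word is reachable through each of its analyses.
    """
    s = ET.fromstring(sentence_xml)
    sid = s.get("id")
    for position, w in enumerate(s.iter("w")):
        for ana in w.iter("ana"):
            feats = frozenset(ana.get("feats", "").split(","))
            index[ana.get("lemma")].append((sid, position, ana.get("pos"), feats))
    return index


def search(index, lemma, required_feats=()):
    """Return sorted (sentence id, word position) pairs for a lemma.

    A word matches if at least one of its readings carries all the
    required features; duplicates from multiple readings are merged.
    """
    required = set(required_feats)
    hits = {
        (sid, position)
        for sid, position, _pos, feats in index.get(lemma, ())
        if required <= feats
    }
    return sorted(hits)


index = index_sentence(SAMPLE, defaultdict(list))
print(search(index, "глава", ["жр"]))
```

    With this design a query for the feminine reading still finds the
    ambiguous token, because the index stores one posting per analysis rather
    than one per word.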

    Currently I search the corpus with custom tools written in Perl and
    based on regular expressions. As the corpus grows (currently 40
    million words), this indexing scheme is becoming too inefficient, and I'm
    reluctant to reinvent the wheel by improving it.

    What is the technology used in the BNC and other annotated corpora of
    similar size? Can it be applied in this case (given the need to cope with
    possible ambiguity)? The corpus uses Windows-1251 encoding, but eventually I
    plan to convert it to Unicode. Any suggestions?
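
    For reference, the planned conversion could look roughly like the sketch
    below. It assumes UTF-8 as the target Unicode encoding, and the function
    names are hypothetical. Python's "cp1251" codec is the standard name for
    Windows-1251; strict decoding raises an error on a byte that the code page
    leaves undefined. Any encoding declaration inside the files would need to
    be updated separately.

```python
def cp1251_to_utf8(data: bytes) -> bytes:
    """Re-encode a Windows-1251 byte string as UTF-8 (hypothetical helper)."""
    return data.decode("cp1251").encode("utf-8")


def convert_file(src_path: str, dst_path: str, chunk_chars: int = 1 << 20) -> None:
    """Stream a Windows-1251 text file into a UTF-8 copy in fixed-size chunks.

    newline="" disables newline translation, so line endings are preserved.
    """
    with open(src_path, encoding="cp1251", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```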

    Best,
    Serge
