Re: [Corpora-List] On tools for indexing and searching large corpora

From: Olonichev Sergei (olonichev@scnsoft.com)
Date: Thu Nov 21 2002 - 09:28:25 MET

    > Dear Serge,
    >
    > If you have a valid XML-encoded corpus (and, basically, if you want to
    > check if it is valid XML), regexes are not the best tool: you could

    Regexes have always been expressive for linguistic queries.

    Search speed depends on how the index is implemented. With a
    word-based index you can speed up a regexp search drastically: let the
    index pick out the texts that contain the words, then run the regexp
    over those hits only. For example, to find the construction
    " word1 .+ word2 ", the query would be:

        echo "word1 & word2" | mgquery | grep -E -i " word1 .+ word2 "
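
    The same two-step idea can be sketched in Python. This is only an
    illustration, assuming a small in-memory list of sentences in place of
    an indexed collection; build_index and search are hypothetical names,
    not part of mgquery or any corpus tool. The word index plays the role
    of the boolean query, and the regexp runs only over the candidates.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(docs, index, word1, word2):
    """Return the docs in which word1 is followed, later on, by word2."""
    # Step 1 (the index query): boolean AND over the word index.
    candidates = index.get(word1.lower(), set()) & index.get(word2.lower(), set())
    # Step 2 (the grep): apply the regexp to the candidate docs only.
    pattern = re.compile(
        r"\b" + re.escape(word1) + r"\s.+\s" + re.escape(word2) + r"\b",
        re.IGNORECASE,
    )
    return [docs[i] for i in sorted(candidates) if pattern.search(docs[i])]


if __name__ == "__main__":
    sentences = [
        "the head of the city spoke",
        "city head",
        "the head spoke",
    ]
    idx = build_index(sentences)
    for hit in search(sentences, idx, "head", "city"):
        print(hit)
```

    The regexp cost then grows with the number of candidates rather than
    with the size of the corpus, which is where the speed-up comes from.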

    [skipped]

    > Berkeley DB XML: http://www.sleepycat.com/xml/index.html
    >
    > Please let me know your choice.
    > Regards,
    >
    > Sylvain Loiseau
    >
    >
    >
    >
    > ----- Original Message -----
    > From: "Serge Sharoff" <sharoff@aha.ru>
    > To: <corpora@lists.uib.no>
    > Sent: Tuesday, November 19, 2002 12:03 PM
    > Subject: [Corpora-List] On tools for indexing and searching large corpora
    >
    >
    > > Dear all,
    > >
    > > I'm in the process of compiling a corpus of modern Russian comparable
    > > to the BNC in its size and coverage. The format of the corpus is based
    > > on TEI, for instance,
    > > <s id="nashi.535">
    > > ...
    > > <w>глава
    > > <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
    > > <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
    > > </w>
    > > <w>Владивостока
    > > <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
    > > </w>
    > > ...
    > > </s>
    > > in the first case, the POS tagger detects and cannot resolve an
    > > ambiguity between two possible readings (masc, animate, i.e. the head
    > > of, and fem., inanimate, i.e. the chapter of), so both analyses are
    > > left.
    > >
    > > Currently for searching the corpus I use custom tools written in Perl
    > > and based on regular expressions. As the corpus gets larger (currently
    > > 40 million words), the indexing scheme gets totally inefficient and
    > > I'm reluctant to reinvent the wheel by improving it.
    > >
    > > What is the technology used in the BNC and other annotated corpora of
    > > similar size? Can it be applied in this case (given the need to cope
    > > with possible ambiguity)? The corpus uses Win-1251 encoding, but
    > > eventually I plan to convert it to Unicode. Any suggestions?
    > >
    > > Best,
    > > Serge


