Re: [Corpora-List] On tools for indexing and searching large corpora

From: Serge Sharoff (sharoff@aha.ru)
Date: Sun Dec 08 2002 - 16:29:23 MET

    Dear all,

    Some time ago I sent a query to the Corpora List about tools for indexing
    and searching BNC-like corpora (of about 100 million words, MW). The reason
    for the query is that using a 100 MW corpus without a reasonably fast and
    compact indexing scheme is a nightmare. (The original query is quoted at the
    end of this message.)

    The responses I got fall into three categories:
    1. A relational database can be used. The corpus in XML is converted into a
    set of tables (the database may have tools for importing XML files, or the
    corpus can be preprocessed and imported as plain text). Queries to the
    database are written in SQL, possibly behind a more user-friendly interface;
    an example of this approach is the Spanish corpus by Mark Davies
    (http://www.corpusdelespanol.org). Another possibility is Berkeley DB XML
    (http://www.sleepycat.com/xml/), which can load XML documents and uses
    XPath as the query language (currently in alpha release).
    2. The IMS Corpus WorkBench can be used. It handles 300+ MW corpora
    successfully, though it uses its own input format (not TEI), and it is
    unclear whether and how it can handle ambiguity in annotations
    (multiple <ana> tags). This is also the software used with the Uppsala
    Corpus (http://www.sfb441.uni-tuebingen.de/b1/en/korpora.html).
    3. The new BNC indexer, which is designed to work with any tagging scheme.
    It is now in its testing phase. By design, it is aimed at handling very
    large corpora and uses SARA as the query interface
    (http://www.hcu.ox.ac.uk/SARA).
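
    As a rough illustration of the first option, the sketch below loads the
    <w>/<ana> annotations into two SQLite tables using only the Python standard
    library. The table and column names (token, analysis) and the helper
    functions are assumptions made for this example; they are not the schema
    of any of the systems mentioned above. Storing each <ana> element as its
    own row is one possible way to keep ambiguous readings searchable.

```python
import sqlite3
import xml.etree.ElementTree as ET

# One sentence in the TEI-like format shown in the original query.
SAMPLE = """<s id="nashi.535">
<w>глава
<ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
<ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
</w>
<w>Владивостока
<ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
</w>
</s>"""


def build_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token (id INTEGER PRIMARY KEY, sent_id TEXT, "
        "position INTEGER, form TEXT)"
    )
    conn.execute(
        "CREATE TABLE analysis (token_id INTEGER REFERENCES token(id), "
        "lemma TEXT, pos TEXT, feats TEXT)"
    )
    # The index on lemma is what makes lemma lookups fast on a large corpus.
    conn.execute("CREATE INDEX analysis_lemma ON analysis (lemma)")
    return conn


def load_sentence(conn, xml_text):
    """Insert one <s> element: one row per <w>, one row per <ana> reading."""
    s = ET.fromstring(xml_text)
    for position, w in enumerate(s.findall("w")):
        form = (w.text or "").strip()
        cur = conn.execute(
            "INSERT INTO token (sent_id, position, form) VALUES (?, ?, ?)",
            (s.get("id"), position, form),
        )
        for ana in w.findall("ana"):
            conn.execute(
                "INSERT INTO analysis (token_id, lemma, pos, feats) "
                "VALUES (?, ?, ?, ?)",
                (cur.lastrowid, ana.get("lemma"), ana.get("pos"), ana.get("feats")),
            )


def forms_for_lemma(conn, lemma):
    """Return (sentence id, position, word form) for tokens with this lemma."""
    rows = conn.execute(
        "SELECT DISTINCT t.sent_id, t.position, t.form FROM token t "
        "JOIN analysis a ON a.token_id = t.id "
        "WHERE a.lemma = ? ORDER BY t.sent_id, t.position",
        (lemma,),
    )
    return rows.fetchall()
```

    DISTINCT is needed in the query because an ambiguous token has several
    analysis rows with the same lemma and would otherwise be reported once per
    reading.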

    I'd prefer the third option once it is available, though the other options
    can be useful, depending on your corpus. I tested the alpha release of
    Berkeley DB XML on my 40 MW corpus. It seems to cope well with data of this
    size and with Unicode characters.
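
    To give an idea of what XPath-style queries over this annotation format
    look like, here is a minimal sketch using the limited XPath subset in
    Python's xml.etree.ElementTree. It does not use the Berkeley DB XML API;
    the function names and the sample sentence are assumptions made for
    illustration only.

```python
import xml.etree.ElementTree as ET

# One sentence in the TEI-like format shown in the original query.
SAMPLE = """<s id="nashi.535">
<w>глава
<ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
<ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
</w>
<w>Владивостока
<ana lemma="Владивосток" pos="noun" feats="мр,но,ed,рд,геог"/>
</w>
</s>""".replace("ed", "ед")


def words_with_lemma(root, lemma):
    """Word forms that have at least one <ana> reading with the given lemma.

    The predicate [@attr='value'] is part of ElementTree's XPath subset.
    The lemma is pasted into the path, so it must not contain a quote.
    """
    hits = []
    for w in root.iter("w"):
        if w.findall(f"ana[@lemma='{lemma}']"):
            hits.append((w.text or "").strip())
    return hits


def ambiguous_words(root):
    """Word forms that carry more than one <ana> reading."""
    return [
        (w.text or "").strip()
        for w in root.iter("w")
        if len(w.findall("ana")) > 1
    ]


root = ET.fromstring(SAMPLE)
print(words_with_lemma(root, "Владивосток"))
print(ambiguous_words(root))
```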

    Many thanks for responses from
    Lou Burnard <lou.burnard@computing-services.oxford.ac.uk>
    Mark Davies <mdavies@ilstu.edu>
    Arne Fitschen <fitschen@ims.uni-stuttgart.de>
    Sylvain Loiseau <sylvain@toucheraveclesyeux.com>
    Sergei Olonichev <olonichev@scnsoft.com>

    Best wishes,
    Serge

    ----- Original Message -----
    From: Serge Sharoff <sharoff@aha.ru>
    To: <corpora@lists.uib.no>
    Sent: Tuesday, November 19, 2002 2:03 PM
    Subject: [Corpora-List] On tools for indexing and searching large corpora

    > Dear all,
    >
    > I'm in the process of compiling a corpus of modern Russian comparable to
    > the BNC in its size and coverage. The format of the corpus is based on TEI,
    > for instance,
    > <s id="nashi.535">
    > ...
    > <w>глава
    > <ana lemma="глава" pos="noun" feats="мр,од,ед,им"/>
    > <ana lemma="глава" pos="noun" feats="жр,но,ед,им"/>
    > </w>
    > <w>Владивостока
    > <ana lemma="Владивосток" pos="noun" feats="мр,но,ед,рд,геог"/>
    > </w>
    > ...
    > </s>
    > In the first case, the POS tagger detects but cannot resolve an ambiguity
    > between two possible readings (masc., animate, i.e. "the head of", and fem.,
    > inanimate, i.e. "the chapter of"), so both analyses are kept.
    >
    > Currently for searching the corpus I use custom tools written in Perl and
    > based on regular expressions. As the corpus gets larger (currently 40
    > million words), the indexing scheme gets totally inefficient and I'm
    > reluctant to reinvent the wheel by improving it.
    >
    > What is the technology used in the BNC and other annotated corpora of
    > similar size? Can it be applied in this case (given the need to cope with
    > possible ambiguity)? The corpus uses Win-1251 encoding, but eventually I
    > plan to convert it to Unicode. Any suggestions?
    >
    > Best,
    > Serge


