Re: [Corpora-List] On tools for indexing and searching large corpora

From: mdavies@ilstu.edu
Date: Tue Nov 19 2002 - 13:31:19 MET


    > I'm in the process of compiling a corpus of modern Russian comparable to the
    > BNC in its size and coverage.
    > Currently for searching the corpus I use custom tools written in Perl and
    > based on regular expressions. As the corpus gets larger (currently 40
    > million words), the indexing scheme gets totally inefficient and I'm
    > reluctant to reinvent the wheel by improving it.

    I have a 100-million-word corpus of Spanish (www.corpusdelespanol.org) that
    is annotated (POS, lemma, synonyms, etc.), and searching it is fairly fast.
    Even a query like [<le> or <les> "3p IndObj" + any form of any synonym of
    <querer> "to want" + infinitive, e.g. <le prefiero decir, les querían
    saludar>] takes only about two or three seconds.

    I use relational databases in SQL Server 7.0 to achieve the results. The main
    database is composed of tens of millions of distinct n-grams with their
    associated frequencies in several sub-corpora. These are linked to other
    databases containing POS, lemma, and synonym info. The output from the
    n-gram/frequency tables is then used to search the actual, unannotated textual
    corpus itself, which is indexed only with SQL Server Full-Text Indexing.
    Anyway, because all of the tables have clustered indices, you get pretty good
    performance. The one caveat is that my approach works best with
    morphologically more complex languages like Spanish, and it would have to be
    modified for a language like English.
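
    To make the table layout above concrete, here is a minimal sketch of the
    idea, not the actual schema. It uses Python's built-in sqlite3 module as a
    stand-in for SQL Server 7.0, and SQLite "WITHOUT ROWID" tables as a rough
    analogue of clustered indices. All table names, column names, POS tags,
    and sample rows are hypothetical, and a single frequency column stands in
    for the per-sub-corpus frequencies.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with hypothetical tables and data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- One row per word form, with its lemma and POS tag (made-up tagset).
        CREATE TABLE lexicon (
            word  TEXT PRIMARY KEY,
            lemma TEXT NOT NULL,
            pos   TEXT NOT NULL
        ) WITHOUT ROWID;

        -- Synonym sets keyed by lemma; each lemma also lists itself.
        CREATE TABLE synonyms (
            lemma   TEXT NOT NULL,
            synonym TEXT NOT NULL,
            PRIMARY KEY (lemma, synonym)
        ) WITHOUT ROWID;

        -- Distinct 3-grams with a frequency. WITHOUT ROWID stores rows in
        -- primary-key order, similar in spirit to a clustered index.
        CREATE TABLE trigrams (
            w1   TEXT NOT NULL,
            w2   TEXT NOT NULL,
            w3   TEXT NOT NULL,
            freq INTEGER NOT NULL,
            PRIMARY KEY (w1, w2, w3)
        ) WITHOUT ROWID;
        """
    )
    conn.executemany(
        "INSERT INTO lexicon VALUES (?, ?, ?)",
        [
            ("le", "le", "PRON"),
            ("les", "le", "PRON"),
            ("la", "la", "DET"),
            ("casa", "casa", "NOUN"),
            ("quiero", "querer", "V_FIN"),
            ("querían", "querer", "V_FIN"),
            ("prefiero", "preferir", "V_FIN"),
            ("decir", "decir", "V_INF"),
            ("saludar", "saludar", "V_INF"),
        ],
    )
    conn.executemany(
        "INSERT INTO synonyms VALUES (?, ?)",
        [("querer", "querer"), ("querer", "preferir"), ("querer", "desear")],
    )
    conn.executemany(
        "INSERT INTO trigrams VALUES (?, ?, ?, ?)",
        [
            ("le", "quiero", "decir", 20),
            ("le", "prefiero", "decir", 12),
            ("les", "querían", "saludar", 7),
            ("les", "querían", "la", 2),   # third word is not an infinitive
            ("la", "casa", "le", 5),       # first word is not le/les
        ],
    )
    return conn


def find_matches(conn, lemma):
    """Return 3-grams: le/les + any form of any synonym of `lemma` + infinitive."""
    query = """
        SELECT t.w1, t.w2, t.w3, t.freq
        FROM trigrams AS t
        JOIN lexicon  AS l2 ON l2.word = t.w2
        JOIN synonyms AS s  ON s.synonym = l2.lemma
        JOIN lexicon  AS l3 ON l3.word = t.w3
        WHERE t.w1 IN ('le', 'les')
          AND s.lemma = ?
          AND l3.pos = 'V_INF'
        ORDER BY t.freq DESC, t.w1, t.w2, t.w3
    """
    return conn.execute(query, (lemma,)).fetchall()


if __name__ == "__main__":
    for row in find_matches(build_demo_db(), "querer"):
        print(row)
```

    In this sketch the query only touches the n-gram and lexicon tables. The
    matching word sequences it returns would then be looked up in the
    full-text index over the raw text to retrieve the actual contexts, a step
    that is not shown here.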

    > What is the technology used in the BNC and other annotated corpora of
    > similar size?

    This is a question that I've asked myself many times. I would love to see a
    book that discussed the approach taken by the BNC, the BoE, CREA, corpora based
    on the IMS Corpus Workbench (such as O Público), etc., to "look under the hood"
    and see how each of these corpora and indexing schemes is organized. As you
    mentioned, as more and more people start creating 100+ million word corpora, it
    would be a shame if they all ended up having to re-invent the wheel.

    Mark Davies
    Illinois State University
    http://mdavies.for.ilstu.edu/



