Corpora: (Robust and fast) PC concordance programs

From: Mark Davies (mdavies@ilstu.edu)
Date: Tue Jan 23 2001 - 22:45:54 MET

    I'm writing a paper in which I need to briefly comment on concordance
    programs for the PC platform, and how they handle large corpora
    (> 100,000,000 words). I'm trying to address the question of whether or not
    there is a program out there that can search a corpus of this size in a
    matter of a few seconds (perhaps 5-10 seconds or less).

    My own experience is with the following programs:

    -- WordCruncher (for DOS)
    -- TACT
    -- WordSmith
    -- MonoConc

    My experience with WordCruncher is that it is extremely fast, because it
    creates and searches an every-word index that can be used in successive
    sessions. Search time is usually one or two seconds, even for the most
    complex searches. The problem with this program is that it cannot handle
    corpora larger than about 20,000,000 words because of a limit on the number
    of "unique forms". TACT, if I recall correctly, reaches a limit at a much
    smaller size -- about 10,000,000 words or less.

    Both WordSmith and MonoConc can handle large files -- I've used both
    successfully on corpora over 100,000,000 words. The limitation of both of
    these programs is that they do not create an every-word index that can be
    searched in later sessions. This means that each has to traverse the entire
    corpus for each separate search, which takes 3-4 minutes for a 100,000,000+
    word corpus.
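
    To make the distinction concrete, here is a minimal Python sketch of the
    two approaches. It is an illustration only, not how WordCruncher,
    WordSmith, or MonoConc are actually implemented; the function names, the
    simple regex tokenizer, and the use of pickle to keep the index between
    sessions are all assumptions made for the example.

```python
import pickle
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of word characters.
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN.finditer(text)]


def build_index(tokens):
    """Every-word index: map each word form to the positions where it occurs.

    This is the one-time cost that an indexing program pays up front.
    """
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return dict(index)


def indexed_search(index, word):
    """Look the word up directly; cost does not grow with corpus size."""
    return index.get(word.lower(), [])


def scan_search(tokens, word):
    """No index: every search has to walk the entire corpus."""
    word = word.lower()
    return [pos for pos, tok in enumerate(tokens) if tok == word]


def kwic(tokens, pos, width=3):
    """Format one keyword-in-context line around position pos."""
    left = " ".join(tokens[max(0, pos - width):pos])
    right = " ".join(tokens[pos + 1:pos + 1 + width])
    return f"{left} [{tokens[pos]}] {right}".strip()


def save_index(index, path):
    """Persist the index so a later session can skip the rebuild."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh)


def load_index(path):
    """Reload an index written by save_index."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

    Both search functions return the same positions; the difference is that
    scan_search repeats the full traversal on every query, while
    indexed_search pays that cost once in build_index and can reuse the
    saved index in later sessions. A real every-word index for a corpus of
    this size would need an on-disk structure rather than an in-memory
    dictionary.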

    My question then: are there in fact any **PC-based**, **commercially
    available** programs that can search 100,000,000+ word corpora in a couple
    of seconds (or less)? I realize that there may be some custom solutions
    that researchers have created, but I need to focus here on
    commercially-available software (or shareware/freeware, if in fact such a
    program exists that meets these specs).

    Thanks in advance for your help.

    Mark Davies

    =======================================
    Mark Davies, Associate Professor, Spanish Linguistics
    Dept. of Foreign Languages, Illinois State University
    Normal, IL 61790-4300

    Voice:309/438-7975 email:mdavies@ilstu.edu
    Fax:309/438-8038 http://mdavies.for.ilstu.edu/
    =======================================


