Corpora: Corpus indexing program

From: E.S. (
Date: Sat Jun 01 2002 - 12:53:16 MET DST

    Can anyone direct me to a corpus indexing program that does fast
    searches? I have dabbled in Wordsmith and Winconcord for Windows, but
    neither builds a complete index of my entire database of text,
    approximately 2 GB, and both seem to take about 20 minutes on a Pentium
    233 for a single search.

    My database is a collection of texts on U.S. and British literature,
    history, and culture; linguistics; writing studies; history of English;
    philosophy; critical and cultural theory; psychology. The database also
    includes daily postings from about 20 listservs and several newspapers
    and journals, covering corpus linguistics, media studies, and the
    other disciplines already mentioned.

    This database serves two purposes for me and my students: a somewhat
    customized research database of scanned and WWW material, and an
    extensive searchable corpus for language research. (I believe I have a
    much better collection of texts than either the BNC or Collins on
    the Web. However, my collection consists of at least 85% professional
    American English and is not tagged, as you will see below.)

    Currently I use Asksam 4 and 5 and Adobe Acrobat 4 and 5 to search this
    2 GB database. Adobe Acrobat will accomplish any search in 15 seconds
    or less. It's great for locating information, and quite useful for
    looking at words and phrases in context, though it doesn't give any
    empirical data. But the whole database is indexed and searches are very
    fast, and the current database can always easily be transferred to some
    other system.

    I also use Asksam 4 and 5. It, too, indexes my entire database, but it
    has the advantage of being able to do more complex proximity searches,
    so any permutation is possible. The only drawbacks are its slowness
    (perhaps a minute or two on a P-233 machine for any complex search) and
    that it, too, doesn't give empirical data. At least Adobe Acrobat will
    yield a list of all files that contain instances of the queried string.
    Asksam, on the other hand, yields one file at a time.

    I would be grateful if anyone could point me toward a program that is
    a combination of the database programs I am already using and a bona
    fide corpus program.
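
    As an illustration only, the combination described above amounts
    roughly to a positional index that is built once over all texts and
    then answers three kinds of query: which files contain a word, how
    often the word occurs, and where two words occur near each other. The
    Python sketch below is an assumption about what such a tool might do,
    not a description of any existing program; every function name in it
    is hypothetical, and a real 2 GB collection would need the index
    stored on disk rather than in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(docs):
    """Build a positional index: word -> {file name -> [token positions]}.

    `docs` is a hypothetical mapping of file name to file contents.
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][name].append(pos)
    return index


def files_containing(index, word):
    """List every file that contains the word (the Acrobat-style result)."""
    return sorted(index.get(word, {}))


def frequency(index, word):
    """Count all occurrences of the word across the indexed files."""
    return sum(len(positions) for positions in index.get(word, {}).values())


def near(index, first, second, window=5):
    """Proximity search: hits where the two words lie within `window` tokens.

    Returns (file name, position of first, position of second) tuples;
    the words may appear in either order.
    """
    hits = []
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    for name in sorted(set(docs_first) & set(docs_second)):
        for p in docs_first[name]:
            for q in docs_second[name]:
                if p != q and abs(p - q) <= window:
                    hits.append((name, p, q))
    return hits
```

    Because the index is built once, each later lookup is a dictionary
    access rather than a scan of the raw text, which is the main reason
    an indexed search can return in seconds where an unindexed one takes
    minutes.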

    Thanks for the consideration,


    PWSZ/NKJO Poland
