[Corpora-List] Problems with the architecture for a large corpus (for SQL or database gurus)

From: Mark Davies (mdavies@ilstu.edu)
Date: Tue Oct 22 2002 - 07:44:27 MET DST

  • Next message: Ha Le An: "RE: [Corpora-List] alternative methods/approaches in automatic term extraction"

    RE: [Corpora-List] Legal aspects of corpora compilingI’m experimenting with
    a change to the architecture of a 100 million word corpus of Spanish that I’
    ve created (www.corpusdelespanol.org). With my current solution, things
    work pretty well. Nevertheless, for some reasons listed below, I'm
    considering a change to another database architecture, but the queries with
    the new approach are too slow. Here’s an overview:

    ================
    CURRENT APPROACH
    ================

    I use large databases of n-grams that contain all of the distinct
    1,2,3,4-grams in the entire 100 million word corpus, along with their
    frequency in several historical periods and registers of Modern Spanish.
    Using this approach, a typical query looks like the following:

    select top 300 * from x3 where
    w1 in ('le', 'les')
    and
    w2 in (select w1 from x_L where x1 = 'hacer')
    and
    w3 in (select w1 from x_c where x1 = 'v_inf')
    order by x19 DESC

    where w1,w2, and w3 are the three word slots in x3 (a table containing all
    of the 40+ million distinct 3-grams in the corpus), and x_L and x_c are two
    separate tables containing the lemma or part of speech (column x1) for about
    1,000,000 distinct forms (w1). ( Note: in order to run this query, the user
    submits [ le/les hacer.* *.v_inf ] (without the brackets)).

    The preceding example, then, selects all of the distinct three word strings
    composed of [le or les] + [a form of hacer "to do"] + [an infinitive], e.g.
    le hice decir, les hagan comprar, etc., and then sorts them by their
    frequency in the 1900s (column x19).

    This approach has a number of advantages, including the ability to search
    directly by frequency in a given sub-corpus (i.e. all of the matching
    strings that occur more than three times in the 1900s, but no times in the
    1700s or 1800s), and it is also very fast -- about one or two seconds for
    rather complicated searches like this.

    The downside, and the reason I’m considering changing the architecture of
    the database, is that every time I add or delete a text from the corpus, I
    have to re-calculate all of the 1,2,3,4-grams tables, which is quite
    time-consuming. In addition, if I want to allow search within a five word
    window, for example, I then have to go back and create a new table with all
    of the distinct 5-grams in the entire corpus, which also takes a long time.

    ================
    NEW APPROACH
    ================

    So what I’ve been trying to do is to use a new database that is composed of
    all of the sequential words in the corpus. For a 100 million word corpus,
    for example, there would be 100 million rows with the two columns: ID
    (offset value) and W1 (word), e.g.

    1 I
    2 saw
    3 the
    4 man
     . . .
    100000000 xxx

    To extract strings, I would then use self-joins on this one table, in which
    [ID], [ID-1], [ID+1] etc are used to find preceding and following words,
    e.g.:

    select count(*),w1.w1,w2.w1,w3.w1 from
    ((select w1, ID+1 as ID from seq where w1 in ('le',’les’)) w1
    inner join
    (select w1, ID as ID from seq where w1 in (select w1 from x_c where x1 =
    ‘hacer’)) w2
    on w2.ID = w1.ID)
    inner join
    (select w1, ID-1 as ID from seq where w1 in (select w1 from x_l where x1 =
    'ver')) w3
    on w3.ID=w1.ID
    group by w1.w1,w2.w1,w3.w1

    The will produce the same results set as the previous query (le hice decir,
    les hagan comprar, etc).

    ================
    THE PROBLEM
    ================

    The problem is, the self-join solution is extremely slow. I have a SQL
    Server 7.0 database with a clustered index on W1 (sequential words) and a
    normal index on ID, as well as clustered indexes on X1 (the lemma and part
    of speech column) in the POS and lemma tables. Even with all of this,
    however, a self-join query like the one just listed takes about 15 seconds
    on my machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
    Using the n-grams approach, however, it takes about one or two seconds.

    For all of you SQL gurus out there, am I doing something wrong in terms of
    the database architecture of the table on which I do the self-join, or else
    in the SQL syntax itself?

    As far as I’m aware, this self-join solution -- with all of the words in the
    corpus as sequential rows in the database -- is the one used by many other
    large corpora. But I’m guessing that they don’t take 15-20 seconds for a
    query involving a three word string like [exact words] + [lemma for one
    verb] + [POS, e.g. infinitives]. So they must be doing something different.

    Any suggestions? Thanks in advance.

    Mark Davies
    Illinois State University



    This archive was generated by hypermail 2b29 : Tue Oct 22 2002 - 20:49:45 MET DST