Corpora: Using corpora to correct grammar

From: Mark Davies (mdavies@ilstu.edu)
Date: Wed Oct 10 2001 - 18:14:40 MET DST


    Is anyone aware of projects that have used very large corpora as a database
    to help correct compositions written by learners of that language?

    For example, you might have a 40-50 million word corpus of, let's say,
    Spanish. First you'd extract all of the one-, two-, and three-word clusters
    in the corpus and import them into a database. (Sounds hard, but it's doable --
    I'm working on something like that right now). Then you'd have a web-based
    form, for example, where Spanish students could input their 500-1000 word
    composition. The script would check every single word and every two- and
    three-word cluster in the composition against the database built from the
    multi-million word corpus. At the one-word level, it would simply act like
    a spell checker. At the two- and three-word (and longer) level, it would
    act like a modified grammar checker, except that the database has access
    to a frequency listing for each bigram or trigram, whereas a grammar
    checker works from more abstract rules.
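    To make the extraction step concrete, here is a rough sketch in Python
    (the corpus file name and the function name are just illustrative, not an
    existing tool):

    # Rough sketch of the cluster-extraction step: count every 1-, 2-, and
    # 3-word string in a plain-text corpus file.
    import re
    from collections import Counter

    def extract_ngrams(path, max_n=3):
        counts = {n: Counter() for n in range(1, max_n + 1)}
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = re.findall(r"\w+", line.lower(), flags=re.UNICODE)
                for n in range(1, max_n + 1):
                    for i in range(len(words) - n + 1):
                        counts[n][tuple(words[i:i + n])] += 1
        return counts

    # e.g.  counts = extract_ngrams("corpus_es.txt")    (hypothetical file)
    #       counts[3][("a", "pesar", "de")]  -> corpus frequency of that trigram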

    If the specific two- or three-word string matches a record in the database
    (built from the multi-million word corpus), then it's marked "OK". If not --
    or if it appears in the corpus at a frequency below a certain threshold --
    then the string is marked "suspect". In this case, the script would then
    look up other forms of the same lemma (and perhaps synonyms for the words
    as well) among the two- and three-word strings in the database, and suggest
    some of these as options to the student.
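    In code, the checking step might look something like this (again just a
    sketch: the threshold value is arbitrary, and the lemma/synonym expansion
    is left out since it depends on whatever lemmatizer or thesaurus is used):

    import re

    def check_composition(text, trigram_counts, threshold=2):
        # Flag any trigram in the student text that is unattested (or rare)
        # in the corpus counts produced by extract_ngrams() above.
        words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
        report = []
        for i in range(len(words) - 2):
            tri = tuple(words[i:i + 3])
            freq = trigram_counts.get(tri, 0)
            status = "OK" if freq >= threshold else "suspect"
            report.append((" ".join(tri), freq, status))
        return report

    # for phrase, freq, status in check_composition(student_text, counts[3]):
    #     if status == "suspect":
    #         print(phrase, freq)   # then look up lemma variants / synonyms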

    Of course, the generative grammar idea is that there are an infinite number
    of sentences, so you wouldn't want to try this on 5 or 10 or 15 word
    strings -- chances are they wouldn't match anything in the 40-50 million
    word corpus. At the level of two and three word strings, though, I think
    you'd find a much narrower range of entries in the database, and therefore
    your ability to predict that these are "suspect" would be much
    better. (Actually, the number of unique trigrams in a 50 million word
    corpus ends up being about 20-30 million distinct strings, so it's not THAT
    limited. And obviously, it takes an "industrial strength" database
    (Oracle, SQL Server, etc.) to handle a table of that size efficiently.)
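    As for storage, the table layout itself is simple. Here is a sketch using
    Python's built-in sqlite3 module just to show the idea; for 20-30 million
    rows you would of course point the same schema at Oracle, SQL Server,
    etc., as mentioned above (file and function names are again illustrative):

    import sqlite3

    conn = sqlite3.connect("ngrams.db")   # hypothetical file name
    conn.execute("""CREATE TABLE IF NOT EXISTS trigrams (
                        w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER,
                        PRIMARY KEY (w1, w2, w3))""")

    def load_trigrams(conn, trigram_counts):
        # Bulk-load the counts from extract_ngrams() into the table.
        conn.executemany(
            "INSERT OR REPLACE INTO trigrams VALUES (?, ?, ?, ?)",
            ((w1, w2, w3, f) for (w1, w2, w3), f in trigram_counts.items()))
        conn.commit()

    def trigram_freq(conn, w1, w2, w3):
        # Frequency lookup for a single trigram; 0 means unattested.
        row = conn.execute(
            "SELECT freq FROM trigrams WHERE w1=? AND w2=? AND w3=?",
            (w1, w2, w3)).fetchone()
        return row[0] if row else 0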

    The reason for wondering about such a project is that I'm teaching a
    mid-level Spanish composition course this next year, and I'd like to have
    some way to automate correction of some of the most common, low-level
    errors ("tener un buen tiempo", "yo dije a ella", etc). Higher level stuff
    (disjoint agreement, semantics at the sentential level, etc) would be way
    beyond the capability of such a program. But I do think that it does have
    some potential for low-level, narrow-clause type of phenomena.

    Anyway, have there been projects similar to this in the past? If so, any
    references would be appreciated. I'll summarize for the list if there's
    interest. Thanks in advance.

    Mark Davies

    ====================================================
    Mark Davies, Associate Professor, Spanish Linguistics
    4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    309-438-7975 (voice) / 309-438-8083 (fax)
          http://mdavies.for.ilstu.edu
    ** Historical and dialectal Spanish and Portuguese syntax **
    ** Corpus design and use / Web-database scripting / Distance education **
    =====================================================


