Re: Corpora: Using corpora to correct grammar

From: Tony Berber Sardinha (tony4@uol.com.br)
Date: Thu Oct 11 2001 - 19:10:47 MET DST


    Dear Mark

    I've been thinking about such a tool for a long time - a collocation / pattern
    checker would be a great tool for language learners.

    I tried this back in 1995/1996 for a paper I presented at the Aston Corpus
    Seminar. At the time, extracting n-grams was very hard on a Windows/DOS-based
    PC, but today it is much simpler with a program such as WordSmith Tools (with
    the clusters option activated) or with scripts such as the one included in the
    Brill tagger distribution (bigram-generate.prl, which can be adapted to extract
    3-, 4-, 5-grams, etc., and then run using DOS versions of awk and Windows
    ActivePerl).

    More recently, I compared the frequency of 3-grams in two learner corpora, one
    of texts written by Brazilian EFL students and the other of essays written by
    students from an Anglo-Brazilian bilingual school. The comparison was carried
    out using lists of the most frequent 3-grams in English, representing 10
    frequency bands. I then used Unix-like tools running in DOS (uniq, grep, etc.) to
    find how many 3-grams of each frequency band were present in each corpus. This
    is similar to the kind of frequency analysis that Paul Nation's 'range' program
    produces.
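    (Scripted directly rather than with uniq and grep, the band comparison boils
    down to set-membership checks. A rough Python sketch, assuming one plain-text
    list of 3-grams per frequency band and one list of 3-grams from the learner
    corpus - the file names are only illustrative:)

        def load_list(path):
            # One n-gram per line.
            with open(path, encoding='latin-1') as f:
                return {line.strip() for line in f if line.strip()}

        learner = load_list('learner_3grams.txt')
        for i in range(1, 11):                      # 10 frequency bands
            band = load_list(f'band{i}_3grams.txt')
            hits = sum(1 for g in learner if g in band)
            print(f'band {i}: {hits} of {len(learner)} learner 3-grams')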

    One practical problem I see in doing this kind of work is extracting n-grams
    from a large corpus such as the BNC on a PC - WordSmith Tools will stop
    processing the corpus on my machine (P III 550 MHz, 128 MB RAM) after a few
    million words. One possible solution is to split the corpus into samples,
    extract the clusters from those samples and then join the lists with the 'Merge
    Lists' function. The disadvantage here is that clusters whose frequency falls
    below the cut-off point (e.g. a frequency of 1) in two or more of the separate
    corpus samples will not be included in the final merged list, even though their
    combined frequency for the whole corpus may be above the cut-off, so the merged
    frequencies are inexact. The other practical problem is that WordSmith Tools
    will not allow
    you to pull out n-grams of frequency 1 in the learner data, since the minimum
    frequency is 2, but this can be overcome by 'cheating' a little: just choose the
    corpus texts twice, and so n-grams which originally had a frequency of 1 will
    then have a frequency of 2, and will thus be included in the wordlist.
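    (If you are willing to leave WordSmith aside for that step, both problems go
    away when the samples are counted separately with no cut-off at all and the
    counts are only summed at the end. A rough Python sketch, with hypothetical
    sample file names:)

        import glob
        import re
        from collections import Counter

        def file_ngrams(path, n):
            # Count n-grams in one corpus sample (naive tokenisation).
            counts = Counter()
            with open(path, encoding='latin-1') as f:
                for line in f:
                    toks = re.findall(r"[a-z']+", line.lower())
                    counts.update(' '.join(toks[i:i + n])
                                  for i in range(len(toks) - n + 1))
            return counts

        # Sum the per-sample counts; there is no per-sample cut-off, so the totals
        # stay exact and frequency-1 clusters are kept without any 'cheating'.
        total = Counter()
        for path in glob.glob('bnc_sample*.txt'):   # hypothetical sample files
            total += file_ngrams(path, 3)

        for gram, freq in total.most_common():
            print(f'{freq}\t{gram}')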

    A problem of a more 'conceptual' kind is of course that clusters formed by
    adjacent words only will not represent the full range of pre-fabs in English or
    Spanish, and so perfectly acceptable patterns in learner compositions may be
    marked as 'suspect' simply because they do not match any adjacent n-grams in the
    native-language reference corpus.

    cheers
    tony.
    -------------------------------------
    Dr Tony Berber Sardinha
    LAEL, PUC/SP
    (Catholic University of Sao Paulo, Brazil)
    tony4@uol.com.br
    www.tonyberber.f2s.com

    ----- Original Message -----
    From: "Mark Davies" <mdavies@ilstu.edu>
    To: <corpora@hd.uib.no>
    Sent: Wednesday, October 10, 2001 13:14
    Subject: Corpora: Using corpora to correct grammar

    > Is anyone aware of projects that have used very large corpora as a database
    > to help correct compositions written by learners of that language?
    >
    > For example, you might have a 40-50 million word corpus of, let's say,
    > Spanish. First you'd extract all of the 1, 2, and 3 word clusters in the
    > corpus and import this into a database. (Sounds hard, but it's doable --
    > I'm working on something like that right now). Then you'd have a web-based
    > form, for example, where Spanish students could input their 500-1000 word
    > composition. The script would compare every single word and every two and
    > three word cluster in the composition and see if these appear in the
    > database from the multi-million word corpus. At the one word level, it
    > would simply be like a spell checker. At the two and three (and more)
    > level, it would be like a modified grammar checker (except that the
    > database has access to a frequency listing for each bigram or trigram,
    > whereas a grammar checker works on more abstract rules).
    >
    > If the specific two or three word string matches a record in the database
    > (from the multi-million word corpus), then it's marked "OK". If not -- or
    > if it appears at a frequency in the corpus below a certain threshold --
    > then the two or three word string is marked as "suspect". In this case,
    > the script would then look up other forms of the same lemma (and perhaps
    > synonyms for the words as well) in the two or three word strings of the
    > database, and suggest some of these as options to the students.
    >
    > Of course, the generative grammar idea is that there are an infinite number
    > of sentences, so you wouldn't want to try this on 5 or 10 or 15 word
    > strings -- chances are they wouldn't match anything in the 40-50 million
    > word corpus. At the level of two and three word strings, though, I think
    > you'd find a much narrower range of entries in the database, and therefore
    > your ability to predict that these are "suspect" would be much
    > better. (Actually, the number of unique trigrams in a 50 million word
    > corpus ends up being about 20-30 million distinct strings, so it's not THAT
    > limited. And obviously, it requires an "industrial strength" database
    > (Oracle, SQL Server, etc) to efficiently handle a database this size.)
    >
    > The reason for wondering about such a project is that I'm teaching a
    > mid-level Spanish composition course this next year, and I'd like to have
    > some way to automate correction of some of the most common, low-level
    > errors ("tener un buen tiempo", "yo dije a ella", etc). Higher level stuff
    > (disjoint agreement, semantics at the sentential level, etc) would be way
    > beyond the capability of such a program. But I do think that it does have
    > some potential for low-level, narrow-clause type of phenomena.
    >
    > Anyway, have there been projects similar to this in the past? If so, any
    > references would be appreciated. I'll summarize for the list if there's
    > interest. Thanks in advance.
    >
    > Mark Davies
    >
    > ====================================================
    > Mark Davies, Associate Professor, Spanish Linguistics
    > 4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    > 309-438-7975 (voice) / 309-438-8083 (fax)
    > http://mdavies.for.ilstu.edu
    > ** Historical and dialectal Spanish and Portuguese syntax **
    > ** Corpus design and use / Web-database scripting / Distance education **
    > =====================================================
    >
    >


