Corpora: Summary: Using corpora to correct grammar

From: Mark Davies (mdavies@ilstu.edu)
Date: Mon Oct 15 2001 - 15:03:00 MET DST


    Thanks to the following people who responded to my query about using large
    corpora as a database to help correct student essays.

    Pete Whitelock
    Mike O'Connell
    Gosse Bouma
    John Milton
    Oliver Mason
    Tom Vanallemeersch
    Tony Berber Sardinha

    Not only do the replies address the methodological issues involved, but
    they also give good insight into some of the practical issues of creating
    the lists of n-grams that would form the database.

    Since nearly all of the replies were sent directly to me, rather than to
    the list, I'll re-post them here.

    From: Pete Whitelock <pete@sharp.co.uk>
    From: Gosse Bouma <gosse@let.rug.nl>

    What you suggest has been tried, for English, at the Educational Testing
    Service - see:

    Martin Chodorow and Claudia Leacock. 2000. An unsupervised method for
    detecting grammatical errors. In Proceedings of the 1st Annual Meeting of
    the North American Chapter of the Association for Computational Linguistics,
    140-147.

    Their web page is at
    http://www.etstechnologies.com/scoringtech-scientists.htm

    ----------------------------------------------------------------

    From: Mike O'Connell <Michael.Oconnell@Colorado.EDU>

       A similar project at CU-Boulder uses Latent Semantic Analysis (latent
    semantic indexing) to compare essays to known standards. The basic
    technology is derivative of work on information retrieval, but they're
    also trying out language models using n-grams of various lengths for
    different purposes. They have a research project, called Summary Street I
    believe, that does automatic grading of essays by comparing them to a
    model based on graded essays, although I don't think they've formally
    applied the n-gram language-model approach you're describing in any
    published work.
       So while there are similarities, the projects aren't exactly the same,
    but anyway, just FYI.
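
    A minimal sketch of the general LSA approach described here (not the
    Summary Street system itself), assuming scikit-learn and placeholder
    essay texts:

        # Compare a student essay to graded model essays in a reduced
        # "semantic" space (TF-IDF + truncated SVD); illustrative only.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.metrics.pairwise import cosine_similarity

        model_essays = ["text of a high-scoring essay", "text of another graded essay"]
        student_essay = "text of the new student essay"

        vectorizer = TfidfVectorizer()
        X = vectorizer.fit_transform(model_essays + [student_essay])

        svd = TruncatedSVD(n_components=2)          # low-rank semantic space
        reduced = svd.fit_transform(X)

        # similarity of the student essay to each graded model essay
        print(cosine_similarity(reduced[-1:], reduced[:-1]))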

    ----------------------------------------------------------------

    From: John Milton <lcjohn@ust.hk>

    I tried somewhat the opposite approach, but ran into some of the same
    problems that I would have run into with your approach, at least with the
    type of errors produced by my students. I originally thought it would be
    a good idea to extract a set of grammatically impossible collocations
    from a large (25 million word) corpus of Cantonese speakers' English
    texts and flag these whenever a student produced them. The trouble is
    that the corpus that I have (essays written by students graduating from
    secondary school in Hong Kong with barely passing grades in English)
    contained very few 'illegal' short collocations. Their biggest problem is
    in the use of particles (e.g. "This will benefit to me."). It's easy to
    figure out the intralingual confusion that results in this type of error,
    but it's difficult in practice to flag it reliably (e.g., they also drop
    the auxiliary verb and produce "This benefit to me."). I wrote a toy
    program anyway to see what kind of reliability I might get, and in fact
    got too many false positives and false negatives for it to be useful.
    Attacking these types of problems would require an authorable grammar
    checker whose analysis, especially parsing, is really reliable. I know of
    no such program currently available (e.g. I tried L&H's Chinese grammar
    checker a few years ago).
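
    As a rough illustration of the kind of toy checker described above, one
    could scan learner text against a hand-built list of 'impossible'
    collocations (the list and example below are purely illustrative):

        import re

        # Illustrative 'impossible' collocations; in practice these would be
        # extracted from a learner-error corpus, not written by hand.
        ILLEGAL = ["benefit to me", "discuss about", "emphasize on"]

        def flag_collocations(text, patterns=ILLEGAL):
            """Return (pattern, character offset) for each match in the text."""
            hits = []
            lowered = text.lower()
            for pat in patterns:
                for m in re.finditer(r"\b" + re.escape(pat) + r"\b", lowered):
                    hits.append((pat, m.start()))
            return hits

        print(flag_collocations("This will benefit to me in the future."))
        # -> [('benefit to me', 10)]

    The false positives mentioned above show up immediately: a perfectly
    grammatical string such as "This will be of benefit to me" matches the
    same pattern.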

    Nevertheless, I'd be interested in hearing how your approach works with
    the types of strings your students produce.

    ----------------------------------------------------------------

    From: Oliver Mason <oliver@clg.bham.ac.uk>

    Just an additional idea: you could do the same not only with words, but
    also with POS tags, or even chunked phrase tags, which might give you
    wider coverage of grammatical errors. Depending on the granularity of
    the tagset, certain errors such as word order or noun-verb agreement
    might be detectable.

    You could then flag something as "verb expected" or "plural noun
    expected" when there is a mismatch. I'm not sure exactly how that would
    work, as it would more likely yield probabilities than absolute
    judgements.
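
    A rough sketch of the POS-tag variant, assuming NLTK and its Brown corpus
    as the reference (tag trigrams that are rare in the reference get
    flagged, which is a crude frequency stand-in for the probabilistic
    judgement mentioned above):

        # Needs: nltk.download() for 'brown', 'punkt',
        # 'averaged_perceptron_tagger' and 'universal_tagset'.
        from collections import Counter
        import nltk

        # reference POS-tag trigram counts from the Brown corpus
        ref_tags = [t for _, t in nltk.corpus.brown.tagged_words(tagset="universal")]
        ref_counts = Counter(zip(ref_tags, ref_tags[1:], ref_tags[2:]))

        def flag_tag_trigrams(sentence, min_count=5):
            """Flag word windows whose tag trigram is rare in the reference."""
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal")
            tags = [t for _, t in tagged]
            suspects = []
            for i, tri in enumerate(zip(tags, tags[1:], tags[2:])):
                if ref_counts[tri] < min_count:     # rare or unseen tag pattern
                    suspects.append(tagged[i:i + 3])
            return suspects

        print(flag_tag_trigrams("He have going to school yesterday."))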

    Anyway, your project sounds like a very interesting idea.

    ----------------------------------------------------------------

    From: Tom Vanallemeersch <tom.vanallemeersch@lantworks.com>

    You may be able to solve the problem of looking up word clusters in a way
    that doesn't need an "industrial strength" database - though a lot of
    memory will still be needed. This can be done using a so-called Patricia
    (PAT) array - essentially a word-level suffix array - which is created by
    storing a number of text positions in an array and sorting them on the
    string starting at each position.

    In your case, you'd store the text positions at which words start and
    sort these positions on the word starting at each position (e.g. if the
    word "house" is at position 15400 and the word "single" at position
    10000, then position 15400 would precede position 10000 in the array,
    since "house" sorts before "single"). Then, using binary search, it is
    possible to find out whether a word or word group (of any length) is
    present in the corpus, and what the frequency of the word/word group is.

    There are some memory issues:
    - given a 50-million-word corpus, the array will need 50 million x 4
    bytes = roughly 200 MB, so you would need at least 256 MB of RAM on a PC;
    - when creating or searching the array, access to the corpus is needed;
    this may mean that the program which creates/searches the array has to do
    a lot of lookups on the hard disk (unless you can store the whole corpus
    in RAM, which seems difficult); this is less of a problem for lookup than
    for creating the array.
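
    A minimal sketch of the idea in Python, assuming the corpus fits in
    memory as a list of tokens (for a 50-million-word corpus the in-memory
    slices would have to give way to the disk-based comparisons described
    above, and the file name is an assumption):

        # Word-level PAT/suffix array: positions sorted by the token sequence
        # starting at each position; binary search then gives phrase frequency.
        tokens = open("corpus.txt", encoding="utf-8").read().split()

        # sort keys are truncated at 10 tokens, so queries up to 10 words work
        positions = sorted(range(len(tokens)), key=lambda i: tokens[i:i + 10])

        def _bound(words, right):
            """First index whose segment is >= words (or > words if right)."""
            lo, hi = 0, len(positions)
            n = len(words)
            while lo < hi:
                mid = (lo + hi) // 2
                seg = tokens[positions[mid]:positions[mid] + n]
                if seg < words or (right and seg == words):
                    lo = mid + 1
                else:
                    hi = mid
            return lo

        def phrase_frequency(phrase):
            """Frequency of a word or word group (up to 10 words here)."""
            words = phrase.split()
            return _bound(words, True) - _bound(words, False)

        print(phrase_frequency("in spite of"))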

    ----------------------------------------------------------------

    From: Tony Berber Sardinha <tony4@uol.com.br>

    I've been thinking about such a tool for a long time - a collocation / pattern
    checker would be a great tool for language learners.

    I tried this back in 1995/1996 for a paper I presented at the Aston Corpus
    Seminar. At the time, extracting n-grams was very hard on a Windows/DOS-based
    PC, but today it is much simpler, using a program such as WordSmith Tools
    (with the clusters option activated) or a script such as the one included in
    Brill's tagger distribution (bigram-generate.prl, which can be adapted to
    find 3-, 4-, 5-grams etc. and then run using DOS versions of awk and Windows
    ActivePerl).
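
    For what it is worth, a few lines of Python now do the same job as the
    adapted script (the file name is an assumption):

        from collections import Counter

        def ngram_counts(path, n=3):
            """Count adjacent word n-grams in a plain-text corpus file."""
            tokens = open(path, encoding="utf-8").read().lower().split()
            return Counter(zip(*(tokens[i:] for i in range(n))))

        counts = ngram_counts("corpus.txt", n=3)
        for gram, freq in counts.most_common(10):
            print(" ".join(gram), freq)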

    More recently, I compared the frequency of 3-grams in two learner corpora,
    one of texts written by Brazilian EFL students and the other of essays
    written by students from an Anglo-Brazilian bilingual school. The comparison
    was carried out using lists of the most frequent 3-grams in English,
    representing 10 frequency bands. I then used Unix-like tools running in DOS
    (uniq, grep, etc.) to find how many 3-grams of each frequency band were
    present in each corpus. This is similar to the kind of frequency analysis
    that Paul Nation's 'range' program produces.
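
    The band comparison can be sketched the same way, assuming a reference
    file with one 'band <tab> 3-gram' entry per line and a plain-text learner
    corpus (both file names and the format are assumptions):

        def load_bands(path):
            """Reference 3-grams grouped into frequency bands."""
            bands = {}
            for line in open(path, encoding="utf-8"):
                band, gram = line.rstrip("\n").split("\t")
                bands.setdefault(int(band), set()).add(gram)
            return bands

        def learner_trigrams(path):
            tokens = open(path, encoding="utf-8").read().lower().split()
            return {" ".join(tri) for tri in zip(tokens, tokens[1:], tokens[2:])}

        bands = load_bands("bnc_3gram_bands.txt")
        found = learner_trigrams("learner_corpus.txt")
        for band in sorted(bands):                  # 3-grams of each band present
            print(band, len(bands[band] & found))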

    One practical problem I see in doing this kind of work is extracting
    n-grams from a large corpus such as the BNC on a PC - WordSmith Tools will
    stop processing the corpus on my machine (P III 550 MHz, 128 MB RAM) after
    a few million words. One possible solution is to split the corpus into
    samples, extract the clusters from those samples, and then join the lists
    with the 'Merge Lists' function. The disadvantage here is that clusters
    whose frequency is below the cut-off point (e.g. 1) in two or more
    separate corpus samples will not be included in the final merged list,
    resulting in inexact frequencies for the whole corpus. The other practical
    problem is that WordSmith Tools will not let you pull out n-grams of
    frequency 1 in the learner data, since the minimum frequency is 2, but
    this can be overcome by 'cheating' a little: just choose the corpus texts
    twice, so that n-grams which originally had a frequency of 1 will then
    have a frequency of 2 and will thus be included in the wordlist.
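
    With a general-purpose script the merge problem largely disappears, since
    every cluster (down to frequency 1) can be kept per sample and the counts
    simply summed - a sketch, with assumed file names:

        from collections import Counter
        import glob

        total = Counter()
        for path in glob.glob("bnc_sample_*.txt"):  # one file per corpus sample
            tokens = open(path, encoding="utf-8").read().lower().split()
            total.update(zip(tokens, tokens[1:], tokens[2:]))

        # exact whole-corpus frequencies, frequency-1 clusters included
        for gram, freq in total.most_common(20):
            print(" ".join(gram), freq)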

    A problem of a more 'conceptual' kind is of course that clusters formed by
    adjacent words only will not represent the full range of pre-fabs in English or
    Spanish, and so perfectly acceptable patterns in learner compositions may be
    marked as 'suspect' because they did not match any n-grams in the native
    language reference corpus.

    ====================================================
    Mark Davies, Associate Professor, Spanish Linguistics
    4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    309-438-7975 (voice) / 309-438-8083 (fax)
    http://mdavies.for.ilstu.edu/

    ** Corpus design and use / Web-database programming and optimization **
    ** Historical and dialectal Spanish and Portuguese syntax / Distance education **
    =====================================================


