Corpora: Diacritics and "deviant" texts in corpora

From: Jem Clear (jem@jemclear.co.uk)
Date: Sat Apr 21 2001 - 23:39:20 MET DST

    I must urge Tadeusz Piotrowski **not** to standardize or
    normalize Polish e-mails or news agency feeds when
    adding them to his corpus.

    His question (whether the corpus builder should regard an email
    message which lacks diacritics as defective and should correct the
    defects) seems to me very important for all corpus linguists. My
    mentor and friend John Sinclair has always banged on about keeping the
    corpus data "raw". This is precisely a situation where "improving" the
    data at the time of data capture will lead to horrible confusion.

    For many years people who used the Cobuild Bank of English corpus
    moaned at me because of the "errors" in it. (Indeed, none of us is
    perfect and there are some errors in the corpus!) But often the
    perception that there were too many "errors" in the corpus came about
    simply because the data collected did not conform to the linguists'
    prior assumptions about what English text **should** look like.

    Taking a practical (if somewhat flippant) example: suppose we were to
    correct all non-standard uses of the English apostrophe, such as "I
    love it's nutty taste" -> "I love its nutty taste". It would become
    pointless to conduct corpus investigations into the use of the
    apostrophe in English, because the raw data would have been tampered
    with.
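
    The point can be illustrated with a toy sketch in Python. Everything
    in it is an assumption made for illustration: the second and third
    sample sentences are invented, and the POSSESSIVE_CUES word list is a
    crude hypothetical heuristic, not a real method for detecting
    non-standard apostrophes. It only shows that a query which finds
    hits in the raw text finds none once the text has been "corrected".

```python
import re

# Toy "raw" corpus. The first line is the example from the message;
# the other two are invented for illustration.
raw_corpus = [
    "I love it's nutty taste",
    "The dog wagged it's tail",
    "I think it's going to rain",
]

# Hypothetical heuristic: "it's" directly followed by one of a few words
# that suggest a possessive reading. A real study would need far more care.
POSSESSIVE_CUES = ("nutty", "tail", "own")
pattern = re.compile(r"\bit's(?=\s+(?:%s)\b)" % "|".join(POSSESSIVE_CUES))


def normalize(line):
    """Simulate 'improving' the text at data-capture time."""
    return pattern.sub("its", line)


def count_nonstandard(corpus):
    """Count hits for the non-standard apostrophe pattern."""
    return sum(len(pattern.findall(line)) for line in corpus)


normalized_corpus = [normalize(line) for line in raw_corpus]

print("raw corpus hits:       ", count_nonstandard(raw_corpus))
print("normalized corpus hits:", count_nonstandard(normalized_corpus))
```

    Run against the raw lines, the query has something to count; run
    against the normalized lines, the phenomenon has vanished from the
    data, which is the confusion described above.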

    This notion of "raw" data is crucial: it goes to the very heart of
    corpus linguistics as a distinct and innovative branch of linguistic
    science. If you don't trust or believe in the data you collect, then
    you might just as well invent your own sentences and study them
    instead! Here's one to get you started: "Colourless green ideas sleep
    furiously" -- that should keep a few people going for the next 30
    years...

    Jem Clear

    Jem Clear Ltd
    29 School Road, Moseley, Birmingham, B13 9TF, UK
    Tel & Fax: +44 (0)121 689 3637
    Email: jem@jemclear.co.uk


