Corpora: Robert De Beaugrande, Chomsky, and corpus linguistics

From: ramesh@clg.bham.ac.uk
Date: Thu Apr 26 2001 - 15:56:09 MET DST

    I completely agree with Robert de Beaugrande's paragraph 65:
    "65. As a corpus gets larger, it does not simply show us the same data
    multiplied out, eg., each item being ten times as frequent in a corpus
    ten times as large. Instead, the larger corpus both turns up fresh data
    that did not appear at all in the smaller ones and displays the previous
    data in steadily finer delicacy for the range and frequency of the
    combinations. Hosts of regularities emerge that escaped notice in smaller
    data sets, and would elude unguided intuition and introspection. [...]
    Instead of coverage, convergence, and consensus decreasing when natural
    language data get rewritten into a formal notation, they are now increasing
    when data get treated in their naturally occurring formats."

    I only partly agree with paragraph 66:
    "66. Conversely, the corpus highlights the improbable and unnatural
    quality of invented data like 'John is eager to please'. Typical contexts
    of real discourse call for less simple-minded and peremptory utterances.
    For example, all three instances of 'eager to please' in the Bank of
    English have a Direct Object Target and a more interesting Subject Agent
    than the legendary 'John'. eg., the 'government' keen to 'please' powerful
    forces such as 'wealth' and 'the Church'
    [18] <a government offical who is eager to please the wealth goddess>
    [19] <the Sandinstas. The government is eager to please the church>"

    The general point "the corpus highlights the improbable and unnatural
    quality of invented data" is certainly valid. I have found many invented
    examples in dictionaries, other language reference books, and linguistics
    textbooks which simply are not reflected in corpus data (to quote just a
    couple of examples:
    "Don't hold the gun by the business end" in an EFL dictionary,
    only one example of "by the business end" in the Bank of English 418
    million word corpus:
     You know, the sort produced by the business end of cows.
    Out of 151 examples for "the business end", 45 are for "at the business
    end", of which 32 are for "at the business end of"; 14 examples of
    "on the business end" of which 13 are for "on the business end of";
    10 for "with the business end" of which 9 are "with the business end of";
    etc. The point is, 102 of 151 examples are in a prepositional phrase,
    so the *colligation* PREP+the+business+end is well-attested, but
    *not the collocation* with the lexical item "by" representing the
    class PREP. More importantly, 115 of the 151 examples are followed
    by "of", which is absent from the dictionary example. So the
    colligation with a following "of" has escaped notice when intuition
    alone was relied upon.
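
    The kind of count reported above (which word precedes "the business
    end", and how often "of" follows it) can be sketched roughly as below.
    This is only an illustration under stated assumptions, not the Bank of
    English query software: the function name, the short preposition list
    and the sample lines are all hypothetical, and only the first sample
    line comes from the corpus evidence quoted above.

```python
import re
from collections import Counter

# Assumption: a small illustrative preposition list, not a full PREP class.
PREPOSITIONS = {"at", "by", "on", "with", "from", "to", "in", "near"}


def phrase_context(lines, phrase="the business end"):
    """Count the words immediately before and after each hit of `phrase`."""
    target = phrase.lower().split()
    n = len(target)
    before, after = Counter(), Counter()
    total = 0
    for line in lines:
        # Crude tokenisation: lower-case alphabetic words only.
        tokens = re.findall(r"[a-z']+", line.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                total += 1
                if i > 0:
                    before[tokens[i - 1]] += 1
                if i + n < len(tokens):
                    after[tokens[i + n]] += 1
    return total, before, after


# Only the first line is taken from the message; the rest are invented.
sample = [
    "You know, the sort produced by the business end of cows.",
    "He found himself at the business end of a shotgun.",
    "Keep well away from the business end.",
    "The business end of the tool is sharp.",
]

total, before, after = phrase_context(sample)
prep_hits = sum(c for w, c in before.items() if w in PREPOSITIONS)
print(f"{total} hits, {prep_hits} after a preposition, {after['of']} followed by 'of'")
```

    Counting the preceding word separately from the following word keeps
    the two observations apart: the PREP slot (colligation) and the
    individual items filling it (collocation). The sketch treats each
    line as one unit and ignores sentence boundaries and punctuation.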

    Of course, another problem with "Don't hold the gun by the business end"
    is its limited contextualizability: how many of us would ever utter such
    a sentence (a parent to a child in a lax-gun-law state, a training officer
    in a police academy/the army?) and wouldn't we be more emphatic (e.g.
    Don't *ever* hold the gun by the business end)?

    "The plane overshot the runway" is another dictionary example, but
    in such a truncated form, it omits the fatal real-world consequences...
    "The arrow/missile overshot the target" rightly introduces the
    collocate "target", but most modern corpus examples are for governments
    and other organizations overshooting *financial targets*...)

    Unfortunately, Robert de Beaugrande's Chomskyan example "John is eager
    to please" is in fact well attested in the current Bank of English corpus,
    which actually adds even more substance to his point in paragraph 65
    about larger corpora. He was evidently using a much earlier - and smaller -
    version of the Bank of English, if it only had 3 examples of "eager to
    please". I have just checked in the 418 million word Bank of English,
    and there are 168 examples of "eager to please". 115 of the 168 are
    for the predicative use ("X is eager to please") or appositional use
    ("X, eager to please, is/does something", or sometimes sentence-initial,
    "Eager to please, X is/does something"). On a more delicate level (again
    supporting R de B's para 65), proper names (like "John") are much rarer
    than personal pronouns, and the phrase is often part of a list of
    attributes, e.g. "All were friendly, helpful and eager to please."
    Many examples have adverbial modifiers, e.g. "desperately eager to
    please", "touchingly eager to please", or just simple grading adverbs
    like "so", "very", "too". Another 23 of the 168 examples are for
    attributive use ("the/an eager-to-please X"), describing people or
    their face/expression/behaviour/attitude, etc.; these are usually
    hyphenated and do not mention the object. Only 31 out of 168 examples
    specify the object,
    i.e. who the people are trying to please, in examples such as
    "HE was eager to please manager Vialli" or
    "a candidate eager to please all sides".
    The evidence therefore suggests that "eager to please" is becoming
    a fixed phrase, and the direct object of the verb "please" is not
    usually mentioned (although in some contexts it may be implied,
    or picked up in a looser contextual relationship in a subsequent
    sentence). When the object is specified, the phrase seems to lose
    its feeling of fixity (an illustration of the ability of speakers
    to oscillate between the "idiom principle" and "open choice principle"
    as outlined by John Sinclair some years ago).
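
    To put the counts above into proportion, the shares of the 168 hits
    can be worked out as in the following sketch. The figures are the ones
    reported in this message; the labels and the function name are
    illustrative assumptions. Note that the three figures add up to 169,
    so they are presumably not a strict partition of the 168 examples
    (for instance, an example that specifies the object may also be
    predicative).

```python
# Counts as reported in the message for "eager to please".
TOTAL_HITS = 168
reported = {
    "predicative or appositional": 115,
    "attributive": 23,
    "object specified": 31,
}


def percentage_shares(counts, total):
    """Return each count as a percentage of `total`, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


shares = percentage_shares(reported, TOTAL_HITS)
for label, pct in shares.items():
    print(f"{label}: {pct}%")

# A positive value means the categories cannot all be mutually exclusive.
excess = sum(reported.values()) - TOTAL_HITS
print(f"sum of categories exceeds total by {excess}")
```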

    Terry Murphy is right to suggest that
    "Chomsky's comment about corpus linguistics not existing seems
    to be a logical response from someone whose whole enterprise would be
    undermined by the widespread adoption of real data"
    but I am not sure whether his description of the function of
    corpus data
    "as a mediator of conflicting linguistic judgements"
    is adequate or sufficient.
    Corpus data is certainly essential for an accurate description of
    language.

    Ramesh Krishnamurthy
    Consultant, Collins Dictionaries and Bank of English corpus
    Honorary Research Fellow, Corpus Linguistics, University of Birmingham


