RE: Corpora: Evidence and intuition

From: Patrick Hanks (patrick@lingomotors.com)
Date: Thu Nov 01 2001 - 16:13:09 MET

    David -

    Your point is well taken. It reminds us of an important
    distinction: corpus linguistics concerns itself with what
    *does* occur (more strictly, with what *has* occurred).
    Theoretical linguistics (at any rate, Chomskyan theoretical
    linguistics) concerns itself with what *might* occur. Given
    statistics such as the fact that, in most corpora, 2% of the
    types account for over 80% of the tokens (not to mention all
    the types which *might* occur but don't, e.g. inventions from
    the whole cloth of native phonology, or names and other
    borrowings from foreign languages), the distinction between
    the probable and the possible seems an important one.
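
    The type/token statistic mentioned here can be checked on any
    tokenized text. What follows is only a minimal Python sketch:
    the helper name `coverage_of_top_types`, the whitespace
    tokenization, and the toy sentence are all assumptions made for
    illustration, not anything from the corpora under discussion.

```python
from collections import Counter
import math


def coverage_of_top_types(tokens, top_fraction=0.02):
    """Return the share of tokens accounted for by the most frequent types.

    `tokens` is a list of word forms; `top_fraction` is the proportion
    of distinct types (ranked by frequency) to include.
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0
    # Number of types in the top slice; always keep at least one type.
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    top_total = sum(freq for _, freq in counts.most_common(n_top))
    return top_total / len(tokens)


# Toy data, invented for illustration only (hypothetical, not corpus data).
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(coverage_of_top_types(toy_tokens, top_fraction=0.02))
```

    On a real corpus one would substitute a proper tokenizer and a
    decision about case folding and lemmatization, both of which
    change what counts as a "type".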

    So I guess one's definition of "the most interesting
    phenomena" is a matter of taste or application. From a
    practical engineering point of view [I sit surrounded by
    engineers!], robust processing of the probable seems a more
    achievable and realistic goal than even imperfect processing
    of all possibilities.

    Patrick

    -----Original Message-----
    From: David Wible [mailto:dwible@mail.tku.edu.tw]
    Sent: Wednesday, October 31, 2001 11:59 PM
    To: Patrick Hanks; corpora@hd.uib.no
    Cc: CPA
    Subject: Re: Corpora: Evidence and intuition

    Patrick mentions theoretical linguists using doubts about the
    representativeness of corpora to cast doubt on results or conclusions
    drawn by linguists using corpora. In my experience, the criticisms are
    more likely of almost the opposite sort: that is, those inclined to
    criticize corpus research suggest that some of the most interesting
    phenomena about a speaker's knowledge are not the stuff that s/he hears
    examples of often (stuff that would presumably then occur frequently in
    corpora), but the reverse: strong intuitions about uses they perhaps
    have never encountered. Isn't it these sorts of data, and not those
    that are amply represented in 'representative' corpora, that make us
    ask: how did they ever come to know that?

    David Wible

    ----- Original Message -----
    From: "Patrick Hanks" <patrick@lingomotors.com>
    To: <corpora@hd.uib.no>
    Cc: "CPA" <CPA@lingomotors.com>
    Sent: Thursday, November 01, 2001 7:32 AM
    Subject: Corpora: Evidence and intuition

    A late contribution to the discussion sparked by Sebastian Hoffmann:

    I recently asked a few colleagues who are not corpus linguists to make
    up a couple of natural sentences using the word "total" as a verb. The
    answers typically fall into two classes:

    1. [[Driver]] total [[Vehicle]]
       e.g. Carina totaled the car.

    2. [[Person]] total [[Number]]
       e.g. John totaled the column of figures.

    In the British and American corpora that we are currently using (in
    particular BNC, Reuters, and 4 years of AP), sense 1 accounts for less
    than 1% of uses of the verb and sense 2 is even rarer: perfectly
    plausible, but next to non-existent.

    Over 98% of corpus uses of this verb fall into the following pattern:

    3. [[Entity (often plural)]] total [[Number | Amount]]
       e.g. Sales totaled 6 million.

    Why did this *very* common pattern of use not spring immediately to the
    minds of ordinary native speakers of British or American English?
    Hypotheses include:

    a) Introspection as a technique favors human subject roles.
    b) 3 is really a copula, "not a real verb".
    c) There is an inverse relationship between cognitive salience and
       social salience.

    Re 3, see (Hanks 1990), where I argued that people register the odd or
    unusual and fail to register what we do regularly or continuously.
    (Think of someone putting his/her hand on your arm. Now think of
    someone having had his/her hand on your arm all afternoon.)

    Whatever the reason, the phenomenon is a familiar one in lexical
    analysis, first noticed by Cobuilders working on the Cobuild 7.3
    million word corpus in about 1983. Of course, 'total' is a fairly
    dramatic example, but other less dramatic cases abound, e.g. the
    "delexical verbs" (known in America as "light verbs"). Ask people to
    make up examples for common uses of "take" and very few of them will
    think of [[Duration]]:

    4. How long will it take?

    5. It only took a few minutes.

    Interestingly, the phenomenon is occasionally denied by some
    theoretical linguists and other intelligent people, corpus evidence to
    the contrary notwithstanding. The opening shot is usually "Your corpus
    is not representative" (?!). Why do they do this? Surely it cannot be
    as simple as wishing to preserve introspection as a research technique?

    Patrick


