Corpora: corpus/corpora and Harris/Chomsky

From: ramesh@clg.bham.ac.uk
Date: Sat Apr 07 2001 - 02:49:28 MET DST

    Two threads seem to have come together, but they may mirror each other
    in a way, so I will address both.

    1. "a corpora":

    1.1 The objection to the "misuse" of the plural form in a singular
    context in some emails (often by, I think it's fair to say, non-expert
    speakers of English, or people just starting out in corpus linguistics)
    diverted attention from the *content* of their emails to the *form*.
    Which is a shame, because several of the emails were pleas for help,
    to which I have not seen many replies...

    1.2 Such "misuse" may also be due to carelessness rather than ignorance.
    How carefully do we all edit our emails? Some obviously more than others.
    If we spend too long editing, we lose the spontaneity; if we don't edit
    at all, we make typos, overlook errors, etc. Diieferent strokes....
    (see what I mean?).

    1.3 Such "misuse" may be or may become evidence for language change.
    For example, do you say "the data is" or "the data are"?
    Historically and etymologically, the latter
    is "correct" and the former is wrong (in Latin, datum=singular,
    data=plural), yet current usage seems to be roughly equal.

    1.4 The evidence from the Collins COBUILD Bank of English corpus at
    Birmingham University, consisting of 418 million words of 1990s data, is:
    "the data is" = 147 examples, attested in all subcorpora, and fairly
    evenly spread
    "the data are" = 159 examples, but attested only in some subcorpora, and
    markedly frequent in US subcorpora and more formal BR subcorpora
    (US academic textbooks, US newspapers, New Scientist, Economist)
    [***full details can be supplied to any interested parties]
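
    For anyone who wants to try this kind of count on their own texts, here
    is a minimal Python sketch. It is not how the Bank of English was
    queried; the function name and the sample text are invented for
    illustration, and it only matches the exact three-word sequence.

```python
import re
from collections import Counter

def count_agreement(text, noun):
    """Count 'the <noun> is' versus 'the <noun> are', ignoring case."""
    pattern = re.compile(
        rf"\bthe\s+{re.escape(noun)}\s+(is|are)\b", re.IGNORECASE
    )
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

# Invented sample text, for illustration only (not corpus data).
sample = (
    "The data is stored on tape. We think the data are reliable. "
    "Once the data is cleaned, the results follow."
)

result = count_agreement(sample, "data")
print(result["is"], result["are"])  # prints: 2 1
```

    A real study would also need to handle intervening words ("the data
    we collected are...") and to record which subcorpus each hit came from.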

    1.5 This reflects the fact that US academic writing adheres slightly more
    to traditional uses, and that, as formerly technical terms move into
    more mainstream use, their use often changes (many lay people won't know
    about Latin declensions...)

    1.6 The evidence for "bacteria", however, is:
    "the bacteria is" = 8
    "the bacteria are" = 41
    which shows that the same process does not necessarily operate on all
    Latinate forms to the same extent or at the same rate.
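
    To make the comparison between 1.4 and 1.6 concrete, the counts can be
    turned into shares of singular agreement. A small Python sketch using
    only the figures quoted above; the function name is just illustrative.

```python
# Counts quoted in 1.4 and 1.6 (Bank of English, 1990s data).
counts = {
    "data": {"is": 147, "are": 159},
    "bacteria": {"is": 8, "are": 41},
}

def singular_share(is_count, are_count):
    """Fraction of 'the X is/are' examples that use singular 'is'."""
    return is_count / (is_count + are_count)

shares = {
    noun: singular_share(c["is"], c["are"]) for noun, c in counts.items()
}

for noun, share in shares.items():
    total = sum(counts[noun].values())
    print(f"the {noun} is: {share:.1%} of {total} examples")
# prints:
# the data is: 48.0% of 306 examples
# the bacteria is: 16.3% of 49 examples
```

    So on these figures "data" takes singular agreement in roughly half of
    the examples, "bacteria" in roughly one in six.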

    1.7 BTW, there were only two examples of "a data", both from
    British spontaneous spoken data:

    A: There's one piece of data I don't dispute that
    B: Yeah. Right. But a data is not a fact.

    C: ... my point of view is so maybe er you ... calculated and maybe you
    ... came across such a data which er give us possibility to compare the
    losses...

    1.8 Which reminds me that emails are a curious halfway-house between
    spoken and written modes: spontaneous emails are closer to spoken,
    and therefore more likely to contain more idiosyncratic uses.

    1.9 "corpi" derives the plural form from the wrong Latin declensional
    paradigm. But hey, we are participating in the creation of the English
    of the future here, not correcting people's knowledge or ignorance of
    Latin. If in 100 years' time, consensus use prefers "corpi", what's
    the problem?

    1.10 Finally, do we correct everyone who says "a graffiti", because
    we happen to know that "graffiti" is plural in the original Italian,
    and the singular should be "graffito", or should I correct everyone who
    mispronounces my name (most non-Indians are actually incapable of
    producing the correct pronunciation even when coached)?

    2. The Chomsky/MIT/generative debate.

    2.1 This seems to have provoked an equally strident debate, which
    also reflects underlying "right/wrong" beliefs.

    2.2 However, in between the polemic I have discerned several nuclei
    (is that a "correct" usage?) of useful historical information,
    interesting perspectives on the relationships between different
    branches and traditions of linguistics, and quite a lot of humour!
    :--)

    2.3 IMHO, this is one of the best threads I have seen for a long time on
    this list. I think people should have a chance to mouth off about their
    pet hates, niggles, etc, as it stimulates others to think hard about
    what the underlying "truths" are, which analogies work and which don't,
    and maybe even to go away and read some of the literature that others
    have recommended...

    2.4 For a relatively young field (at least under the name of Corpus
    Linguistics), I think this is very healthy. It has certainly given
    me much food for thought (for example, "in what way does corpus linguistics
    currently lack or ignore 'explanatory adequacy'?"), and added to the
    backlog of "stuff I must get round to when I retire"; for both of which
    I am truly thankful!

    2.5 BTW, the notion of "transformational" and "generative" grammar
    did not begin with Chomsky (although maybe the English terms did),
    nor even in the current century. Panini's
    grammar of Sanskrit (which in turn borrowed from many previous scholars'
    work) embodies both principles, although it actually starts with a
    "functional" basis. Very crudely: (a) "What do you want to express?", then
    (b) "start with this form", and (c) depending on your specific contextual
    requirements and preferences, which you process your way through rather
    like a multiple-choice questionnaire, "execute these transformations on
    that form". The entire description of the grammatical system strongly
    resembles a computer program or flowchart, with structural features
    such as "if...then", "go to", "do this repeatedly until...", etc.
    To quote Goodness Gracious Me: "Grammar? Indian!".
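
    The flowchart analogy can be sketched as a tiny rule-driven program.
    This is a toy over English forms, not a rendering of any actual rule
    in Panini's grammar: the table of base forms, the feature names and
    the function name are all invented for illustration.

```python
# Toy illustration only: invented rules over English, not Panini's rules.
# (a) what do you want to express -> (b) start with this form ->
# (c) work through your choices, executing transformations on that form.

BASE_FORMS = {"walking": "walk", "trying": "try"}  # meaning -> starting form

def derive(meaning, features):
    form = BASE_FORMS[meaning]        # (b) "start with this form"
    pending = list(features)          # choices still to be processed
    while pending:                    # "do this repeatedly until..."
        feature = pending.pop(0)
        if feature == "past":         # "if...then"
            if form.endswith("y"):
                form = form[:-1] + "ied"
            else:
                form = form + "ed"
        elif feature == "question":
            form = form + "?"
    return form

print(derive("walking", ["past"]))             # prints: walked
print(derive("trying", ["past", "question"]))  # prints: tried?
```

    The point is only the shape of the procedure: a starting form, a list
    of contextual choices, and conditional steps applied until none remain.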

    :--)

    Ramesh Krishnamurthy
    Honorary Research Fellow, Birmingham University
    Honorary Research Fellow, Wolverhampton University
    Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries


