Corpora: Corpus Linguistics

From: ramesh@clg.bham.ac.uk
Date: Mon Apr 09 2001 - 02:48:45 MET DST

    Christopher Bader said:
    >> 2. In his more recent work, Chomsky distinguishes between
    >> the E-language (e.g. the set of all grammatical sentences)
    >> and the I-language (the human language faculty). Generative
    >> grammarians study the latter; corpus linguists, the former.
    >> The Chomsky Hierarchy and Chomsky Normal Form are
    >> of course concepts pertaining to the E-language, not to
    >> the I-language, which is why Chomsky no longer works
    >> in this area.

    Tony McEnery commented:
    > I see no problem with the above statement, other than to say that at
    > times Linguistics has excluded the study of E-language (in the sense of
    > attested language use as opposed to the concoction of invented examples)
    > as being part of linguistics proper.

    Ramesh comments:
    I suggest that Tony has not gone far enough in just modifying the definition
    of E-language ("in the sense of attested language use").
    Even to accept that "corpus linguists study E-language" (in any sense) is
    to describe apples in terms of oranges. Why use Chomskyan terms at all?

    a) If E-language is the set of *all* grammatical sentences, surely not even
    the most idealistic corpus linguist would claim that this is what they are
    currently studying, or even hoping to study in some distant future. The
    billion-word corpora that are just around the corner (and allowing for
    even further massive leaps in corpus size beyond that) will still represent
    only a small sample of any language community's total discourse. All that
    corpus linguists will ever be able to say is that certain linguistic
    features and patterns are well attested in a particular corpus, and others
    are rare or marked/constrained in some way. Models extrapolated from the
    well attested features are likely to be fairly robust, but will always
    leave some new input data categorially indeterminate between error,
    creativity, local usage, humour, and so on.

    b) If E-language is the set of all *grammatical* sentences, again I would
    doubt that corpus linguists would say that this is what they are studying.
    Grammaticality is only one criterion in the evaluation of a corpus instance.
    "Ungrammatical" instances are a valid part of a descriptive language model,
    and may be accounted for in various ways, for example by reference to
    real-time interactional factors, pragmatics, sociolinguistics, pathology,
    or other extra-linguistic factors. On the other hand, many grammatically
    possible, invented examples may remain unsubstantiated by corpus attestation
    for a very long time. My own research into countless invented dictionary
    examples shows that many are absent or extremely rare in the 418 million-word
    Bank of English corpus, for example. Texts produced for purposes other than
    linguistic exemplification or disputation exhibit features of a quality
    which John Sinclair has termed "naturalness", which appear to be beyond
    capture by grammaticality alone.

    Instead of adopting a Pythonesque "What has Chomsky ever done for
    linguistics/usable language technology/etc.?" stance, and then having
    to refute the various suggestions made, or having to concede gradually
    "apart from X, Y, and Z", corpus linguists
    would do well to continue to plough their own furrow. A bottom-up methodology
    will inevitably take longer to arrive at the higher reaches of observational
    adequacy, let alone to satisfy any other adequacies (if that is what our
    aim is...). And of course, language is dynamic, so not only the descriptive
    model, but even the notion of observational adequacy, will have to be
    dynamic as well...

    Ramesh Krishnamurthy
    COBUILD/University of Birmingham/Collins Dictionaries


