Re: Corpora: when does a subcorpus become a corpus

From: Rykov V.V. (rykov@narod.ru)
Date: Thu Jan 03 2002 - 11:30:35 MET


      I'd like to join in and strongly support any discussion of the corpus representativeness issue. Unless this property of a corpus is known with certainty, the results of any corpus-based linguistic research are either false or uncertain.

    ---------------------

    >Here is a short citation from Jennifer Pearson's "Terms in Context"
    >(Amsterdam 1998), p. 45:
    >
    >--
    >Sinclair, who states that corpora can be divided into subcorpora, and that
    >corpora and subcorpora can be divided into components, defines a subcorpus
    >as having "all the properties of a corpus but happens to be part of a
    >larger corpus" (1994a:4). Thus, a subcorpus must have all the properties of
    >a larger corpus. We understand this to mean that it is representative of
    >the larger corpus. A component, on the other hand, according to Sinclair,
    >illustrates a particular type of language and is selected "according to a
    >set of linguistic criteria that serve to characterize its linguistic
    >homogeneity" (Sinclair 1994a:4). It differs from a subcorpus in that it is
    >not intended to be representative of the corpus from which it is drawn and
    >is therefore not necessarily an adequate sample of a language.
    >--
    >
    >I did not go back to Sinclair ("Corpus Typology: A Framework for
    >Classification", EAGLES 1994), but according to Pearson, "a subcorpus must
    >have all the properties of a larger corpus", thus being representative of
    >the larger corpus. How this can be achieved is another question, although
    >it is obviously safer to state that a subcorpus is representative of the
    >larger corpus than to argue that the larger corpus (and, consequently, the
    >subcorpus) is representative of a language (or genre etc.). Anyway, using
    >the terms defined above (without intention to agree fully with Pearson),
    >the set of EAP texts detached from the BNC would probably be called a
    >"component" rather than a "subcorpus". Personally, I would prefer to call
    >ANY corpus detached from another corpus a "subcorpus", regardless of its
    >content or composition. Whatever a set of texts is called, the question of
    >representativeness remains. Here I agree with Ute Roemer, who wrote: "The
    >important question in this context is 'What do you want to do with the
    >(sub)corpus?'"
    >
    >sincerely,
    >Sampo
    >
    >Ps. Please regard this as a note from a person who tends to consider the
    >notion of "representative of a language" as an oxymoron, a "mission
    >impossible".
    >
    >
    >
    >( : ============================================= : )
    >
    >Sampo Nevalainen, M.A.
    >Researcher
    >University of Joensuu
    >Savonlinna School of Translation Studies
    >P.O.Box 48
    >FIN-57101 Savonlinna
    >FINLAND
    >
    >tel +358-15-511 70 (operator)
    > +358-15-511 7704
    >fax +358-15-515 096
    >email samponev@cc.joensuu.fi
    >http://www.joensuu.fi/slnkvl/
    >
    >

    -- 
    Vladimir Rykov, PhD in Comp Linguistics, 
     MOSCOW
    http://rykov.narod.ru/
    Engl. http://www.blkbox.com/~gigawatt/rykov.html
    Tel +7-903-749-19-99
    


