Re: Corpora: minimum size of corpus?

From: michael klotz (mklotz@phil.uni-erlangen.de)
Date: Thu Feb 10 2000 - 09:47:36 MET


    Hi Elaine,

    It seems to me that there is a crucial difference between studying a
    "dead language" like Latin or, I suppose, Biblical Aramaic and a "living
    language" like, say, modern English. The problem with living languages is
    of course that any corpus will be tiny compared to the overall
    linguistic output of the speakers of the language (for example in a
    single year). If we want to use corpus evidence to say something about
    the language as a whole, we are crucially concerned with the question of
    how confident we can be that our corpus data actually mirror the facts
    of language. This is a question for inferential statistics and the size
    of our sample (i.e. corpus) plays an important role in this. (Another
    important question would be how we proceeded in the sampling to achieve
    representativeness, e.g. by random or stratified sampling; cf. the work
    done by Clear and Biber on this question.)
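
    To make the role of sample size concrete, here is a minimal sketch of
    how the uncertainty around a relative frequency shrinks as the corpus
    grows. It assumes the tokens can be treated as a simple random sample
    (which real corpora only approximate) and uses the Wilson score
    interval; the function name and the counts are purely illustrative.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion hits/n.

    z=1.96 corresponds to roughly 95% confidence. Assumes the n tokens
    behave like a simple random sample from the language as a whole.
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same relative frequency (500 per million tokens)
# observed in corpora of three different sizes.
for hits, n in [(5, 10_000), (500, 1_000_000), (50_000, 100_000_000)]:
    low, high = wilson_interval(hits, n)
    print(f"n={n:>11,}  interval width = {high - low:.2e}")
```

    The observed rate is identical in all three cases; only the width of
    the interval, i.e. our confidence in generalising to the language as a
    whole, changes with the size of the corpus.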
    With dead languages there are two possible approaches: in one approach
    we would consider whatever evidence we have for the language as a sample
    of the way the language was spoken at the time. Of course, again it
    would be a tiny sample of the overall linguistic output of the speakers
    at the time and the problems from above would be relevant.
    However, in another sense whatever sources are left of a dead language
    can be operationally considered to BE the language, since nobody will
    ever produce new output in that language; i.e. there is a finite body of
    parole. In this case your sample (i.e. corpus) would be identical to the
    population it stands for (i.e. the "whole" language as we see it today),
    and we would not be concerned with inferential statistics, but simply
    with descriptive statistics. In that case the size of the corpus would be
    of no concern, I think.
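
    One way to see why size stops mattering in this second view is the
    finite population correction from survey sampling: as the sample
    approaches the whole population, the standard error of a proportion
    goes to zero. The sketch below assumes simple random sampling without
    replacement; the population size of 20,000 tokens is a made-up figure
    for illustration, not a count of any actual extant corpus.

```python
import math


def standard_error_fpc(p, n, population_size):
    """Standard error of a sample proportion p, with the finite
    population correction for a sample of n items drawn without
    replacement from population_size items."""
    if not 0 < n <= population_size:
        raise ValueError("need 0 < n <= population_size")
    if population_size == 1:
        return 0.0
    uncorrected = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return uncorrected * fpc


# Hypothetical finite body of text: 20,000 tokens in total.
TOTAL = 20_000
for n in (1_000, 10_000, 20_000):
    print(n, standard_error_fpc(0.1, n, TOTAL))
```

    When n equals the total, the correction factor is zero: the figure we
    report is simply a property of the population, not an estimate of it.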
    Which of the two approaches you take really depends on your research
    question. If you want to say something about Biblical Aramaic as found
    in the extant sources, the second approach seems appropriate. If you
    want to compare Biblical Aramaic to its modern descendants to say
    something about how the language has changed, the first approach seems
    more appropriate.

    yours
    Michael

    --
    Dr. Michael Klotz
    Institut f. Anglistik und Amerikanistik
    Universität Erlangen-Nürnberg
    Bismarckstraße 1
    91054 Erlangen
    Tel.: 9131-8522938
    email: mklotz@phil.uni-erlangen.de
    


