Re: Corpora: Corpus size

From: geoffrey.williams (geoffrey.williams@wanadoo.fr)
Date: Sun Jun 03 2001 - 12:35:43 MET DST

    Hi Jerome,

    Much work on specialised corpora has taken 500,000 words as about the size to
    be achieved. This was what Chris Gledhill, now at Univ St Andrews, used for his
    corpus on cancerology; mine in parasitic plant biology was the same, spread
    over 155 articles. Most people I have come into contact with in this field tend
    to go for this figure. It does seem to be an accepted size, although the
    reason is unknown, except perhaps that it is half the mythical Brown.

    Size depends very much on what you are doing; more important is homogeneity.
    This means justifying your corpus as being in some way representative of a
    defined research community.

    If you want to compare with other people working in France, you might be
    interested in the workshop I am organising with the Association Française de
    Linguistique Appliquée on 14 September in Lorient. Details are on the
    university website at:
    http://www.univ-ubs.fr/crellic/agenda.htm
    I shall shortly be posting the call for papers on this list.

    best

    Geoffrey
    *************************************************
    Geoffrey C. Williams, MSc, PhD
    Département Langues Etrangères Appliquées
    U.F.R. Lettres et Sciences Humaines
    4, rue Jean Zay
    B.P. 92116
    56321 LORIENT Cedex
    FRANCE

    tél : 33 (0) 2 97 87 29 68
    fax : 33 (0) 2 97 87 29 70

    email : geoffrey.williams@univ-ubs.fr

    http://www.univ-ubs.fr/crellic
    ***************************************************
    ----- Original Message -----
    From: jerome richalot <jerome.richalot@insa-lyon.fr>
    To: <corpora@lists.uib.no>
    Sent: Thursday, May 31, 2001 9:07 AM
    Subject: Corpora: Corpus size

    > Dear list members,
    >
    > For my PhD dissertation, I am compiling a corpus of 75 research
    > articles in the field of piezoelectricity. Although I haven't computed an
    > exact word count, a rough estimate is that it will be anywhere between 250,000
    > and 300,000 words, which I believe will be a good representation of the
    > targeted literature.
    > Would anyone know of other corpora of very technical or scientific English
    > and their size ?
    > Thanks in advance
    >
    >
    > -------------------------------------------------
    > Jerome Richalot
    > Institut National des Sciences Appliquées de Lyon
    > English/Electrical Engineering Coordinator
    >
    > Tel (33) 472 436 168
    > Fax (33) 472 438 519
    >
    >



    This archive was generated by hypermail 2b29 : Sun Jun 03 2001 - 13:01:16 MET DST