Re: [Corpora-List] Plea for help

From: geoffrey.williams (geoffrey.williams@wanadoo.fr)
Date: Tue Nov 19 2002 - 09:20:42 MET

    Hi,

    Although I have until now answered directly so as not to overfill inboxes, I
    thought some general considerations might be useful. For me, small corpora
    are in the half millions, but then I have texts that run to some 2000
    tokens. Letters being shorter, we have an entirely new ball game in which
    the question of small corpora and what they can show has to be considered.
    Ghadessy et al. is a good intro to small corpora, but we still need to
    discuss the phenomenon in terms of its role and limitations.

    For myself, I wouldn't term "business writing" "academic discourse", unless
    of course you are in the financial side of academia rather than on the
    poorly paid side as most of us are. You will need to decide what type of
    business you are interested in, as letters will vary enormously depending on
    the type of business and the purpose of the correspondence.

    Corpora are about breadth and size.

    Breadth gives the variety needed and means that you cannot just take the
    letters of one writer, as this would be studying author style and no
    generalisation would be possible. Breadth, however, is constrained by the
    need for homogeneity, which means that you will have to be clear as to what
    sorts of letters go into the corpus.

    Size allows generalisation, as it lets you make statistically substantiated
    observations. "Small corpora" are fine for studies of precise events studied
    stylistically, but limit what you can say lexically. With such a corpus it
    will be difficult to comment on collocation as your base will be small. Do
    not forget Zipf's law and diminishing returns. About half of your word
    types will be hapax legomena, occurring only once, and the lion's share of
    your tokens will be high frequency grammatical items. You will be able to
    find some repeated sequences (try using the "clusters" in WordSmith) and
    some "candidate" collocations from the speciality in which you are working.
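
    To get a feel for these proportions before building anything, a minimal
    Python sketch along the following lines may help. It is not WordSmith's
    own method: the crude tokenizer, the function names and the sample text
    are all illustrative assumptions, and "clusters" here simply means
    repeated n-word sequences.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and return alphabetic word tokens (crude on purpose)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def hapax_share(tokens):
    """Return (share of types that are hapax, share of tokens that are hapax)."""
    freq = Counter(tokens)
    hapax = sum(1 for count in freq.values() if count == 1)
    return hapax / len(freq), hapax / len(tokens)


def clusters(tokens, n=3, min_freq=2):
    """Return n-word sequences seen at least min_freq times, most frequent first."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_freq]


# Invented sample text, standing in for a real letter corpus.
sample = (
    "Thank you for your letter of 3 May. We are sorry to hear that the goods "
    "arrived damaged. Thank you for your patience. We are sorry for the delay."
)
tokens = tokenize(sample)
type_share, token_share = hapax_share(tokens)
print(f"{len(tokens)} tokens, {len(set(tokens))} types")
print(f"hapax: {type_share:.0%} of types, {token_share:.0%} of tokens")
for sequence, count in clusters(tokens):
    print(count, sequence)
```

    On a real corpus you would read the files in rather than use an inline
    string, and the repeated sequences are only candidates: with a small
    base, a sequence seen twice is a lead to follow up, not a finding.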

    For corpus building, you will need to negotiate access to mail. The easiest
    is obviously email, but the genre is different from that of snail mail.
    Snail mail will have to be scanned. You will also need to mark up your texts,
    as the sections, greetings etc. are of importance. For this you should use
    the TEI recommendations; TEI Lite is fine. Otherwise you just chuck the whole
    lot in the concordancer and see what comes out. The big problem will be
    finding a tame company that allows you access to its letters. They will
    certainly demand that the texts are rendered anonymous, and that there is
    only limited access to the corpus.
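
    As a rough illustration of what marking up and anonymising one letter
    could involve, here is a small Python sketch. The element names (opener,
    salute, closer, signed) are borrowed from the TEI, but the output is only
    a body fragment with no header, so check it against the TEI Lite
    documentation before relying on it. The regular expressions, the
    placeholder tags and the function names are hypothetical, and automatic
    anonymisation of this kind always needs checking by hand.

```python
import re
import xml.etree.ElementTree as ET

# Deliberately rough, hypothetical patterns; they will miss cases and
# may also catch things such as long order numbers.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE = re.compile(r"\+?\d[\d ()-]{6,}\d")


def anonymise(text, names=()):
    """Replace email addresses, phone-like numbers and listed names."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    for name in names:
        text = re.sub(re.escape(name), "[NAME]", text)
    return text


def letter_to_tei(greeting, paragraphs, closing, signature, names=()):
    """Wrap the parts of one letter in TEI-style elements, anonymised."""
    body = ET.Element("body")
    div = ET.SubElement(body, "div", type="letter")
    opener = ET.SubElement(div, "opener")
    ET.SubElement(opener, "salute").text = anonymise(greeting, names)
    for para in paragraphs:
        ET.SubElement(div, "p").text = anonymise(para, names)
    closer = ET.SubElement(div, "closer")
    ET.SubElement(closer, "salute").text = anonymise(closing, names)
    ET.SubElement(closer, "signed").text = anonymise(signature, names)
    return body


# Invented letter, for illustration only.
letter = letter_to_tei(
    greeting="Dear Mr Smith,",
    paragraphs=["Please call 020 7946 0000 or write to j.smith@example.com."],
    closing="Yours sincerely,",
    signature="A. Smith",
    names=["Smith"],
)
xml = ET.tostring(letter, encoding="unicode")
print(xml)
```

    Keeping greetings and closings in their own elements means you can later
    search or exclude those sections separately in the concordancer.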

    There are a number of obstacles to overcome in such small corpus work, but
    difficult corpora yield interesting results; they just take time.

    Good luck

    Geoffrey

    ***********************************************************

    Dr Geoffrey C. Williams,
    Département Langues Etrangères Appliquées
    U.F.R. Lettres et Sciences Humaines
    4, rue Jean Zay
    B.P. 92116
    56321 LORIENT Cedex
    FRANCE

    tél : 33 (0) 2 97 87 29 68
    fax : 33 (0) 2 97 87 29 70

    email : Geoffrey.Williams@univ-ubs.fr

    http://www.univ-ubs.fr/crellic

    ***************************************************
    ----- Original Message -----
    From: "Isa Abdul kaader" <I.Abdul-kaader@postgrad.umist.ac.uk>
    To: "geoffrey.williams" <geoffrey.williams@wanadoo.fr>
    Cc: <CORPORA@HD.UIB.NO>
    Sent: Sunday, November 17, 2002 1:54 AM
    Subject: Re: [Corpora-List] Plea for help

    > Hi Geoffrey,
    >
    > Many thanks for your suggestions, Geoffrey, and to all the others who gave
    > such wonderful advice with books etc. for me to get an understanding of
    > Corpus Linguistics.
    >
    > Having understood the essentials, I have decided to focus my attention on
    > academic writing, especially
    > 1. Business Writing (letters of complaint and adjustment) and/or
    > 2. Technical Report writing (could be hardware/software development in
    > higher technical institutions).
    >
    >
    > Apart from the critical studies in language teaching materials by Kennedy
    > (1987a), Holmes (1988), Mindt (1992) and especially Conrad (1996b), who
    > dealt with academic text and corpus-based techniques, are there any other
    > studies that look at specialized registers like business writing and
    > technical reports?
    >
    > I have yet to get full access to the BNC but would like to know if such
    > corpora (registers) could be obtained from the "Official document and
    > Academic Prose" listing in the BNC index, to compare with the linguistic
    > characteristics of the corpus that I wish to compile.
    >
    > I want to compile a small corpus (20,000 to 30,000) with regard to the
    > registers listed. I do NOT KNOW how to get started with this. I need all
    > the help I can get with this.
    >
    > In short, I am interested in compiling a corpus to study its
    > characteristics and applications, and to exploit it to design the best
    > possible materials and activities to help my students understand and
    > produce the registers listed above appropriately (helping students with
    > language that is actually used in these settings).
    >
    > I am keen to know of research on the appropriateness and potential of
    > corpora (including collocations) in Computer Assisted Language Learning at
    > higher institutions, especially to teach technical writing.
    >
    > Very many thanks in advance for all your help.
    >
    > Rafiq
    > Temasek Polytechnic
    > Singapore
    >


