Re: [Corpora-List] Re: size of reference corpus

From: Mike Scott (mike@lexically.net)
Date: Fri Jun 13 2003 - 16:40:01 MET DST


    Lam Yuen Wing, Peter wrote:

    >I'm working on my MA degree project using WordSmith Tools to analyse a
    >specialised corpus. Because of the limited availability of a reference
    >corpus, I can only use the Brown corpus as my reference corpus, which is
    >just half the size of my specialised corpus.
    >
    >Could anyone advise me on the implications of having a reference corpus
    >half the size of the specialised corpus in a corpus analysis using
    >WordSmith Tools.
    If the reason you need the reference corpus is for processing key words
    in WordSmith, you will usually need one that's *bigger* than your
    specialised corpus. WS checks to see which is bigger before doing the
    KeyWords procedure. But it's a bit more complicated than that...

    1. The KeyWords procedure was originally designed to study texts, not
    genres or languages. All the same, it can be used for collections of texts
    and will still try to locate lexical items whose frequency is unusual. But
    there will certainly be statistical implications of a non-straightforward
    nature (OK, there are in almost anything, but especially with such odd
    items to study as words, which do not distribute themselves at all
    "normally"). So my advice is -- go ahead, but think of it as a method of
    finding out which words may well repay further investigation. (There is a
    rough sketch of the sort of statistic involved below, after point 4.)

    2. You mightn't need the actual text for your reference corpus but only a
    word-list based on that corpus. You can download a full word-list of BNC
    written (based on 90 million words) and BNC spoken (10 million) from my
    website. You can also download a word-list based on nearly 100 million
    words of the UK newspaper The Guardian (1990-94, as I recall), which
    contains all items occurring at least twice. All these are word-lists in
    WordSmith 3 format. (I plan to re-do them in WS4 format, but in any case
    WS4 comes with a conversion tool.) There is also a second sketch below,
    after point 4, of how you might roll your own simple frequency list.

    3. There are lots of modes of comparison. You can of course study
    individual texts or smaller sets of texts and compare them with Brown one
    by one. You can also compare individual texts with the set of all texts in
    your specialised corpus. Which you choose depends, I think, on your
    research questions (what you want to find out about the specialised
    corpus).

    4. The easiest way of thinking about this (to me anyway) is by analogy. If
    you want to find out the characteristics of the mouse-mat in front of you,
    you might compare it against a whole lot of other mouse-mats (discovering
    it's much brighter, say) or against a whole lot of computer stuff in front
    of you (it's not beige), or against all the objects in your room (it's much
    flatter than most).
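
    To give a rough idea of the sort of statistic involved in point 1, here is
    a little sketch in Python (not WordSmith's own code; the function name and
    the handling of zero counts are just illustrative) of a Dunning-style
    log-likelihood keyness calculation, comparing one word's frequency in a
    study corpus with its frequency in a reference corpus:

    import math

    def log_likelihood(freq_study, size_study, freq_ref, size_ref):
        """How surprising is one word's frequency in the study corpus,
        given its frequency in the reference corpus?"""
        total = size_study + size_ref
        combined = freq_study + freq_ref
        expected_study = size_study * combined / total
        expected_ref = size_ref * combined / total
        ll = 0.0
        if freq_study > 0:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref > 0:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # e.g. a word occurring 120 times in a 50,000-word specialised corpus
    # versus 30 times in a 1,000,000-word reference corpus:
    print(log_likelihood(120, 50000, 30, 1000000))

    The larger the value, the more the word's frequency departs from what the
    reference corpus would lead you to expect; it does not tell you why, which
    is one reason to treat the output as a pointer to words worth looking at
    more closely.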
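
    And for point 2, if you do have plain text files and want to roll your own
    simple frequency list to experiment with (this is just a toy, not
    WordSmith 3 word-list format, and the crude tokenisation and folder name
    are assumptions of mine), something along these lines would do:

    import re
    from collections import Counter
    from pathlib import Path

    def word_list(folder):
        """Very crude frequency list: lower-cased alphabetic tokens
        from every .txt file in a folder."""
        counts = Counter()
        for path in Path(folder).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            counts.update(re.findall(r"[a-z']+", text.lower()))
        return counts

    # counts = word_list("reference_corpus")   # folder name is hypothetical
    # for word, freq in counts.most_common(20):
    #     print(word, freq)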

    Hope this helps.

    Best wishes -- Mike

    Mike Scott

    Applied English Language Studies Unit
    University of Liverpool
    Liverpool L69 3BX, UK.

    Mike.Scott@liv.ac.uk
    http://www.lexically.net
    http://www.liv.ac.uk/~ms2928


