Re: [Corpora-List] size of reference corpus - resent

From: Lam Yuen Wing, Peter (ywlam@kcrc.com)
Date: Tue Jun 17 2003 - 11:13:37 MET DST


    ** This e-mail has been bounced and is now resent from a registered address.
    Sorry for any duplicates. **

    Thanks very much, Mike.

    Your advice does help. I'll go ahead as you advised, i.e., keep my
    specialised corpus at its current size, which is double that of the Brown.
    The two resulting key word lists look almost the same whether I use the
    Brown corpus or mine as the reference corpus: the only difference is that
    the order of the key words in one is the reverse of the other, apart from
    some possible statistical implications unknown to me.
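
    A minimal sketch of why the two lists can mirror each other, assuming a
    signed log-likelihood keyness score of the kind commonly used for key-word
    comparison. This is not WordSmith's own code: the function names and the
    toy frequency lists are hypothetical, and thresholds such as a minimum
    frequency or a p-value cut-off are ignored. It works from word-lists alone,
    with the study list twice the size of the reference list.

```python
import math
from collections import Counter


def signed_log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness; negative when the word is under-used in the study corpus."""
    total = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    ll *= 2
    overused = freq_study / size_study >= freq_ref / size_ref
    return ll if overused else -ll


def keyness_table(study, reference):
    """Rank every word in either word-list from most positive to most negative keyness."""
    size_study = sum(study.values())
    size_ref = sum(reference.values())
    scores = {
        word: signed_log_likelihood(study[word], size_study, reference[word], size_ref)
        for word in set(study) | set(reference)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy word-lists (2,000 tokens against 1,000 tokens).
specialised = Counter({"the": 1200, "of": 600, "train": 150, "signal": 50})
general = Counter({"the": 620, "of": 330, "train": 5, "said": 45})

forward = keyness_table(specialised, general)   # specialised list as the study corpus
backward = keyness_table(general, specialised)  # roles of the two lists swapped

for word, score in forward:
    print(f"{word:8s} {score:8.2f}")
```

    Under these assumptions, swapping the roles of the two lists leaves the
    size of each score unchanged and only flips its sign, so the ranking comes
    out in reverse order. Any cut-offs the real tool applies could still make
    the two lists differ in which words appear.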

    Thanks for directing me to the locations where I can download the word-lists
    of the BNC and the Guardian. But as I'll also do collocation and colligation
    analyses on the key words, I can only use these word-lists as a reference,
    particularly when I need to compare the word-list of my corpus with those
    derived from corpora of contemporary English.

    In fact, I'll compare "my mouse-mat against all the objects in your room"
    and probably also "against a whole lot of other mouse-mats."

    By the way, could you direct me to any websites where free lemmatising
    and/or tagging software is available?

    Best wishes,
    Peter

    ----- Original Message -----
    From: "Mike Scott" <mike@lexically.net>
    To: "Lam Yuen Wing, Peter" <ywlam@kcrc.com>; <corpora@hd.uib.no>
    Cc: <peterlam@onebb.net>
    Sent: Friday, June 13, 2003 10:40 PM
    Subject: Re: [Corpora-List] Re: size of reference corpus

    > Lam Yuen Wing, Peter wrote:
    >
    > >I'm working on my MA degree project using WordSmith Tools to analyse a
    > >specialised corpus. Because of the limited availability of a reference
    > >corpus, I can only use the Brown corpus as my reference corpus, which is
    > >just half the size of my specialised corpus.
    > >
    > >Could anyone advise me on the implications of having a reference corpus
    > >half the size of the specialised corpus in a corpus analysis using
    > >WordSmith Tools?
    > If the reason why you need the reference corpus is for processing key-words
    > in WordSmith, you will usually need one that's *bigger* than your
    > specialised corpus. WS checks to see which is bigger before doing the
    > key-words procedure. But it's a bit more complicated than that...
    >
    > 1. the KeyWords procedure was originally designed to study texts, not
    > genres or languages. All the same, it can be used for collections of texts
    > and still try to locate lexical items whose frequency is unusual. But there
    > will certainly be statistical implications of a non-straightforward nature
    > (OK, there are in almost anything but especially with such odd items to
    > study as words, which do not distribute themselves at all "normally"). So
    > my advice is -- go ahead, but think of it as a method of finding out which
    > words may well repay further investigation.
    >
    > 2. You mightn't need the actual text for your reference corpus but only a
    > word-list based on that corpus. You can download a full word-list of BNC
    > written (based on 90 million words) and BNC spoken (10 million) from my
    > website. You can also download a wordlist based on nearly 100 million words
    > of the UK newspaper The Guardian, 1990-94 as I recall, which contains all
    > items occurring at least twice. All these are word-lists in WordSmith 3
    > format. (I plan to do these again in WS4 format but in any case WS4 comes
    > with a conversion tool.)
    >
    > 3. There are lots of modes of comparison. You can of course study
    > individual texts or smaller sets of texts and compare them with Brown one
    > by one. You can compare individual texts with the set of all texts in your
    > specialised corpus. I think this depends on what your research questions
    > are (what you want to find out about the specialised corpus).
    >
    > 4. The easiest way of thinking about this (to me anyway) is by analogy. If
    > you want to find out the characteristics of the mouse-mat in front of you,
    > you might compare it against a whole lot of other mouse-mats (discovering
    > it's much brighter, say) or against a whole lot of computer stuff in front
    > of you (it's not beige), or against all the objects in your room (it's much
    > flatter than most).
    >
    > Hope this helps.
    >
    > Best wishes -- Mike
    >
    >
    > Mike Scott
    >
    > Applied English Language Studies Unit
    > University of Liverpool
    > Liverpool L69 3BX, UK.
    >
    > Mike.Scott@liv.ac.uk
    > http://www.lexically.net
    > http://www.liv.ac.uk/~ms2928
    >
    >
    >




    This archive was generated by hypermail 2b29 : Tue Jun 17 2003 - 13:26:33 MET DST