Corpora: Announcement: BNC Index

From: David Lee (david_lee00@hotmail.com)
Date: Mon Apr 23 2001 - 15:56:06 MET DST


    Dear All,

    Every now and then, there are requests for corpora/subcorpora of
    specific genres of English. Recently, for example, there were requests
    for “academic EFL/ESL texts” and for “business English”. In the past,
    people have also asked for things like “medical language”,
    “e-mail discussions” or “children’s writing”.

    If it’s British English you’re after, there is perhaps no better place
    to start than with the British National Corpus
    (http://info.ox.ac.uk/bnc/),
    which contains all the above (sub)genres and more. However, up till now,
    it’s been very difficult for most end-users to quickly browse/search the
    BNC by genre or by a combination of criteria such as audience age,
    author age, domain of discourse, medium, audience level, etc. in order
    to find specific texts which fit specific research needs precisely. I
    suspect this difficulty is why many people never think of looking in the
    BNC for what they want.

    At TALC 2000 in Graz, I first announced the work that I had been doing
    on categorising all the BNC texts in terms of genre (e.g.
    Written_Academic_Prose_Social Sciences; Written_Imaginative_Poetry;
    Spoken_Consultations; Spoken_Courtroom_Discourse). I would now like to
    announce that this resource, called the “BNC Index”, is available for
    use (in spreadsheet format and other incarnations; see below). This
    genre classification of texts has also
    been incorporated into the headers of the 4,055 files of the new BNC
    World Edition. (The BNC Index itself, however, covers all the 4,124
    files of BNC Version 1.)

    ==========

    The BNC Index itself, in Microsoft Excel spreadsheet format, is
    available from:
    http://members.nbci.com/davidlee00/corpus_resources.htm

    If you don’t like spreadsheets or would like an easier interface, try
    the BNC Web Indexer (a front end to the BNC Index) at:
    http://www.comp.lancs.ac.uk/computing/research/ucrel/bncindex/

    (Access is not restricted, but please register your details on the
    welcome page and read the documentation & caveats before using.)

    Alternatively, you can download the stand-alone program written by
    Antonio Ortiz (who announced this recently on the list):
    http://webdeptos.uma.es/filifa/personal/amoreno/indexer

    The differences between the last two facilities:

    (1) The BNC *Web* Indexer and spreadsheet will be updated regularly,
    whenever errors are spotted and reported to me, whereas Ortiz’s
    stand-alone BNC Indexer will be updated as and when time permits. (At
    the time of writing, Ortiz’s program does not yet include my latest
    changes and is thus not up to date.)

    (2) At present the *Web* Indexer doesn’t allow selection of more than
    one option within each field/category (e.g. you cannot select more than
    one genre, more than one author age range, and so on). This limitation
    will (hopefully) be fixed soon. The *stand-alone* Indexer does allow
    multiple selections, as, of course, does the spreadsheet.

    So... choose according to your needs.

    ==========

    These resources will allow users to scan the BNC by genre (24 spoken and
    46 written genres) and a number of other criteria (time period, audience
    level, spontaneity, library keywords, bibliographical details, etc.).

    But note the following caveats:

    (1) The genre classifications were done under time constraints, so I
    would advise checking search results manually where possible.

    (2) Please read the documentation on the categorisation scheme before
    proceeding.

    The point of the BNC Index (or Indexers) is to enable researchers
    (esp. those not particularly computer-literate) to obtain lists of
    BNC file IDs for constructing their own sub-corpora for use with
    stand-alone PC concordancers such as WordSmith or MonoConc (which
    allow users to restrict queries to a specified list of files).

    The server-based SARA and BNCWeb programs can already do this, but they
    don’t allow pure part-of-speech-tag searches. People using stand-alone
    PC concordancers for this reason can now specify subcorpora at the file
    level by first using the BNC Index to obtain relevant file IDs.
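
    For anyone who prefers a script to a spreadsheet filter, the file-ID
    lookup described above might be sketched roughly as follows in Python.
    This assumes the BNC Index has been exported from Excel to CSV. The
    column names (file_id, genre, medium), the sample rows and the
    select_file_ids helper are all hypothetical stand-ins, not the actual
    layout of the Index; only the genre labels are taken from the examples
    earlier in this message.

```python
import csv
import io

# Hypothetical column names: the real BNC Index spreadsheet may label
# its columns differently, so adjust these before use.
ID_COL = "file_id"
GENRE_COL = "genre"


def select_file_ids(rows, genres, **other_criteria):
    """Return IDs of rows whose genre is one of `genres` and whose other
    columns equal every value given in `other_criteria`."""
    wanted = set(genres)  # allows more than one genre per query
    ids = []
    for row in rows:
        if row[GENRE_COL] not in wanted:
            continue
        if all(row.get(col) == val for col, val in other_criteria.items()):
            ids.append(row[ID_COL])
    return ids


# Made-up sample standing in for a CSV export of the spreadsheet.
# The file IDs and "medium" values are invented for illustration.
SAMPLE = """file_id,genre,medium
X01,Written_Imaginative_Poetry,book
X02,Spoken_Consultations,spoken
X03,Spoken_Courtroom_Discourse,spoken
X04,Written_Imaginative_Poetry,periodical
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Several genres at once.
spoken_ids = select_file_ids(
    rows, ["Spoken_Consultations", "Spoken_Courtroom_Discourse"]
)

# One genre combined with a second criterion.
poetry_books = select_file_ids(
    rows, ["Written_Imaginative_Poetry"], medium="book"
)

# One ID per line, a form that could be saved as a file list.
print("\n".join(spoken_ids))
```

    The resulting IDs could then be saved, one per line, as a file list
    for a concordancer. Check the real column headings in the spreadsheet
    before adapting this.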

    I hope some people will find this useful.

    David Lee

    -----------------------------------------------------------------
    David YW Lee
    Visiting Researcher
    Dept of Linguistics
    Lancaster University
    Lancaster LA1 4YT
    England, UK.

    Email: david_lee00@hotmail.com
    -----------------------------------------------------------------


