Corpora: Looking for "ing" - re Ute Römer's request for help

From: Christopher Tribble (ctribble@sri.lanka.net)
Date: Tue Feb 22 2000 - 16:43:57 MET

    Dear All,

    I've been moved by Ute Römer's request (see below) to put out a note to the
    list that I'd been meaning to send for some time.

    >I'm wondering whether one of you could possibly help me with a
    >research project on future expressions in English. I'm looking
    >for several structures in the spoken part of the British
    >National Corpus and I have some problems to find types like
    >"VERBing", "will be VERBing" and so on.
    >
    >Is there a possibility to find all present progressive forms
    >without doing a separate query on every single verb, i. e.
    >is it possible to insert some kind of "place marker"
    >indicating "base form of lexical verb"?

    What Ute is asking to do (noting also the various caveats about the
    reliability of coding in BNC) has been dead easy for a long time for anyone
    who *hasn't* been using SARA. My own work, and the work of a lot of other
    people, has been made possible by the fact that Mike Scott has been able to
    put together a program that can search selectively for POS tags in large
    corpora (WordSmith Tools - http://www.liv.ac.uk/~ms2928/homepage.html), and
    that the BNC can be unpacked from the distribution CDROM and worked with in
    a DOS environment as a set of plain text files. Others have already
    indicated that you can use WordSmith or MonoConc Pro to hunt for POS tags in
    the BNC -- the problem that many of us/you might have is actually
    transferring the corpus from the distribution disks to a DOS partition on a
    hard disk (it only takes about 1.5 GB of disk space for the WHOLE corpus).

    With this in mind, I've put together a brief account below of how you can
    do this using ordinary DOS tools. I know that the real corpus linguists
    out there could have whacked together a PERL script that would have sorted
    this out in 30 seconds on a 386 Linux box -- but I'm still at the bottom of
    that learning curve and want to ask the corpus questions that SARA can't
    help me with. So for now I offer this to any who might be interested. It
    works. If you set WordSmith TAGS to include "<", ">" as characters, and
    use the search string "*V*G>*ING" you can get results like the brief
    extract below.

    393 <w AJC>smarter <w PRP>by <w VVG>making <w AT0>a <w ORD
    394 e> <w CJC>And <w AV0>then <w VVG>moving <w AT0>the <w UN
    395 UN>, <w DT0>that<w VBZ>'s <w VVG>going <w TO0>to <w VVI>lo
    396 N1>spatula <w PRP>for <w AJ0-VVG>spreading <w NN1>glue<c P
    397 N1>spatula <w PRP>for <w AJ0-VVG>spreading <w NN1>glue<c P
    398 lly <w VHZ>has <w VBN>been <w VDG>doing <w NN2>things <w A
    399 <w VHZ>has <w VBN>been <w VDG>doing <w PNP>them<c PUN
    400 ou<w VHB>'ve <w VBN>been <w VVG>using <w AT0>a <w NN1>
    401 ve <w PNP>you <w VBN>been <w VDG>doing <w AV0>wrong <w N
    402 ou<w VHB>'ve <w VBN>been <w VVG>using <w AT0>a <w NN1>

    Useful?
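
    For anyone working outside WordSmith, the same kind of search can be
    sketched in Python. This is only an illustration, assuming the tagging
    format shown in the extract above (<w TAG>word) and the -ing verb tags
    VVG, VBG, VDG and VHG, alone or inside an ambiguity tag such as AJ0-VVG.
    The names ING_FORM and find_ing_forms are invented for the example, and
    the pattern keys on the tag rather than on the "ing" ending.

```python
import re

# A <w ...> element whose POS code is, or includes, one of the -ing verb tags
# (VVG, VBG, VDG, VHG), followed by the word itself. Ambiguity tags such as
# AJ0-VVG or VVG-NN1 are accepted as well.
ING_FORM = re.compile(r"<w ((?:[A-Z0-9]{3}-)?V[VBDH]G(?:-[A-Z0-9]{3})?)>([^<\s]+)")


def find_ing_forms(text):
    """Return (tag, word) pairs for every -ing verb form in BNC-tagged text."""
    return [(m.group(1), m.group(2)) for m in ING_FORM.finditer(text)]


# A short made-up line in the same format as the concordance extract.
sample = (
    "<w VHZ>has <w VBN>been <w VDG>doing <w NN2>things "
    "<w PRP>for <w AJ0-VVG>spreading <w NN1>glue"
)

for tag, word in find_ing_forms(sample):
    print(tag, word)
# prints:
# VDG doing
# AJ0-VVG spreading
```

    Run over whole corpus files, a function like this returns every tagged
    -ing form without a separate query for each verb, subject to the same
    caveats about the reliability of the tagging.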

    ---------------------
    Using the BNC on a PC
    Some guidelines for anyone with an interest in using the BNC on a Windows
    computer

    The BNC

    The British National Corpus is a potentially important resource for
    teachers and researchers, but it was designed with the needs of a narrower
    community in mind than the one that I belong to, and at the moment remains
    intimidating and impenetrable for most PC users (the PC being the main IT
    resource for the rest of us these days). Problems arise at a number of
    levels - viz:

    1. unpacking the files from the installation CDROM
    2. identifying which files might be useful
    3. working with the corpus

    This note deals with the first two of these points. It is a bodger's guide
    to setting up the BNC for people with a reasonable understanding of how to
    use a Windows PC. It is not a definitive guide to all the things you can
    do with BNC once you've got it set up.

    Unpacking

    The BNC comes on 3 CDs (at a cost of around £250 for all three - it may be
    possible to negotiate the purchase of CD 1 only - I should have done this,
    but did not realise you only need disk 1 to use BNC on a PC!). These CDs
    contain the BNC data files and a whole host of other applications -
    especially SARA, the search engine specifically designed for the corpus.
    All of these appear to require Unix, or a Unix-like operating system such
    as Linux for the PC, in order to be installed. My assumption is that the
    rest of us do not want the learning curve involved in setting up Unix on
    our machines,
    and that more teachers will use BNC if they can use it on their work PCs.

    · requirements

    The good news is that the corpus data can be unpacked and transferred to a
    PC's hard disk without too much trouble. The pre-requirements are:

    · a PC with Windows 95 or better
    · around 6 gigabytes (GB) of free disk space
    · WinZip 32 (a shareware application that is widely available - if you
    haven't got it you can download it from http://www.winzip.com/).

    The steps involved in creating a copy of the corpus should be easy, but you
    might meet a couple of problems because of an error in the creation of the
    original CDROM - this may have been fixed in later versions, but users
    should be aware of the potential glitch.

    · unpacking

    The procedure is as follows:

    · Step 1 - identify the BNC Files
    Put Disk 1 of the three disk set in your PC. Windows Explorer will show
    you that this contains the following folders:

    A.TGZ 51,845,713
    B.TGZ 22,991,840
    C.TGZ 65,652,987
    D.TGZ 321,820
    DOC.TGZ 3,126,547
    E.TGZ 33,186,120
    F.TGZ 46,629,619
    G.TGZ 40,933,113
    H.TGZ 95,877,886
    J.TGZ 27,978,331
    K.TGZ 61,981,017
    SARA.TGZ 394,824
    SGML.TGZ 125,590

    The folders that interest us are A.TGZ through to K.TGZ and DOC.TGZ which
    contains the BNC users' guide. The TGZ extension indicates that the
    folders are compressed. The good news is that WinZip can uncompress these
    files and transfer them from the CDROM on to your PC's hard disk. The bad
    news is that folders A, B and C contain compressed folders which have
    themselves been incorrectly named, and therefore present a problem for
    unpacking.

    · Step 2 - unpack and rename the contents of Folders A,B & C
    With WinZip installed on your PC, when you double click on Folder A you
    will see that it contains a folder called "a". This should be called
    "a.tar" -
    another file compression format which WinZip can also unpack. So to make
    this folder useable it has to be un-zipped to the hard drive and then
    renamed. Do this in the following way:

    - double click on Folder A
    - in the WinZip window select "a" (this is the only folder)
    - select "Extract" from the WinZip menu
    - choose an appropriate folder on your hard disk drive to which you want to
    send the folder (I have a folder called C:\ZIP_TEMP on my PC that I reserve
    for this sort of activity). Extract the folder to this location. These are
    BIG files ("A" is over 50 MB), so if it takes a few minutes, don't panic!
    - using Windows Explorer open eg C:\ZIP_TEMP and right click on the file
    you have transferred (you will see that it is now much bigger). Select
    "Rename" from the menu and add the extension .TAR to the file name.
    - You will now be able to uncompress this to an appropriate directory - eg
    C:\BNC - by (1) double clicking on the folder, (2) choosing "select all"
    from the "Actions" menu, and then (3) selecting "Extract" and sending the
    files to eg C:\BNC.
    - Repeat these steps for folders B.TGZ and C.TGZ. This took me some time
    to work out, but once you have understood the problem, it's easy to fix.

    · Step 3 - unpacking Folders D - K
    - In Windows Explorer, double click on an appropriate folder on the BNC
    CDROM (eg "D")
    - When WinZip asks you "Should WinZip decompress it to a temporary folder
    and open it?", select "No". A second WinZip window will open containing a
    single folder .TAR.
    - Double click on this folder and get a list of the folders contained in
    the .TAR folder.
    - In WinZip, select all these folders through Actions, Select All.
    - Extract the folders to an appropriate directory (eg C:\BNC)
    - Repeat this process for the remaining folders (ie E, F, G, H, J, K)

    You will now have a full version of the text files in BNC on your hard disk
    drive.

    · You can use the same procedure to unpack the BNC documentation. Select
    DOC.TGZ and decompress it to an appropriate folder on your PC (eg
    C:\BNCDOC). The information on the BNC documentation is invaluable as it
    tells you what is contained in each file of BNC text.
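
    For readers who have Python to hand, the unpacking steps above can also
    be sketched as a short script. This is a hedged illustration, not
    something tested against the BNC disks: it assumes the .TGZ files are
    ordinary gzip-compressed tar archives, that the CDROM is drive D:, and
    that C:/BNC and C:/BNCDOC are the target folders. The function name
    unpack_archives is invented for the example. Python's tarfile module
    reads the compressed stream directly, so the misnamed inner file in A, B
    and C should not need renaming, but that is an assumption.

```python
import tarfile
from pathlib import Path


def unpack_archives(cd_root, dest, names):
    """Extract each gzip-compressed tar archive NAME.TGZ under cd_root into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    unpacked = []
    for name in names:
        archive = Path(cd_root) / f"{name}.TGZ"
        # "r:gz" decompresses and reads the tar contents in one pass,
        # whatever file name was recorded inside the gzip wrapper.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        unpacked.append(archive.name)
    return unpacked


# Hypothetical usage (drive letter and folder names are assumptions):
# unpack_archives("D:/", "C:/BNC", list("ABCDEFGHJK"))
# unpack_archives("D:/", "C:/BNCDOC", ["DOC"])
```

    The list "ABCDEFGHJK" follows the archive names on Disk 1 as listed
    above, where there is no I.TGZ.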

    Identifying files which might be useful

    Each file in the BNC has what is called "header" information which
    specifies exactly what is in the file, where it came from and a whole host
    of genre and contextual details. You can use this to divide the corpus
    into subsets. As an example, I will demonstrate how to separate the 10
    million word spoken corpus from the 90 million word written set. This is a
    useful way of making an initial division of the corpus into more useable
    lumps, and can be done with the "Find" tool in Windows Explorer. Once you
    have become more confident in breaking the corpus into smaller units, you
    will be able to create subsets of the corpus as you require them (eg
    fiction, business oriented texts, journalism etc)

    · Step 1 - identify all spoken texts in the BNC
    - Select your BNC folder (eg C:\BNC)
    - Open Windows Explorer, select Tools, Find and then search for all files
    containing the header information "<stext". This will identify all the
    files in the BNC which contain spoken data.

    · Step 2 - create a spoken corpus from BNC
    - Now that you know which files contain spoken corpus data, you can use
    Windows Explorer to MOVE these files to a new folder called eg C:\BNC_SP.
    - You now have two sets of data to work with - 90 million words of written
    text and 10 million words of spoken.

    You can use the same procedure for creating smaller domain or text type
    specific subcorpora (eg corpus files containing the string "wridom6" are
    classed as Written: Domain: Informative: Commerce and Finance). These have
    many practical applications for language teaching and learning, and are
    easier and quicker to handle than the full corpus (though they can be
    combined in future searches if you want to draw on the full corpus).
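
    The find-and-move procedure above can be sketched in Python as well. This
    is a minimal illustration under stated assumptions: the corpus files sit
    under one folder (C:/BNC in the examples here), the target folder lies
    outside it, and a file belongs to the subset if the marker string (for
    example "<stext" or "wridom6") appears anywhere in it. The function name
    move_matching_files is invented for the example.

```python
import shutil
from pathlib import Path


def move_matching_files(source_dir, target_dir, marker="<stext"):
    """Move every file under source_dir whose contents include marker into target_dir."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    # Build the full file list first, so that moving files does not disturb the scan.
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        # latin-1 can decode any byte sequence, so unusual characters cannot stop the read.
        text = path.read_text(encoding="latin-1")
        if marker in text:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved


# Hypothetical usage (folder names are assumptions):
# move_matching_files("C:/BNC", "C:/BNC_SP")                      # spoken texts
# move_matching_files("C:/BNC", "C:/BNC_COMMERCE", "wridom6")     # commerce and finance
```

    To keep the full corpus intact, shutil.copy2 could be used in place of
    shutil.move, at the cost of extra disk space.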

    [For a (much better) alternative, contact Dave Lee
    (david_lee00@hotmail.com) who has put together a really neat Excel
    spreadsheet which tells you which files contain what categories of text --
    and also makes a better fist of genre than the original BNC categories.]

    Working with the corpus

    I am not going to expand on this here. I would refer readers to my own
    book (Tribble C & G Jones (1997) Concordances in the Classroom: a
    resource book for teachers Athelstan Houston TX) for an overview of
    approaches to the use of corpus data in language education. As far as
    software tools are concerned, apart from WinZip, I would recommend you get
    hold of either WordSmith Tools (Scott M 1996 WordSmith Tools Oxford
    University Press Oxford) or MonoConc Pro (Barlow M 1998 MonoConc Pro
    Athelstan Houston TX).

    Best

    Chris Tribble

    --
    		Dr Christopher Tribble
    Sri Lanka	21 Wijerama Mawatha, Colombo 7
    		TEL  +94 75 332 309
    UK   		122, Queen Alexandra Mansions, Judd Street
    		London WC1 H 9DQ
    		TEL +44 171 833 4271
    UK Mailing	c/o FCO (Colombo)
    		The British Council (Sri Lanka)
    		King Charles Street, London SW1A 2AH
    E-mail		ctribble@sri.lanka.net
    Home Page	http://ourworld.compuserve.com/homepages/Christopher_Tribble
    


