Corpora: Working with BNC data under Windows

Christopher Tribble (ctribble@lanka.ccom.lk)
Sat, 2 Jan 1999 10:08:31 +0530

Following Lou Burnard's helpful fix (+ thanks to Knut Hofland and Douglas
McCarthy for suggestions), I can now write a note for anyone who wants to
work with the BNC text files under Windows.

The data files are on CDROM BNC101 in 10 files (A.tgz through to K.tgz). To
unpack them for use with DOS or Windows software you need around 5 GB of free
hard disk space + a programme such as WinZip (http://www.winzip.com). Files
A, B and C appear to have been archived differently from the others, so two
different approaches are needed for unpacking.

FIRST
Create a set of appropriate subdirectories for holding the unpacked files
(eg D:\BNC\A, D:\BNC\B etc.)

THEN
Steps for unpacking A, B, and C are:
- Use WinZip to unpack each of A.tgz, B.tgz and C.tgz to a temporary
directory (eg C:\TEMP)
- You should now have three files simply called A, B and C (n.b. they have
no file suffix)
- Rename these files A.TAR, B.TAR and C.TAR
- Use WinZip to unpack each .TAR file to its appropriate directory (eg
A.TAR to D:\BNC\A)

FINALLY
Steps for unpacking D through to K are easier. Just double-click on eg
D.tgz. Under Windows 95/NT WinZip will tell you that it contains an archive
called D.tar and ask you if you want to unpack it to a temporary
directory. Click "yes" and then unpack the files to the appropriate
directory.
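
For anyone who would rather script the unpacking, both routes above come
down to the same two stages: strip the gzip layer to get a tar file, then
extract the tar file. What follows is only a minimal sketch in Python, not
what WinZip does internally. It assumes the .tgz files are ordinary
gzip-compressed tar archives, and the function name unpack_tgz and all paths
are my own illustrative choices.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_tgz(tgz_path, dest_dir, temp_dir):
    """Unpack a .tgz in two stages: gunzip to a .tar, then extract the .tar."""
    tgz_path, dest_dir, temp_dir = Path(tgz_path), Path(dest_dir), Path(temp_dir)
    # Create the target and temporary directories if they do not exist yet
    dest_dir.mkdir(parents=True, exist_ok=True)
    temp_dir.mkdir(parents=True, exist_ok=True)

    # Stage 1: remove the gzip layer and give the intermediate file an
    # explicit .tar suffix (the manual "rename A to A.TAR" step)
    tar_path = temp_dir / (tgz_path.stem + ".tar")
    with gzip.open(tgz_path, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Stage 2: extract the uncompressed tar archive into the target directory
    # (only do this with archives you trust, such as the ones on the CD)
    with tarfile.open(tar_path, "r:") as tar:
        tar.extractall(dest_dir)

    # Remove the intermediate .tar to free disk space
    tar_path.unlink()
    return dest_dir
```

A call might look like unpack_tgz(r"E:\A.tgz", r"D:\BNC\A", r"C:\TEMP"),
where E: stands for whatever drive letter your CDROM has. Remember that the
intermediate .tar needs temporary disk space on top of the unpacked files.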

You will end up with 10 directories holding the BNC text files (around 3.5
GB of data).

I've only just started playing with these, but there seems to be a lot of
potential for using WordSmith to work with subsets of the BNC. This is
possible because later versions of WordSmith Tools are becoming much more
SGML aware (I'm currently using 2.00.35 - 10/11/98). With this you can elect
to work only with eg Natural and Pure Science texts taken from published
books for an advanced audience:

eg select only texts where the header contains the strings <wridom3 wrimed1
wrilev3>
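
The same kind of selection can be roughed out in a script. This is a sketch
only and says nothing about how WordSmith does it. It assumes the
classification codes appear as plain strings near the top of each file, that
reading the first 20,000 bytes is enough to cover the header (an untested
guess), and that case does not matter. The names header_matches and
select_texts are made up for the example.

```python
from pathlib import Path

HEADER_BYTES = 20_000  # assumption: the header fits within this many bytes


def header_matches(path, codes, header_bytes=HEADER_BYTES):
    """Return True if every code string occurs in the first part of the file."""
    with open(path, "rb") as f:
        # latin-1 maps every byte to a character, so decoding cannot fail
        head = f.read(header_bytes).decode("latin-1").lower()
    return all(code.lower() in head for code in codes)


def select_texts(root, codes):
    """List all files under root whose header contains all the given codes."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and header_matches(p, codes)
    )
```

For the example above the call would be something like
select_texts(r"D:\BNC", ["wridom3", "wrimed1", "wrilev3"]).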

Another way of working which is proving useful is to use Windows Explorer
to construct sub-corpora using the header information - I've got three,
each containing PUBLISHED BOOK / SOCIAL SCIENCE texts, one for each of
Audience Levels 1, 2 and 3.
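
If sorting files by hand in Explorer gets tedious, copying them into one
directory per audience level could be scripted along these lines. Treat it
as a sketch under assumptions: the level codes wrilev1 and wrilev2 are
extrapolated from the wrilev3 string quoted above, the header is assumed to
sit in the first 20,000 bytes, and build_subcorpora is a hypothetical name.
The domain and medium codes are passed in by the caller, so check them
against the BNC documentation before relying on the result.

```python
import shutil
from pathlib import Path

# Assumption: audience levels are coded wrilev1..wrilev3 in the header
LEVEL_CODES = ("wrilev1", "wrilev2", "wrilev3")


def read_head(path, header_bytes=20_000):
    """Read the first part of a file, lower-cased, for simple string matching."""
    with open(path, "rb") as f:
        return f.read(header_bytes).decode("latin-1").lower()


def build_subcorpora(source_root, dest_root, base_codes):
    """Copy texts matching base_codes into one directory per audience level."""
    counts = {level: 0 for level in LEVEL_CODES}
    # Collect the file list first so that copies made below are never rescanned
    files = [p for p in Path(source_root).rglob("*") if p.is_file()]
    for path in files:
        head = read_head(path)
        if not all(code.lower() in head for code in base_codes):
            continue
        for level in LEVEL_CODES:
            if level in head:
                target = Path(dest_root) / level
                target.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target / path.name)
                counts[level] += 1
                break  # a text is filed under the first level code found
    return counts
```

Keep the destination outside the source tree (eg D:\SUB rather than
D:\BNC\SUB), otherwise a second run would pick up the copies as well. The
function returns the number of files copied for each level.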

Working one way or the other, you can then use WordSmith to generate
wordlists, keyword lists, concordances etc. All without leaving the comfort
of your technically inferior, but oh so familiar Win/Dos environment!
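
To give an idea of what a wordlist involves, here is a very rough sketch of
a frequency count over a directory of SGML files. It is not WordSmith's
method, and its results will not match WordSmith's. It assumes that anything
between < and > is markup, that entity references look like &name; and that
a word is a run of letters with optional internal apostrophes. All names are
illustrative.

```python
import re
from collections import Counter
from pathlib import Path

TAG = re.compile(r"<[^>]*>")                 # SGML tags such as <w NN1>
ENTITY = re.compile(r"&[A-Za-z0-9#]+;")      # entity references such as &mdash;
WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")     # letters, optional apostrophes


def count_words(text):
    """Strip markup and entities, then count lower-cased word forms."""
    text = ENTITY.sub(" ", TAG.sub(" ", text))
    return Counter(WORD.findall(text.lower()))


def wordlist(root):
    """Accumulate word counts over every file under root."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            totals.update(count_words(path.read_text(encoding="latin-1")))
    return totals
```

wordlist(r"D:\BNC\A").most_common(20) would then list the twenty most
frequent forms in one directory. Each file is read into memory whole, which
keeps the sketch short but is not tuned for speed.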

I'm using a Pentium 333 with 128 MB RAM and two 8 GB hard disks.
Generating a complete wordlist for the BNC took about 20 minutes. A search for
"huddle" (why not!) across the full corpus took around 4 minutes. Working
with the small subcorpora is, obviously, much quicker. I'm sure that I'll
set up SARA at some point soon - in the meantime, it's good being able to
do this sort of thing at last.

Chris

--
Sri Lanka	21 Wijerama Mawatha, Colombo 7
		TEL  +94 75 332 309
UK		122, Queen Alexandra Mansions, Judd Street
		London WC1 H 9DQ
		TEL +44 171 833 4271
UK Mailing	c/o FCO (Colombo)
		The British Council: Sri Lanka
		King Charles Street, London SW1A 2AH
E-mail		ctribble@serendib.ccom.lk
Home Page	http://ourworld.compuserve.com/homepages/Christopher_Tribble