Re: Corpora: Corpus Linguistics User Needs

Chris Brew (Chris.Brew@edinburgh.ac.uk)
Fri, 31 Jul 1998 14:46:04 +0100

>In relation to Oliver Mason and Ylva Berglund's suggestion that Corpus
>Linguists might find it useful to learn how to develop at least some of
>their own essential tools, I'd like to put in my own twopennyworth in
>agreement.
>
>I constantly find I need to do something pretty basic on a text file
>without being able to do it in Word (or not knowing how). For example,
>read it and, say, make a copy using only the lines which start with > (or,
>in the case of emails, maybe the lines which do NOT start with >!).
>
>Since I do know how to write computer programs I can solve the problem; my
>colleagues and students mostly don't, and usually end up doing without, or
>struggling for ages to get it done with a most unfriendly and inefficient
>Word Basic macro.
>
>To learn the simple basics, enough to know how to open up a text file, read
>it line by line, effect some sort of changes or counting procedure on each
>line, and save the results is not very tricky.
>
>For what it's worth, my opinion is that the best solution would be a kind
>of published mini-course, together with human feedback in the form of a
>workshop. The published mini-course could be presented by Internet. The
>feedback might need to be in two ways: by emailing the authors (who'd pay
>them for replying?) and when enough folks got interested, organising a
>proper hands-on seminar somewhere suitable. This need not be especially
>expensive if a number of Corpus Linguists were interested. I would
>willingly collaborate.
>
>******************************************
>Mike Scott, author of WordSmith Tools and MicroConcord
>Applied English Language Studies Unit
>University of Liverpool, Liverpool L69 3BX
>http://www.liv.ac.uk/~ms2928/homepage.html
>http://www.liv.ac.uk/~ms2928/wordsmit.htm

My course notes (with Marc Moens) on Data-Intensive Linguistics (tending
towards book) at http://www.ltg.ed.ac.uk/~chrisbr/dilbook/
have some sections which might be useful for the tutorial which you
suggest. There are still some holes in the book, but it is becoming more
complete.

Incidentally, we chose the term Data-Intensive Linguistics because we
weren't sure whether what we do is recognizable to Corpus Linguists as
Corpus Linguistics, and didn't want to tread on any toes. The corresponding
course is clearly more technical than McEnery and Wilson's Corpus Linguistics
book. We anticipate that many of our students will be or become programmers,
but some of them will be able to do most of what they need by creative
combination of existing tools (especially easy when operating under
Unix, which has nice constructs for this purpose).
But there is obvious scope for tools like CQP and our own XML tools, which
reduce the need for custom programming in common tasks. [If you work with
SGML corpora, you will want to investigate 'sggrep', an SGML-aware version
of grep, which is part of the XML toolset about which we are shortly
presenting a tutorial at Coling 98 in Montreal.]
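
To make the scale of the problem concrete, here is a minimal sketch of the
filtering task Mike describes above (copy a text file, keeping only the
lines that do, or do not, start with ">"). It is written in Python purely
for illustration and is not taken from the course notes; the function names
and the file names in the usage note are my own assumptions.

```python
def filter_lines(lines, prefix=">", keep_matching=True):
    """Yield the lines that start with `prefix`.

    With keep_matching=False, yield the lines that do NOT start with it.
    `lines` can be any iterable of strings, e.g. an open text file.
    """
    for line in lines:
        if line.startswith(prefix) == keep_matching:
            yield line


def filter_file(in_path, out_path, prefix=">", keep_matching=True):
    """Read in_path line by line and write the selected lines to out_path."""
    # Hypothetical helper: the paths and the UTF-8 encoding are assumptions.
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_lines(src, prefix, keep_matching))
```

Calling filter_file("mail.txt", "quoted.txt") would keep the quoted lines,
and adding keep_matching=False would keep everything else. Under Unix the
same job needs no new program at all: grep '^>' mail.txt > quoted.txt does
the first, and grep -v '^>' does the second.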

The real challenge is to make tools which are flexible enough to let
people with only a limited interest in programming do substantial
creative work, while avoiding the danger of cramping the style of
those who can build their own tools. And, speaking as one of the
latter, I still prefer it when I can do the job without writing new
programs of my own.

Best

Chris

Email: Chris.Brew@edinburgh.ac.uk
Address: Language Technology Group, HCRC,
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4632 Fax: +44 131 650 4587