Re: Corpora: Re: Corpus Linguistics User Needs

Ted E. Dunning (ted@aptex.com)
Tue, 4 Aug 1998 14:16:50 -0700

Here is my wish list. Note that the features interact heavily with
one another, and that I need all of these operations on a nearly
daily basis. Unfortunately, the software that I use isn't available
for distribution.

a) the ability to produce cooccurrence counts of a relatively general
nature. In particular, my experience is that it is important to be
able to count (all|some) (words|stems|other) which occur (to the
left|to the right|on either side) within the same
(sentence|paragraph|document) and within n words of a
(word|stem|phrase|other).

b) it should also be possible to search for relatively complex
sequences of words with interesting combinations of attributes. It
generally requires a fairly advanced inverted index to make this
feasible.

c) the cooccurrence counting should be integrated into the search
capability so you can find the things which appear near specified
points in a corpus.

d) for people interested in languages that are written without word
delimiters (Chinese or Japanese, for example), it should be possible
to view a text as simultaneously segmented into words and also not
segmented into words. It should be possible to have alternative
segmentations present in the corpus at once.

e) operations which are done to annotate the corpus should be coupled
back into the search and cooccurrence counting operations. This
coupling should ideally be in real time.

f) there should be a simple extension language available so that users
can script the operations involved. I strongly recommend Tcl as the
extension language since it (1) is easy for users to learn and (2) has
technical characteristics which make it ideal for layering over
complex software systems.

This last feature is very important even if you don't think that
linguists should know how to program :-) If it is done well, then the
recalcitrant linguists won't even know that they *are* programming.
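
To make item (a) concrete, here is a minimal sketch of windowed
cooccurrence counting in Python. It is only an illustration, not the
software mentioned above: the function name cooccurrence_counts, its
parameters, and the toy corpus are all hypothetical. It assumes the
corpus is already split into sentences and tokens, matches plain words
only, and scans linearly rather than using an inverted index. Counting
stems or phrases would need a normalization or matching step in front
of it.

```python
from collections import Counter


def cooccurrence_counts(sentences, target, n=3, side="either"):
    """Count tokens within n positions of each occurrence of `target`.

    `sentences` is a list of token lists; the window never crosses a
    sentence boundary. `side` is "left", "right", or "either".
    (Hypothetical helper for illustration only.)
    """
    if side not in ("left", "right", "either"):
        raise ValueError("side must be 'left', 'right', or 'either'")
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            if side in ("left", "either"):
                # Up to n tokens immediately before the target.
                counts.update(tokens[max(0, i - n):i])
            if side in ("right", "either"):
                # Up to n tokens immediately after the target.
                counts.update(tokens[i + 1:i + 1 + n])
    return counts


# Toy corpus, invented for this example.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "cat", "chased", "the", "dog"],
]
left = cooccurrence_counts(corpus, "cat", n=1, side="left")
right = cooccurrence_counts(corpus, "cat", n=2, side="right")
```

Here left counts the single word preceding each "cat", and right
counts the two words following it. Swapping the sentence lists for
paragraph or document token lists changes the unit that bounds the
window.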