Corpora: Propose a toolkit?!

Dave Moffat (moffat@cardiff.ac.uk)
Wed, 5 Aug 1998 11:54:03 +0100

> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
> Behalf Of Ted E. Dunning
>
> Here is my wish list. Note that there is a lot of interaction between
> the features. Note also that I have need of all of these operations
> on nearly a daily basis. Unfortunately, the software that I use isn't
> available for distribution.
>
> ... cut...
>

Yes, it all sounds very desirable in a bag-o-tricks, which would be a
great solution for linguists who want to program without having
to learn too deeply how to do it in full generality.

It sounds to me like what a lot of people would want is a
public domain, web-downloadable, specialist programming library
of tools for linguists.
It would include all the things Ted Dunning mentioned:
textual scanning & counting (common counts built in, but with
scripting extensions for those who want to count unusual things);
platform independent; fast; maintained (hopefully); open source
so that the more expert programmer-linguists can contribute more tools;
statistics (again, common things to be easy, plus scripting for more
imaginative or novel approaches); and all to be tied together with a
scripting language, to take care of sequences of analysis stages.
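
To make the "common counts built in, plus scripting extensions" idea
concrete, here is a minimal sketch in Python. It is only an illustration
under my own assumptions: the names tokenize, word_counts and count_by
are hypothetical, not part of any existing toolkit, and the tokenizer is
deliberately crude (ASCII letters and apostrophes only).

```python
import re
from collections import Counter

# Assumption: a very simple notion of "word" (ASCII letters and apostrophes).
TOKEN_RE = re.compile(r"[A-Za-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_counts(text):
    """Built-in 'common count': frequency of each word token."""
    return Counter(tokenize(text))


def count_by(text, key):
    """Scripting extension: count whatever `key` maps each token to.

    `key` is a user-supplied function; tokens for which it returns
    None are skipped.
    """
    counts = Counter()
    for token in tokenize(text):
        label = key(token)
        if label is not None:
            counts[label] += 1
    return counts


sample = "The cat sat on the mat. The dog sat too."

# A common count that would be built in:
print(word_counts(sample).most_common(2))

# An "unusual" count supplied by the user: distribution of word lengths.
print(count_by(sample, len))
```

The point of the design is that the user who wants something unusual
writes only the small key function; the scanning and counting stay in
the library.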

One thing that Ted said makes me disagree a little: he likes Tcl.
I use tcl/tk, but without liking it much. It's idiosyncratic
and (to me) obtuse as well as not very efficient.
There are other languages around that can do the same job
and were better designed from the start (tcl was not initially
intended to do what it gets used for nowadays, I believe),
and that are cleaner, faster and easier to use.
One language that lots of people prefer is Python (www.python.org)
but I must admit that the debate "tcl versus python versus awk versus...."
is rather a controversial one, so I don't want to say "X is best".
Only be careful before you jump in and commit to one language
that somebody says is good: there are others, and it is a decision
that deserves some thought and advice from different experts.

The other significant thing that Ted Dunning said is that he
uses bits of software every day (which are therefore ideal
for our bag-o-tricks) .... but that they are not all public domain.
That is understandable: to write good, highly efficient statistical
analysis software is not easy, and so it is not freeware.

Apart from all those qualifications, the time seems right to make a
more concrete suggestion -- why don't we all get a bag-o-tricks
together?!!

There already is a lot of stuff out there, to make a basic start.
But it is the sort of project that could interest a funding agency,
in some country or other: the project to build and maintain
such a software base could pay two or three linguist/programmers
full-time, there would be publishing opportunities for them,
and they would find it fun.

Everybody could benefit from a public domain toolkit as we have
been discussing. It would go a long way to widening out the bottleneck
that has been identified earlier in this discussion thread
(that is, linguists having to learn all that programming);
it would avoid duplication of effort (goodness knows how much
of that there is at the moment); it would give everyone a level
platform from which to start research into corpora etc;
and it would take the tedious, unproductive but necessary programming
and debugging work out of linguistics research.

The problem is to make quality software freely available.
But given the attractive benefits, and the fact that otherwise
the whole field will be held up for years (equivalently),
I'm sure there must be a way, if there's a will....

There are some great successes in freely-available software.
Just look at Richard Stallman's GNU project, and the Linux
operating system! Believe it or not, Linux is superior to
Microsoft Windows (95/NT/98/...99...2001... etc).
(Windows looks nice, has lots of features in its interface,
but as an *operating system* Linux is more reliable, faster,
more programmable and versatile altogether. I use them both,
so am confident of this opinion but you are of course welcome to yours.)
Most people on this list have used GNU software, maybe without
always realising it.

... So it can be done. Richard Stallman himself may well be interested
in such a project, and it would do no harm to ask his
advice in any case (rms@mit.edu, I think; see also http://www.gnu.org).

There are other technical issues (a scripting language or C++ for
the stats parts? If C++, then platform independence is harder,
because you have to maintain lots of different binaries, etc.)
but in principle, isn't this the sort of thing that most
of us would like to see happen?

David Moffat