Re: Corpora: Corpus Linguistics User Needs

Arvi Hurskainen (ahurskai@ling.helsinki.fi)
Wed, 29 Jul 1998 17:59:57 +0300 (EET DST)

In response to the issue raised by Oliver Mason, and already responded to by
Geoffrey Sampson and Henning Reetz, I should say that most ready-made
packages, if not all, are insufficient for the needs of a linguist. I
fully agree with Reetz that writing ad hoc programs that
really do the jobs I want is not an easy task.
In working with languages which inflect to the left and right,
mostly to the left (having prefixes rather than suffixes), I find simple
concordance programs and listing and counting tools inadequate. The approach
which I have been developing relies on the linguistic analysis of the text,
and the search is then directed at the analyzed text. This approach opens
up quite different kinds of possibilities for corpus research than
general-purpose tools do. The difference is very clear, because in this approach
one can reliably search for lemmas (hidden inside the word-form),
get morphological data, part-of-speech specifications, and even the syntactic
position/function of each syntactic unit.
It is like searching a morphologically and syntactically
tagged corpus, but the text itself is just ordinary plain text, and the
system tags the text on the fly.
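
To illustrate searching by lemma in text that is analyzed on the fly, here is a
minimal sketch in Python. It is not the actual system: the tiny lexicon, the tag
names (SG3, PRES, etc.) and the function names are all hypothetical, and a
dictionary lookup stands in for a real two-level morphological analyzer.

```python
import re
from typing import NamedTuple


class Analysis(NamedTuple):
    form: str    # word-form as it appears in the text (lowercased)
    lemma: str   # base form hidden inside the word-form
    pos: str     # part-of-speech label
    tags: tuple  # morphological tags


# Hypothetical toy lexicon; a real analyzer would derive these readings
# from rules and a dictionary instead of listing full word-forms.
TOY_LEXICON = {
    "anasoma": ("soma", "V", ("SG3", "PRES")),
    "walisoma": ("soma", "V", ("PL3", "PAST")),
    "kitabu": ("kitabu", "N", ("SG",)),
    "vitabu": ("kitabu", "N", ("PL",)),
}


def analyze(text):
    """Tag plain text on the fly, one Analysis per word-form."""
    result = []
    for form in re.findall(r"[^\W\d_]+", text.lower()):
        # Unknown forms are kept, marked UNKNOWN, with the form as lemma.
        lemma, pos, tags = TOY_LEXICON.get(form, (form, "UNKNOWN", ()))
        result.append(Analysis(form, lemma, pos, tags))
    return result


def search_lemma(text, lemma):
    """Return the analyses of all word-forms that belong to the lemma."""
    return [a for a in analyze(text) if a.lemma == lemma]
```

Calling search_lemma("Anasoma kitabu. Walisoma vitabu.", "soma") finds both
"anasoma" and "walisoma", which a plain string search for a single word-form
would not do, because the prefixes differ.
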
Tagging arbitrarily selected text automatically without manual
checking has risks, but the system has improved steadily. One does not
program such a system overnight. For Swahili, for example, the development
has taken 13 years now, and the work goes on. The system is based on two-level
morphology with a language-independent analysis program and a good
rule compiler, both of which were devised by other people. Yet the
language-specific rules and a dictionary are the work of an individual
linguist. In further processing, small scripts written in Awk, Perl, Lex,
etc. are handy and useful. Linguists can (and should) learn to write those.
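
As an example of the kind of small post-processing script meant here, the
following Python sketch counts lemma frequencies in analyzer output. The input
format (one word per line, tab-separated form, lemma and part of speech) is an
assumption made for illustration, not the format of any particular analyzer,
and the function name is hypothetical.

```python
from collections import Counter


def lemma_frequencies(lines):
    """Count (lemma, part-of-speech) pairs in tab-separated analyzer output.

    Each line is assumed to look like: form<TAB>lemma<TAB>pos
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _form, lemma, pos = fields[:3]
        counts[(lemma, pos)] += 1
    return counts
```

With a file of analyzer output, lemma_frequencies(open("analyzed.txt")) followed
by counts.most_common(20) would list the twenty most frequent lemmas; the file
name is only a placeholder.
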
I should say that this kind of approach is necessary, if we want
to make the needs of a linguist and the capabilities of programs meet.

--Arvi Hurskainen

-- 
*******************************************************************************
PLEASE NOTE THE NEW POSTAL ADDRESS!!!
Arvi Hurskainen, Professor            Arvi.Hurskainen@ling.helsinki.fi (e-mail)
Department of Asian and African Studies               +358 9 191-22677 (phone)
Box 59, (Unioninkatu 38 B)                            +358 9 191-22094 (fax) 
FIN-00014 University of Helsinki             
Finland