Re: Corpora: Suitable software for producing lemmatised conc

Daniel Ridings (ridings@svenska.gu.se)
Sat, 7 Feb 1998 10:51:36 +0100 (MET)

There are probably few languages that can be lemmatized without having
POS-tagged the material first. I am working with Swedish and two Bantu
languages (Shona and Ndebele) and the lemmatization problem is acute.

I'm not so sure Adam and I really disagree, but I wouldn't say that
POS-tagging is more difficult than lemmatization ... just try lemmatizing
without access to POS, and that will be difficult :) Maybe we could say
that the one makes little sense without the other.

What I do, taking Swedish as the example, is 1) POS-tag the text and only
then 2) lemmatize. The second step filters out quite a lot of erroneous
POS-tags, because I only allow a lemma to be assigned if the word-type in
the lexicon can have the same potential analysis as the one proposed by
the POS-tagger. (The tagger's lexicon is _much_ smaller.)
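
The filtering step above could be sketched roughly as follows. This is
only an illustration, not the actual Swedish pipeline: the lexicon
contents, the tags other than VPRT, and the function name lemmatize are
all assumptions made up for the example.

```python
# Hypothetical full-form lexicon: word form -> set of (lemma, POS) analyses.
# The entries and the tags "VPRG" and "NN" are invented for illustration.
LEXICON = {
    "ran": {("run", "VPRT")},
    "running": {("run", "VPRG"), ("running", "NN")},
}


def lemmatize(word, tagger_pos, lexicon=LEXICON):
    """Return a lemma only if the lexicon licenses the tagger's POS.

    None means the lexicon has no analysis with that POS (a likely
    tagging error) or more than one (still ambiguous).
    """
    analyses = lexicon.get(word.lower(), ())
    lemmas = sorted(lemma for lemma, pos in analyses if pos == tagger_pos)
    return lemmas[0] if len(lemmas) == 1 else None
```

A token for which lemmatize returns None is one where tagger and lexicon
disagree, which is where the erroneous POS-tags show up.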

Having done this, it is a simple matter to encode it in the IMS Corpus
Workbench suite. The lemma just becomes another attribute of the
word-type, on the same level as the POS. To use a fictional English example:

[lemma="run" pos="VPRT" word="ran"]
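
As a rough sketch of what the input to the encoder might look like: the
Corpus Workbench reads "verticalized" text, one token per line with the
attributes separated by tabs. The helper name to_vertical, the column
order and the tag "PRON" are assumptions for the example.

```python
def to_vertical(tokens):
    """Render (word, pos, lemma) triples as tab-separated lines, one token per line."""
    return "\n".join("\t".join(tok) for tok in tokens) + "\n"


# Fictional English tokens; "PRON" is an invented tag.
tokens = [("He", "PRON", "he"), ("ran", "VPRT", "run")]
vertical = to_vertical(tokens)
```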

The IMS query processor then allows searching for any of the attributes,
of which 'word' is only one. (The "status" of the lemma ... I guess that's
a whole new story. "He's in the running" ... what is the _lemma_ of
running? Is it more natural to search for "running" or for all forms of
"run" (lemma) that are tagged as "noun"?)
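
To make the idea of searching on any attribute concrete, here is a small
plain-Python imitation. It is not the IMS query processor, only a sketch
of the principle; the function name search, the toy corpus and the tags
other than VPRT are assumptions.

```python
import re


def search(corpus, **constraints):
    """Return positions of tokens whose attributes fully match every given regex."""
    return [
        i
        for i, tok in enumerate(corpus)
        if all(re.fullmatch(rx, tok[attr]) for attr, rx in constraints.items())
    ]


# Toy corpus; every token carries the same three attributes.
corpus = [
    {"word": "He", "pos": "PRON", "lemma": "he"},
    {"word": "ran", "pos": "VPRT", "lemma": "run"},
    {"word": "runs", "pos": "VPRS", "lemma": "run"},
]

hits_lemma = search(corpus, lemma="run")          # all forms of "run"
hits_word = search(corpus, word="ran")            # one surface form only
hits_verbs = search(corpus, lemma="run", pos="V.*")  # lemma and POS combined
```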

I'm presently doing this for 10 million running words in Swedish (already
on the net at http://ldb20.svenska.gu.se with POS attributes) and will
probably go up to 20 million (the whole PAROLE corpus).

The necessity of lemmatization for Bantu languages, particularly those
with "conjunctive" orthographies, is acute. So many pronouns, objects,
tense markers and even conjunctions are stacked on before the lexical
unit, and a fair collection of goodies lines up after it ... everything
in one single orthographical word-type. An alphabetical concordance is
more or less useless. Most of our material consists of transcribed
interviews, and you sure do get a lot of first and second person singular
pronouns stacked up in a concordance, followed by half a dozen tense
markers, followed by a set of object forms, all sorted together before
you get to the "lemma". I haven't gotten as far with these languages as
with Swedish. I have the lexicon and can do the analysis, but there sure
is a lot of ambiguity.
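
The difference between an alphabetical and a lemma-sorted concordance can
be shown with a toy example. The Shona-like forms and their stems below
are supplied by hand purely for illustration and should be checked against
a real lexicon; in practice the lemma would come from the morphological
analysis, with all the ambiguity that involves.

```python
# Hand-supplied (word form, lemma) pairs; illustrative only.
tokens = [
    ("ndinoda", "da"),
    ("ndichaenda", "enda"),
    ("unoda", "da"),
    ("ndakaenda", "enda"),
    ("ndinoenda", "enda"),
]

# Alphabetical by surface form: the prefixes decide the order,
# so forms of the same lemma end up scattered.
by_form = sorted(tokens, key=lambda t: t[0])

# Sorted by lemma first, then by form: all forms of a lemma stay together.
by_lemma = sorted(tokens, key=lambda t: (t[1], t[0]))

for word, lemma in by_lemma:
    print(f"{lemma}\t{word}")
```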