Tagger/Lemmatizer summary - at last!

John Shillaw (jds@sakura.cc.tsukuba.ac.jp)
Sun, 8 Oct 1995 01:56:06 +0900

My apologies to all _124_ members of the list who asked for a copy of the
summary I promised. I'm afraid that work interfered with pleasure and I
just didn't have the time to compile the information I received.

Better late than never, I hope: the summary is attached. There wasn't
as much available as I thought, or else I couldn't get hold of the
information. I offer the summary as a draft copy and will hold off
sending the finished version to Knut Hofland for a couple of days to
allow any additional information to be added. There are a few gaps in
the information that perhaps people can help fill and there are bound to
be errors to be corrected.

I hope it's useful: it's certainly been an education for me.

John Shillaw
University of Tsukuba
Japan

-------------------------------------------------------------------------
SUMMARY OF POS TAGGING AND LEMMATIZING SOFTWARE (9/9/1995)

John Shillaw <jds@sakura.cc.tsukuba.ac.jp>
University of Tsukuba, Japan

This summary is the result of a request for advice I made to CORPORA-L
about the availability of tagging and lemmatizing programs.

What follows is a non-exhaustive list of programs that are free or
commercially available as of the above date. Most of the information was
supplied by the members of CORPORA-L listed below, supplemented by a
little research of my own. I decided to produce a minimalist list
because almost all of the programs are extensively documented through
on-line sources. I provide the current URLs and/or contact e-mail
addresses for each program. I have made no attempt to evaluate the
programs, simply because I haven't had the time to try them. What is
clear from the information I have received is that the programs differ
in the tasks they are suited to.

I'd like to express my thanks to the following people for making my
search so much easier.

Evan Antworth <evan.antworth@sil.org>
Thomas Bilgram <bilgram@ling.hum.aau.dk>
John Bro <bro@lin.ufl.edu>
Pernilla Danielsson <pernilla@svenska.gu.se>
Rickard Domeij <domeij@nada.kth.se>
Alex Chengyu Fang <ucleacf@ucl.ac.uk>
Adam Kilgarriff <ak28@it-research-institute.brighton.ac.uk>
Bruno Maximilian Schulze <schulze@ims.uni-stuttgart.de>

FREEWARE

Name: BRILL-TAGGER
Function: Tagger only
Available: Anonymous FTP blaze.cs.jhu.edu/pub/brill/Programs
Platform: Can be compiled for UNIX, DOS & Mac. Compiler written in
Perl.

Name: XEROX-TAGGER
Function: Tagger only
Available: Anonymous FTP parcftp.xerox.com:/pub/tagger/
Platform: UNIX
Comment: See http://www.xerox.com/lexdemo/xlt-overview.html for more
information.

Name: PC-KIMMO 2 (supersedes version 1, which is still available)
Function: Morphological analyzer
Available: ftp://ftp.sil.org/software/dos/pc-kimmo/pck20b27.zip (DOS/Windows)
ftp://ftp.sil.org/software/mac/pc-kimmo/pc-kimmo20b27.sea_hqx (Mac)
ftp://ftp.sil.org/software/unix/ (UNIX sources)
Platforms: DOS, Windows (command-line interface using Windows memory
management), Mac, UNIX
Comment: Can be used as the 'engine' for a tagger front-end program.
Comment: More information at http://www.sil.org/pckimmo/pc-kimmo.html
Comment: Requires ENGLEX to work.
ftp://ftp.sil.org/data/pc-kimmo/dos/engl20b4.zip (DOS/Windows)
ftp://ftp.sil.org/data/pc-kimmo/mac/englex20b4.sea_hqx (Mac)
ftp://ftp.sil.org/data/pc-kimmo/unix/englex20b4.tar_z (UNIX)
ftp://ftp.sil.org/data/pc-kimmo/unix/englex20b4.zip (UNIX)
Comment: SIL offers other programs for text analysis. Check out:
http://www.sil.org/
gopher://gopher.sil.org/
Comment: There is a list, PC-PARSE, for discussion of SIL's software.
Contact MAILSERV@SIL.ORG: include SUBSCRIBE PC-PARSE in the body
of the message.

Name: AD ENGLISH LEMMATIZER
Function: Lemmatizer
Available: Contact Max Schulze
Bruno Maximilian Schulze <schulze@ims.uni-stuttgart.de>
Platform: ?
Comment: Max describes the program as "...a dictionary lookup tool
with lemma-disambiguation using the part-of-speech
information."
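
To illustrate the general idea of dictionary lookup with lemma
disambiguation by part of speech, here is a minimal sketch in Python. It
is not Max's program and says nothing about how that program is built:
the toy lexicon, the tag names (NN, NNS, VBD, JJ) and the function names
are all assumptions made up for the example.

```python
# Hypothetical toy lexicon: surface form -> {POS tag: lemma}.
# Entries and tag names are illustrative assumptions only.
LEXICON = {
    "saw": {"NN": "saw", "VBD": "see"},
    "left": {"JJ": "left", "VBD": "leave"},
    "dogs": {"NNS": "dog"},
}


def lemmatize(word, tag):
    """Look up the lemma for a word form, using its POS tag to choose
    between competing lemmas. Falls back to the lowercased form when the
    word or the tag is not in the lexicon."""
    form = word.lower()
    return LEXICON.get(form, {}).get(tag, form)


def lemmatize_tagged(tokens):
    """Lemmatize a sequence of (word, tag) pairs from an upstream tagger."""
    return [lemmatize(word, tag) for word, tag in tokens]


if __name__ == "__main__":
    tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("saw", "NN")]
    print(lemmatize_tagged(tagged))
```

The point of the sketch is only that the same form ("saw") maps to
different lemmas depending on the tag it arrives with, so the quality of
the lemmas depends on the quality of the tagging.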

COMMERCIAL SOFTWARE

Name: AUTASYS
Platform: DOS
Speed: c. 15,000 words per minute on a 486 DX2
Input: Text and SGML-type input
Interface: Menu driven
Tagsets: LOB, ICE (International Corpus of English), and SKELETON
Lemmatizer: Yes, for all three tagsets.
Contact: Alex Chengyu Fang
Address: Survey of English Usage
University College London
UK
E-Mail: ucleacf@ucl.ac.uk
Price: Contact Alex Fang

Name: ENGTWOL
Function: Tagger and lemmatizer
Platforms: ?
Price: Contact info@lingsoft.fi
Comment: Languages other than English available.
Comment: For an outline of the program send an empty message to
engcg-info@ling.helsinki.fi. No subject needed.
Comment: There's a limited tagger service by e-mail for evaluation.
Retrieve the program outline as explained above for full
instructions.
Comment: For information about the program functions, contact:
Atro Voutilainen (preprocessor, ENGTWOL lexicon, and the
disambiguation constraints for morphological ambiguities)
Juha Heikkila (ENGTWOL lexicon)
Arto Anttila and Timo Jarvinen (Constraint Syntax)
Contact: FirstName.LastName@Helsinki.FI,
e.g. Atro.Voutilainen@Helsinki.FI

OTHER SOURCES OF INFORMATION

On-line

Natural Language Software Registry: The best source I found for
information on a range of software tools for all aspects of text
analysis. It contains more detailed information on most of the above
programs.

Center for Lexical Research

Corpora-L

Print

Adam Kilgarriff pointed me in the direction of the DECIDE document. In
Adam's own words:

"EU Project DECIDE (Describing and Evaluating Extraction Tools for
Collocations in Dictionaries and Corpora) includes quite a lot of info
relevant for your purposes. It includes a good summary of taggers
(both techniques and systems)."

I don't know where copies can be obtained, but Max Schulze was one of the
authors, so he should know.

Corrections and updates to this document should be addressed to Knut
Hofland <Knut.Hofland@hd.uib.no> as the owner of CORPORA-L, not to me.