Re: Spanish

Chris Brew (chrisbr@cogsci.ed.ac.uk)
Mon, 17 Apr 95 11:22:15 +0100

> Hi,
>
> I am searching for two pieces of information:
>
> 1. Could someone point me toward some literature, documentation or code
> (preferably C) relating to part of speech taggers? Specifically, I am
> hoping to find info on Spanish taggers; or, if nothing specific to
> Spanish is available, I would be interested in algorithms used for other
> romance languages.
The standard algorithm for part of speech tagging uses Hidden Markov
Models, and seems to be applicable to a wide variety of languages.
It is quite likely that you will be able to use this algorithm
unchanged. What you may have to worry about is details of the size
and nature of the tag set, the amount of training material which you
have available and the extent to which you give the tagger hints to
bootstrap the training process.
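
To make the idea concrete, here is a minimal sketch in Python of a bigram HMM tagger. It assumes you already have some hand-tagged text, so it estimates the model by simple relative-frequency counting rather than by the unsupervised re-estimation that a tagger trained from raw text would use. The toy corpus, the four-tag tag set, the smoothing constant `alpha` and the function names `train` and `viterbi` are all invented for illustration and are not taken from any of the packages mentioned in this message.

```python
import math
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, invented for illustration).
CORPUS = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"),
     ("la", "DET"), ("carne", "NOUN")],
]

START = "<s>"  # pseudo-tag marking the start of a sentence


def train(corpus, alpha=0.1):
    """Estimate smoothed log-probabilities by counting tagged data."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)
    vocab_size = len(vocab) + 1  # one extra slot for unknown words

    def log_trans(prev, tag):
        counts = trans[prev]
        total = sum(counts.values())
        return math.log((counts[tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        counts = emit[tag]
        total = sum(counts.values())
        return math.log((counts[word] + alpha) / (total + alpha * vocab_size))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    # best[t] = log-probability of the best path ending in tag t
    best = {t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + log_trans(p, t))
            new_best[t] = best[prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


tags, log_trans, log_emit = train(CORPUS)
print(viterbi(["la", "casa", "come"], tags, log_trans, log_emit))
# ['DET', 'NOUN', 'VERB']
```

Nothing in the decoding step is specific to Spanish. What changes from language to language is the tag set and the data used to estimate the two probability tables, which is why tag set size and the amount of training material are the things to worry about.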

The most widely used and conveniently available tagger is one implemented
in Common Lisp by a group at Xerox PARC. It comes with documentation
and works very well. Available from ftp://parcftp.xerox.com/pub/tagger
when I last looked. Latest version is tagger-1.2.tar.Z, which I haven't
tried. If you need to install Common Lisp to run it, there are several
good free implementations.
(cf. The Association of Lisp Users home page
http://www.cs.rochester.edu/users/staff/miller/alu.html)

There is a paper by Briscoe, Grefenstette, Padro and Serail
called "Hybrid Techniques for Training HMM POS Taggers" which includes
work on Spanish. I have an early draft published as a Rank Xerox
Research Centre Report MLTT-007, but it has probably been
published somewhere as well. Contact grefen@xerox.fr for a
copy of the latest version. The paper is listed in but not
directly available from
http://www.xerox.fr/grenoble/mltt/reports/home.html

(non-constituent coordination, yes!!! PP/pn --> PP/pn and PP/pn,
and I didn't do it on purpose)
>
> 2. Does anybody know of a Spanish corpus marked up for part of speech,
> or even something in the format of a Spanish lexicon, which is available
> on-line for public consumption (To be used, hopefully, in the creation
> of a Spanish tagger)?
The Briscoe et al. paper reports a 17k word tagged corpus, and gives
a reference to I. Moreno-Torres (1994) A Morphological
Disambiguation Tool: application to Spanish, Aquilex-II working
Paper 24. Universitat Politècnica de Catalunya. I don't know
if that is publicly available. Please let me know if you find
out anything more.
> Any information which you can supply would be greatly appreciated.
>

Chris

The Language Technology Group of the Human Communication Research Centre
(a UK ESRC funded interdisciplinary institution spanning several
departments of the Universities of Durham, Edinburgh and Glasgow)
provides a free enquiry service for Natural Language Software. More
extensive support and help available by negotiation.

A WWW interface is available on:

http://www.cogsci.ed.ac.uk/~chrisbr/langsoft.html.

This address may change, since we hope to integrate our services
with those of other initiatives in Europe.

------------------------------------------------------------------
Dr Chris Brew,
Language Technology Group,
The University of Edinburgh
Human Communication Research Centre
------------------------------------------------------------------
Email: Chris.Brew@edinburgh.ac.uk
Work Address: HCRC, 2 Buccleuch Place, Edinburgh EH8 9LW
Scotland
Work Telephone: +44 131 650 4631
Work fax: +44 131 650 4587
------------------------------------------------------------------
Home Address: 13 Kilmaurs Road, Edinburgh EH16 5DA
Scotland
Home Telephone: +44 131 662 0574
------------------------------------------------------------------