Re: Question about the Xerox POS tagger

Ken Beesley (Ken.Beesley@grenoble.rxrc.xerox.com)
Wed, 2 Oct 1996 13:44:20 +0200

Dear Mr. Camara,

The publicly available Xerox Part-of-Speech Tagger Trainer is a framework
for building specific taggers (like your Portuguese system). It's in LISP,
was released some years ago, and is no longer maintained. The authors
(Cutting and Pedersen) no longer work at Xerox. However, a number of people
have been able to get it to work. I'll forward your message to some
people who _may_ be able to help.

In the meantime, you might want to take a look at a Portuguese tagger
(and taggers and morphological analyzers for other languages) built at the
Rank Xerox Research Centre

http://www.xerox.fr/grenoble/mltt/Mos/Tools.html

and licensed by XSoft

http://www.xerox.com:80/XSoft/lexdemo/xlt-welcome.html

These taggers use proprietary Xerox HMM technology that is not in the public domain.

As for your specific question regarding the format of the training.txt file,
our surviving example for an old French tagger shows a training file with one
token per line.
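
If your training.txt is currently running text, a rough way to get it into
one-token-per-line form is sketched below in Python. This is only an
illustration: the tokenization rule, the function names and the latin-1
encoding are assumptions of mine, not something taken from the tagger
distribution, and the sketch does not settle whether the trainer expects
tags in the file.

```python
import re

# Assumption: a token is either a run of word characters (optionally joined
# by internal hyphens or apostrophes) or a single punctuation mark.
# The tagger's own tokenizer may split text differently.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokens_per_line(text):
    """Return the text reformatted with one token per line."""
    return "\n".join(TOKEN_RE.findall(text)) + "\n"


def convert_file(src_path, dst_path, encoding="latin-1"):
    """Read running text from src_path and write one token per line to dst_path.

    The latin-1 default is an assumption about how the corpus is encoded.
    """
    with open(src_path, encoding=encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(tokens_per_line(text))
```

For example, convert_file("corpus.txt", "training.txt") would write the
reformatted text, where "corpus.txt" is a hypothetical name for the file
holding the original running text.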

Boa sorte,

Ken

> Date: Mon, 30 Sep 96 12:30:07 GMT
> From: jtc@di.fct.unl.pt (Jose T.A. Camara [gpl])
> To: corpora@hd.uib.no
> Subject: Question about the Xerox POS tagger
>
> Dear all,
>
> I have been working on my MS thesis, on statistical NLP, and I have to run
> the Xerox Part-of-Speech Tagger for the Portuguese language, but unfortunately
> it fails before generating the HMM, more precisely when accessing the file
> "training.txt".
>
> It runs pretty well for the English language.
>
> I have modified the tagger (adapting it to the Portuguese language) strictly
> according to the Xerox document, that is, creating a Portuguese lexicon
> (with exactly the same structure as the English one), based upon a Portuguese
> corpus, and appropriate open classes, symbol and transition biases,
> as well as specifying all required new paths.
>
> The modified tagger (tag-brown.lisp) includes the commands to compile and
> load the tag-trainer and to "train-on-files" on "training.txt", a file
> with some text in Portuguese.
>
> I have no reference for the structure of this "training.txt" file, so
> I do not really know whether it requires any special structuring, or whether
> the failure is due to this.
>
> Should this file also include tags? If so, in what format?
>
> Note: the tagger fails during the execution of the command:
>
> (pdefsys:load-system :tag-english)
>
> right after opening the training.txt file.
>
>
> I would appreciate any help or guidance in solving this problem.
>
> Thank you so much
>
> Jose Camara (jtc@fct.unl.pt)
> My environment is:
>
> System: SunOS Release 4.1.3_U1 (GENERIC+MZ+MULTICAST)
> Lisp: CMU Common Lisp 17f
> Tagger: tagger-1.2.tar
> Guide: The Xerox Part-of-Speech Tagger Version 1.0 document
> by Doug Cutting and Jan Pedersen
> The following instructions execute successfully:
> (compile-file "src/pdefsys")
> (load "src/pdefsys")
> (pdefsys:compile-system :tdb-sysdcl)
> (pdefsys:load-system :tdb-sysdcl)
> (pdefsys:compile-system :tag-english :propagate t)
>
> -------------------------------------------------------------------
> Universidade Nova FCT
> Lisbon, 27 September 1996
>

*******************************************************************
Kenneth R. Beesley              ken.beesley@xerox.fr
Rank Xerox Research Centre      Tel: (33) 76 61 50 64
6, chemin de Maupertuis         Fax: (33) 76 61 50 99
38240 MEYLAN, France            http://www.rxrc.xerox.com/grenoble/mltt/