Corpora: summary Brill's vs. CLAWS

From: Veronika Koller (Veronika.Koller@isis.wu-wien.ac.at)
Date: Tue Aug 14 2001 - 15:53:52 MET DST

    Dear list members,

    Some time ago (16 July) I posted a question about the respective
    (dis)advantages of Brill's tagger vs. the CLAWS tagger.

    The following people kindly provided assistance on the topic:

    Eric Atwell, who, along with Chris Tribble, drew my attention once more to
    the Amalgam tagger at
    http://www.scs.leeds.ac.uk/amalgam/amalgam/amalghome.htm (especially for
    ICE-GB, which we are planning to buy). He pointed out the flexibility of
    Brill's:
    >the original version comes trained to apply Brown Corpus tagset, but it
    >can be retrained with another tagged corpus to apply another tagset. You
    >can "do this yourself" with your own preferred tagged corpus.
    He also explained how the two are in fact related:
    >Using Brill's tagger is more like "Do-It-Yourself": you can download the
    >tagger software, free, from Eric Brill's homepage
    >http://research.microsoft.com/~brill/
    >then run it on your own texts yourself. Alternatively, you could try our
    >free email-server version, just email your text (plain ascii, not
    >HTML/doc/etc, and not an attachment) to amalgam-tagger@comp.leeds.ac.uk
    >with Subject: Brown and it should be returned with the tags supplied by
    >standard Brill tagger.
    Moreover, he mentioned the great CLAWS customer service provided by Chris
    Needham at UCREL (I can corroborate that!).

    Matthew Bell, who provided a sample of text tagged with Brill's along with
    instructions for editing.

    Thorsten Brants, who informed me about his own tagger available at
    http://www.coli.uni-sb.de/~thorsten/tnt/

    Another alternative was mentioned by Vlad Gojol, whose tagger can be
    obtained by getting in touch with him through gojol@rnc.ro

    Yuval Krymolowski, who reminded me of a discussion on POS tagging on this
    list earlier this year.

    Paul Rayson, who pointed out the fact that the BNC is tagged with CLAWS
    (here's some more info on the subject provided by the BNC people:
    >The POS tagging is present in the BNC, and our software uses it. For
    >reasons of scale however, we do not maintain a POS index to the whole
    >corpus, and therefore you cannot do searches of the type "find me every
    >plural noun". You can however do searches such as find "lead" tagged as
    >Verb. You can also search for a word or word pattern, and then sort the
    >results by the POS codes of the words to left or right of the target.
    >We could reindex the BNC to produce such a POS index, but it would mean
    >another CD's worth of data to distribute it!)
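
    To make the two kinds of search described above concrete, here is a
    minimal sketch in Python. It is not the BNC software: the (word, tag)
    token list, the function names and the toy sentence are all assumptions
    made for illustration, with tags written in the style of the CLAWS C5
    tagset.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); a hypothetical in-memory format


def find_tagged(tokens: List[Token], word: str, tag_prefix: str) -> List[int]:
    """Return positions where `word` occurs with a tag starting with `tag_prefix`."""
    return [
        i for i, (w, t) in enumerate(tokens)
        if w.lower() == word and t.startswith(tag_prefix)
    ]


def sort_by_right_tag(tokens: List[Token], hits: List[int]) -> List[int]:
    """Sort hit positions by the POS tag of the token to their right."""
    def right_tag(i: int) -> str:
        # A hit at the very end of the text has no right neighbour.
        return tokens[i + 1][1] if i + 1 < len(tokens) else ""
    return sorted(hits, key=right_tag)


# Toy data, invented for this example.
SAMPLE: List[Token] = [
    ("They", "PNP"), ("lead", "VVB"), ("us", "PNP"), ("with", "PRP"),
    ("lead", "NN1"), ("pipes", "NN2"), ("and", "CJC"), ("will", "VM0"),
    ("lead", "VVI"), ("the", "AT0"), ("team", "NN1"),
]

if __name__ == "__main__":
    verb_hits = find_tagged(SAMPLE, "lead", "VV")   # "lead" tagged as a verb
    print(verb_hits)
    print(sort_by_right_tag(SAMPLE, verb_hits))     # ordered by right-hand tag
```

    Both functions start from a known word and scan outwards, which is the
    kind of query the BNC people say is supported. A query such as "every
    plural noun" would need an index keyed by tag instead.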

    Mary D. Taffet, who summarized her experience with Brill's tagger as follows:
    >Advantages of using Brill's tagger:
    >-- Trainable
    >-- You can create new part-of-speech tags to handle special situations,
    >which I had to do with transcript data that was full of transcription
    >errors; there isn't a list anywhere telling you what the "proper" tags
    >are, so there are no limitations on what you can use for tags. This can
    >be an advantage (in the case of errorful data like I was working with) and
    >also a disadvantage in that you aren't informed if you make genuine
    >mistakes when hand-tagging your training corpus.
    >Disadvantages of using Brill's tagger (for training):
    >-- With sparse training data, not many rules are learned; I had to
    >manually change the threshold value to a lower number so that more rules
    >a separate step.
    >-- The Brill tagger also expects to receive only one sentence per line,
    >but having sent through it transcripts which were broken out by speaker
    >only, I can tell you that the one sentence per line restriction isn't
    >really enforced per se (therefore not a true disadvantage).
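
    Since the tagger does not warn about mistyped tags in a hand-tagged
    training corpus, a simple tag inventory can catch some of them before
    training. The following is only a sketch under assumptions: it presumes
    the corpus is stored one sentence per line with tokens written as
    word/TAG, and the function names, the "<MISSING>" marker and the sample
    lines are invented for the example.

```python
from collections import Counter
from typing import Iterable, List


def tag_counts(lines: Iterable[str]) -> Counter:
    """Count how often each tag occurs in word/TAG formatted lines."""
    counts: Counter = Counter()
    for line in lines:
        for token in line.split():
            # Split on the LAST slash so words containing "/" still work.
            _word, sep, tag = token.rpartition("/")
            if sep:
                counts[tag] += 1
            else:
                counts["<MISSING>"] += 1  # token carries no tag at all
    return counts


def rare_tags(counts: Counter, max_count: int = 1) -> List[str]:
    """Return tags seen at most `max_count` times; these are likely typos."""
    return sorted(tag for tag, n in counts.items() if n <= max_count)


# Toy hand-tagged data; "NM" is a deliberate typo for "NN".
SAMPLE_LINES = [
    "The/DT cat/NN sat/VBD ./.",
    "The/DT dog/NN barked/VBD ./.",
    "A/DT bird/NM sang/VBD ./.",
]

if __name__ == "__main__":
    print(rare_tags(tag_counts(SAMPLE_LINES)))
```

    A rare tag is not necessarily wrong, especially with sparse training
    data or deliberately invented tags, so the output is a list to review by
    hand rather than a list of errors.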

    More on using Brill's came from Marc Vilain:
    >Another nice aspect of the approach is that
    >you can actually fix p-o-s tagging problems by manually adding to the
    >rule set; very helpful for applications where the domain strays from the
    >financial/political documents with which the system was trained (in
    >English, that is). It's MUCH less convenient to patch an HMM, as a
    >point of comparison.
    >Problems with Brill? The standard distribution of Brill's software
    >shows its origin as a laboratory experiment: text must be coerced into a
    >non-structure-preserving input format, all the lexica and rules must be
    >read up front (so start up is time-consuming), etc. You can alleviate
    >some of these problems by using MITRE's distribution of the software
    >instead of Brill's. Our version reimplements Brill's tagger in C
    >(faster), with incremental cached lexicon access (less startup costs),
    >and a few other advantages. We have some lingering bugs/misfeatures of
    >our own (won't compile under Red Hat 7), but generally the code is
    >better than Brill's.
    >Aside from bugs, the biggest drawback of both versions of the tagger is
    >an architectural weakness of the approach, namely that the lexicon is
    >compiled from the training data. This is another artifact of the
    >laboratory origins of the software, and it means that you have to go to
    >some degree of trouble to extend the tagger to cover vocabulary of
    >relevance to a particular application. Eventually, we will address this
    >by producing yet another version of the tagger that will be wholly
    >integrated with the Alembic lexicon (this one will be in Lisp).

    On the question of whether Brill's can be used with Windows as well, both
    Mary and Marc produced valuable info:
    >One thing I should mention is that somebody has ported it to Windows with
    >an interface -- but the primary language of that port is French.
    >However, I don't know if it is possible to "call" the tagger, especially
    >for training purposes, using that port.
    >The French-based Windows version is called WinBrill:
    http://jupiter.inalf.cnrs.fr/WinBrill/winbrill.english.html
    >Again, I'm not sure how one would use this for tagging a whole document on
    >a Windows-based machine.
    >The regular Unix/Linux port of the Brill tagger is available here:
    http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
    >There may possibly be other ports to Windows that I am not aware of. I
    >just saw the following page which I had not run across previously. I
    >haven't downloaded the files, so know nothing about them:
    >http://www2d.biglobe.ne.jp/~htakashi/software/BRILL_E.HTM

    >Both MITRE's implementation and Brill's could be made to work on WINDOWS
    >(in principle...) under CYGWIN (which gives a UNIX-like command
    >interpreter to WINDOWS).
    (At this point, John Henderson got in:
    >The method for getting bash and tcsh and the other unix
    >tools (including gcc and such) under windows is to install the package
    >found at http://www.cygwin.com/
    >Those can help you install the taggers.<)

    Many thanks to all of you! It was really helpful.

    Regards,
    Veronika Koller

    P.S.: We tentatively decided on CLAWS.

    Mag.a Veronika Koller
    Department of English/Business English
    Vienna University of Economics and Business Administration
    Augasse 9
    A-1090 Vienna
    Tel.: 43/1/31336-4068
    Fax: 43/1/31336-747


