Re: [Corpora-List] Annotation without lexicons

From: Miles Osborne (miles@inf.ed.ac.uk)
Date: Tue Jan 28 2003 - 12:11:16 MET

    this could be tackled as a bootstrapping problem: given a (possibly limited)
    annotated training set and two (or more) POS taggers etc. initially trained on
    that data, *cotrain* the taggers on the unannotated data, so that each tagger's
    confident labels become extra training material for the other.
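
    a minimal sketch of such a co-training loop, under loud assumptions: the two
    "taggers" are toy frequency models over a word prefix and a word suffix
    (hypothetical stand-ins for real POS taggers), labelling is per word rather
    than per sentence, and the names FeatureTagger and cotrain, the confidence
    threshold and the tiny word lists are all made up for illustration.

```python
from collections import Counter, defaultdict


class FeatureTagger:
    """Toy tagger: predicts the most frequent tag seen for one feature of a word."""

    def __init__(self, feature):
        self.feature = feature              # function: word -> feature string
        self.counts = defaultdict(Counter)  # feature -> Counter of tags

    def train(self, labelled):
        """Rebuild the counts from a list of (word, tag) pairs."""
        self.counts = defaultdict(Counter)
        for word, tag in labelled:
            self.counts[self.feature(word)][tag] += 1

    def predict(self, word):
        """Return (tag, confidence); (None, 0.0) if the feature was never seen."""
        seen = self.counts.get(self.feature(word))
        if not seen:
            return None, 0.0
        tag, n = seen.most_common(1)[0]
        return tag, n / sum(seen.values())


def cotrain(seed, unlabelled, tagger_a, tagger_b, rounds=5, threshold=0.9):
    """Let each tagger label pool words it is confident about for the other."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabelled)
    for _ in range(rounds):
        tagger_a.train(train_a)
        tagger_b.train(train_b)
        remaining = []
        for word in pool:
            tag_a, conf_a = tagger_a.predict(word)
            tag_b, conf_b = tagger_b.predict(word)
            a_sure, b_sure = conf_a >= threshold, conf_b >= threshold
            if a_sure and b_sure and tag_a != tag_b:
                remaining.append(word)      # confident disagreement: leave it
                continue
            if a_sure:
                train_b.append((word, tag_a))   # A teaches B
            if b_sure:
                train_a.append((word, tag_b))   # B teaches A
            if not (a_sure or b_sure):
                remaining.append(word)
        if len(remaining) == len(pool):     # nothing new was labelled: stop
            break
        pool = remaining
    tagger_a.train(train_a)
    tagger_b.train(train_b)
    return tagger_a, tagger_b


# Illustrative toy data only: a modern-spelling seed, older-looking spellings unlabelled.
seed = [("cantar", "VERB"), ("hablar", "VERB"), ("casa", "NOUN"), ("mesa", "NOUN")]
unlabelled = ["cantava", "fablava", "fablar"]

tagger_a, tagger_b = cotrain(
    seed,
    unlabelled,
    FeatureTagger(lambda w: w[:3]),    # view A: first three letters
    FeatureTagger(lambda w: w[-2:]),   # view B: last two letters
)
print(tagger_a.predict("fablava"), tagger_b.predict("cantava"))
```

    in this toy run neither view can label "fablava" from the seed alone; it only
    becomes taggable after the prefix view has labelled "cantava" for the suffix
    view and the suffix view has labelled "fablar" for the prefix view. a real
    setup would swap in two genuinely different taggers and move whole sentences.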

    at this year's EACL, we have a paper that in part deals with cross-genre
    bootstrapping. for you, one could imagine viewing Old Spanish as being from a
    different genre to Modern Spanish (just!).

    Miles

    Quoting Mark Davies <mdavies@ilstu.edu>:

    > Corpus annotation is of course usually done with the aid of a lexicon
    > containing POS and lemma information. But imagine that you need to tag
    > and lemmatize a 1-2 million word corpus of a language for which you do
    > not have a lexicon. A variant of this might be the need to annotate a
    > corpus from the older stage of a language -- e.g. Middle English or Old
    > Spanish -- which is related to a modern language for which you do have
    > a lexicon. How is this best done?
    >
    > I've had to address this issue in creating several different corpora
    > and have developed my own approach to the problem, but I'm interested
    > in alternate approaches that others might have taken. I realize that
    > this might be a FAQ, but any pointers to relevant literature would be
    > helpful. Thanks in advance.
    >
    > Mark Davies
    >
    >
    > ====================================================
    > Mark Davies, Associate Professor, Spanish Linguistics
    > 4300 Foreign Languages, Illinois State University, Normal, IL 61790-4300
    > 309-438-7975 (voice) / 309-438-8083 (fax)
    > http://mdavies.for.ilstu.edu
    > ** Historical and dialectal Spanish and Portuguese syntax **
    > ** Corpus design and use / Web-database scripting / Distance education **
    > =====================================================


