Corpora: Re: Subsets and "partially-tagged" corpora

From: Diana Maria de Sousa Marques Pinto dos Santos (Diana.Santos@informatics.sintef.no)
Date: Thu May 11 2000 - 13:17:38 MET DST

    Dear Mark,

    >I am considering an alternative scheme in which I tag just the most common
    >words/forms for a given syntactic or verbal category, such as the 100 most
    >common nouns and infinitives, the 25 most common adjectives, the 35 most
    >common preterites, etc. The "tagged" elements would be identified by a

    It is not clear what you mean by "if I tag just the 100 most common
    infinitive forms", since later in your mail you ask for a quantitative
    idea of how much of the whole infinitive population these would cover.

    In order to know which are the 100 most common infinitive forms in your
    corpus, I suppose you would have to count (and order) all infinitives -- or
    at least all your candidate infinitives -- by their number of
    occurrences, and that would already give you the estimate you are looking
    for. Or are you using an external reference source for the list of most
    frequent items?
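    A minimal sketch of that counting step, assuming the infinitive
    candidates have already been extracted into a plain list of forms. The
    function name and the toy data are hypothetical, for illustration only:

```python
from collections import Counter

def coverage_of_top_n(tokens, n):
    """Share of all occurrences accounted for by the n most frequent forms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# Hypothetical toy list of infinitive candidates (not real corpus data).
candidates = ["ser", "ter", "ser", "fazer", "ser", "ter", "dizer", "ir"]

# The two most frequent forms ("ser", "ter") cover 5 of the 8 occurrences.
print(coverage_of_top_n(candidates, 2))  # 0.625
```

    With n = 100 on the real list of candidates, the same ratio is the
    coverage estimate asked about.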

    As for categorial ambiguity, of which you say you are aware, that is
    precisely why you would need to tag the occurrences in your corpus (and
    not only classify them with a morphological analyser). If every form could
    be only an infinitive or only a noun, a lexicon plus a morphological
    analyser would be enough. It is precisely because most FORMS can belong to
    different categories that you need to tag a corpus.
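    To illustrate the point, here is a small sketch of what a lexicon lookup
    alone returns. The lexicon entries and tag names are assumptions made up
    for the example, not taken from any real analyser:

```python
# Toy lexicon, for illustration only: each form maps to the set of
# categories it can take. Entries and tag names are assumptions.
TOY_LEXICON = {
    "jantar": {"N", "V_INF"},   # 'dinner' / 'to dine'
    "poder": {"N", "V_INF"},    # 'power' / 'to be able to'
    "falar": {"V_INF"},
    "mesa": {"N"},
}

def possible_categories(form, lexicon):
    """Return every category the lexicon allows for a form (empty set if unknown)."""
    return lexicon.get(form.lower(), set())

def ambiguous_forms(lexicon):
    """Forms that a lexicon lookup alone cannot assign to a single category."""
    return sorted(form for form, cats in lexicon.items() if len(cats) > 1)

print(ambiguous_forms(TOY_LEXICON))  # ['jantar', 'poder']
```

    For the ambiguous forms, only the context of each occurrence can decide
    the category, which is what tagging provides.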

    You ask for similar studies done on related languages. I'm not sure how
    similar -- or interesting to you -- the following is: some years ago we
    conducted a study on Portuguese of "partial tagging" understood in a
    different way: we used very broad categories (only six), multitagged a
    small corpus with them (i.e., assigned all possible tags to each wordform,
    with our configurable morphological analyser), and then studied the amount
    of manual revision required to achieve one category per wordform (that is,
    a fully disambiguated corpus). This is reported in Medeiros et al. (1993)
    and Santos (1996b) [both in Portuguese], available from
    http://www.portugues.mct.pt/Diana/public.html
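
    The general idea of multitagging and then measuring the remaining
    revision work could be sketched as follows. This is not the procedure of
    the cited studies: the analyser stand-in, the tag names and the sample
    sentence are all hypothetical.

```python
def multitag(tokens, analyse):
    """Attach to each wordform every tag the analyser proposes for it."""
    return [(tok, frozenset(analyse(tok))) for tok in tokens]

def revision_load(multitagged):
    """Count tokens left with more than one tag, and their share of the corpus."""
    pending = sum(1 for _, tags in multitagged if len(tags) > 1)
    share = pending / len(multitagged) if multitagged else 0.0
    return pending, share

# Hypothetical stand-in for a morphological analyser with broad categories.
def toy_analyse(form):
    table = {
        "o": {"DET", "PRON"},
        "jantar": {"N", "V"},
        "foi": {"V"},
        "bom": {"ADJ"},
    }
    return table.get(form.lower(), {"OTHER"})

tagged = multitag(["O", "jantar", "foi", "bom"], toy_analyse)

# Two of the four tokens still carry more than one tag.
print(revision_load(tagged))  # (2, 0.5)
```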

    For something of more immediate interest, I suggest you consult our AC/DC
    service, which provides access to modern Portuguese corpora at
    http://cgi.portugues.mct.pt/acesso/. Two of the corpora are already parsed
    with Eckhard Bick's CG parser for Portuguese, and we hope to have the
    parsed version of the remaining corpora ready soon.
    Look for the distribution of infinitives or adjectives (example:
    [pos="ADJ"]; ), select the option "Distribuição de lemas"
    ('Lemmata distribution') in the field "Resultado", and see whether the
    result can be of use to you.

    Greetings,
    Diana

    **************************************************************************
    Diana Santos                     Computational processing of Portuguese

    SINTEF Telecom and Informatics   Tel. (direct line) +47 22 06 73 12
    Forskningsveien 1                Tel. +47 22 06 73 00
    Box 124 Blindern                 Fax. +47 22 06 73 50
    N-0314 Oslo                      Email: Diana.Santos@informatics.sintef.no
    Norway                           http://www.portugues.mct.pt/
    **************************************************************************


