Re: Corpora: Morfological ambiguity

From: Diana Maria de Sousa Marques Pinto dos Santos (Diana.Santos@informatics.sintef.no)
Date: Mon Jan 08 2001 - 12:55:04 MET

    At 20:55 07.01.01 +0000, Hristo Tanev wrote:
    >Dear all,
    >I work in the area of ambiguity resolution for
    >Bulgarian. I obtained the following result concerning
    >morphological ambiguity. I have measured the ratio
    >
    >Number of all morphological hypotheses for all words / Number of
    >words
    >
    >This ratio for Bulgarian is 1.27-1.33 and doesn't vary
    >too much.
    >My question is: does any of you know this average
    >ratio for English or another language? Does this ratio
    >depend on the genre?
    >
    >Hristo
    >

    Dear Hristo,

    Several things in your mail are vague. What is "all words"? And are you
    counting tokens or types?

    I assume that you are measuring in a corpus, so "all words" are all the
    words that occur in the corpus, not in a lexicon or some other list, and
    that a high-frequency word such as a preposition is counted as many times
    as it appears in the corpus, i.e., I assume you are counting TOKENS.
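
    The difference between the two counts can be sketched as follows. This is
    only an illustration: the lexicon, the tag labels and the five-word
    "corpus" are hypothetical and are not taken from any of the studies
    mentioned in this thread.

```python
# Hypothetical toy lexicon: word form -> set of possible analyses.
# Entries are illustrative only, not drawn from a real resource.
LEXICON = {
    "the": {"DET"},
    "man": {"NOUN", "VERB"},
    "saw": {"NOUN", "VERB"},
}

# Hypothetical five-token corpus.
TOKENS = ["the", "man", "saw", "the", "saw"]


def ambiguity_ratio(words, lexicon):
    """Number of analyses summed over `words`, divided by the number of words."""
    return sum(len(lexicon[w]) for w in words) / len(words)


# Token-based: every occurrence in the corpus counts.
token_ratio = ambiguity_ratio(TOKENS, LEXICON)

# Type-based: each distinct word form counts once.
type_ratio = ambiguity_ratio(sorted(set(TOKENS)), LEXICON)

print(f"per token: {token_ratio:.3f}, per type: {type_ratio:.3f}")
```

    In the token count a frequent unambiguous word such as "the" is counted
    at every occurrence and so has more weight than in the type count, so
    the two figures generally differ for the same text.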

    If this is true, you still have to explain which kind of morphological
    ambiguity you are interested in: anything from ambiguity between
    grammatical categories (PoS) only, to ANY possible distinction, even one
    that is systematically ambiguous in your language. For example, would you
    consider ARE (the form of the verb BE in English) 4 times ambiguous, or
    not ambiguous at all, since English never makes a morphological
    distinction between plural forms and the second person singular?
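
    The ARE example can be made concrete with a small sketch. The field names
    and tag values below are hypothetical; the point is only that the number
    of readings depends on which distinctions one decides to count.

```python
# Hypothetical analyses of the English form "are"; labels are made up
# for illustration and do not follow any particular tag set.
ARE_ANALYSES = [
    {"lemma": "be", "pos": "VERB", "person": "2", "number": "sg"},
    {"lemma": "be", "pos": "VERB", "person": "1", "number": "pl"},
    {"lemma": "be", "pos": "VERB", "person": "2", "number": "pl"},
    {"lemma": "be", "pos": "VERB", "person": "3", "number": "pl"},
]


def count_readings(analyses, fields):
    """Count distinct readings when only the given fields are distinguished."""
    return len({tuple(a[f] for f in fields) for a in analyses})


# Every person/number combination counted separately.
fine_grained = count_readings(ARE_ANALYSES, ["pos", "person", "number"])

# Only the grammatical category (PoS) counted.
pos_only = count_readings(ARE_ANALYSES, ["pos"])

print(fine_grained, pos_only)
```

    Under the fine-grained view the form has four readings; under the
    PoS-only view it has one, i.e. it is not ambiguous at all.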

    A long time ago (Medeiros et al. 1993) we took some measurements for
    European Portuguese, based on tokens in a corpus, measuring only PoS
    ambiguity, and then only among 4 kinds: word belonging to a closed class,
    verb, noun/adjective, or past participle. The number obtained was 1.02494.
    Not counting words that belong only to a closed class, the number rose to
    1.1398. I would advise you, though, to look more carefully both at the
    setup and at what the measures may mean before you directly compare
    languages (if that is what you have in mind).
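
    Why leaving out closed-class words raises the figure can be seen in a toy
    computation. The lexicon and token list below are hypothetical and do not
    reproduce the 1993 setup or its numbers; they only use the same four
    coarse classes.

```python
# Hypothetical toy lexicon over four coarse classes:
# "closed", "verb", "noun_adj", "participle". Illustrative entries only.
LEXICON = {
    "o": {"closed"},
    "de": {"closed"},
    "canto": {"verb", "noun_adj"},
    "casa": {"verb", "noun_adj"},
    "mesa": {"noun_adj"},
}

# Hypothetical token sequence (a bag of tokens, not a real sentence).
TOKENS = ["o", "canto", "de", "o", "mesa", "casa"]


def ratio(tokens, lexicon):
    """Average number of classes per token."""
    return sum(len(lexicon[t]) for t in tokens) / len(tokens)


# All tokens, closed-class words included.
all_ratio = ratio(TOKENS, LEXICON)

# Drop tokens whose only possible class is "closed".
open_tokens = [t for t in TOKENS if LEXICON[t] != {"closed"}]
open_ratio = ratio(open_tokens, LEXICON)

print(f"all tokens: {all_ratio:.3f}, without closed-only: {open_ratio:.3f}")
```

    Unambiguous closed-class tokens add one analysis each to both the
    numerator and the denominator, so removing them can only leave the
    average unchanged or push it up.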

    Other ambiguity measures for Portuguese that I know of are the ones
    published in Eckhard Bick's recent dissertation, and in Bacelar do
    Nascimento et al. (1993). References follow:

    Bacelar do Nascimento, Maria Fernanda, José Bettencourt Gonçalves, Lucília
    Chacoto, Paula Neto & Luísa Alice Santos Pereira. 1993. Ambiguidade
    morfológica no Português Fundamental. In Actas do 1.o Encontro de
    Processamento de Língua Portuguesa (Escrita e Falada) - EPLP'93. Lisboa,
    25-26 de Fevereiro de 1993, pp.101-106.

    Bick, Eckhard. 2000. The Parsing System "Palavras". Automatic Grammatical
    Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University
    Press.

    Medeiros, José Carlos, Rui Marques & Diana Santos. 1993. Português
    Quantitativo. In Actas do 1.o Encontro de Processamento de Língua
    Portuguesa (Escrita e Falada) - EPLP'93. Lisboa, 25-26 de Fevereiro de
    1993, pp.33-38.

    I hope these may be useful at least to those who are interested in
    Portuguese :-)
    Diana

    **************************************************************************
    Diana Santos                     Computational processing of Portuguese

    SINTEF Telecom and Informatics   Tel. (direct line) +47 22 06 73 12
    Forskningsveien 1                Tel. +47 22 06 73 00
    Box 124 Blindern                 Fax. +47 22 06 73 50
    N-0314 Oslo                      Email: Diana.Santos@informatics.sintef.no
    Norway                           http://www.portugues.mct.pt/
    **************************************************************************


