Re: [Corpora-List] Searching for NE annotated portuguese corpora...

From: sandra@icmc.usp.br
Date: Wed Sep 08 2004 - 13:21:29 MET DST


    Thamar,

    have a look at Lácio-Web Project:
    http://www.nilc.icmc.usp.br/lacioweb/english/index.htm

    where you can download the MAC-MORPHO corpus and use the tools associated
    with it. This may be of some use to you: MAC-MORPHO contains 1,167,183 words
    of journalistic text extracted from ten sections of the daily newspaper
    Folha de São Paulo (1994), and its tagset
    (http://www.nilc.icmc.usp.br/lacioweb/english/ConjEtiquetas.htm) uses
    additional tags besides the traditional POS ones.

    There is some more information about it below. I hope this helps.

    Sandra Aluísio
    NILC - University of São Paulo
    http://www.nilc.icmc.usp.br/nilc/index.html

    ------
    MAC-MORPHO is available for download in two versions:

    1) Version for linguistic research, for use with frequency counters and
    concordancers, for instance. This format preserves all tags included in
    MAC-MORPHO's Tagging Manual. Some files also contain XML tags for filename,
    title, subtitle, paragraph, and sentence, which were generated by the
    “Palavras” parser. You may also download this version by separate sections or
    by individual texts.

    2) Version suitable for training taggers. This version does not contain the
    tags that indicate that the material has not been tagged (<NA> ...</NA>), the
    XML tags for filename, title, subtitle, paragraph, and sentence generated by
    the “Palavras” parser, or the complementary tags for foreign words (EST),
    apposition (AP), data (DAD), telephone number (TEL), date (DAT) and time
    (HOR). Multiwords are separated; for example:

    a) the proper name, which in the research format is shown as
    “Rio=de=Janeiro_NPROP” has been separated into three parts, one in each line,
    with the same tags: “Rio_NPROP de_NPROP Janeiro_NPROP”;

    b) the prepositional phrase, which in the research format is shown as
    “apesar=de_PREP” has been separated into two parts, one in each line, with the
    same tags: “apesar_PREP de_PREP”.
     

    These changes have increased the size of the corpus to 1,221,468 words.
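
    The multiword separation described in (a) and (b) can be sketched roughly as
    follows. This is only an illustration, not the script used to build the
    corpus: it assumes each token has the form WORD_TAG with "=" joining the
    parts of a multiword, and the function name split_multiword is hypothetical.

```python
def split_multiword(token: str) -> list[str]:
    """Split a research-format token into one WORD_TAG item per word.

    Assumption: the tag follows the last underscore and "=" joins the
    parts of a multiword, as in the examples quoted in the message.
    """
    word, sep, tag = token.rpartition("_")
    if not sep:
        # No underscore found: not a WORD_TAG token, so leave it untouched.
        return [token]
    parts = [p for p in word.split("=") if p]
    if len(parts) < 2:
        # Single word (or a bare "=" sign): nothing to separate.
        return [token]
    # Every part inherits the tag of the original multiword.
    return [f"{part}_{tag}" for part in parts]


if __name__ == "__main__":
    for tok in ["Rio=de=Janeiro_NPROP", "apesar=de_PREP"]:
        # One part per line, mirroring the training-format layout.
        print("\n".join(split_multiword(tok)))
```

    Because each part of a multiword becomes its own token, a conversion of this
    kind raises the token count, which is consistent with the larger corpus size
    reported for the training version.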

    ------------

    Quoting Thamar Solorio <thamy@inaoep.mx>:

    > Hi!
    > I've been searching for portuguese corpora annotated with Named
    > Entities. So far I've only found raw corpora and portals to portuguese
    > analyzers such as the one from the VISL project, but it is only for
    > online use and it does not provide NE classification.
    > So, if anyone knows of an available portuguese corpus tagged with NE
    > I'll appreciate if you let me know.
    >
    > Thanks!
    >
    > Thamar Solorio
    > Coord. Ciencias Computacionales
    > Instituto Nacional de Astrofísica, Óptica y Electrónica
    > Luis Enrique Erro #1, Tonantzintla, Puebla
    > México
    >
    > http://ccc.inaoep.mx/~thamy
    >
    >
    >
    >


