[Corpora-List] Lácio-Web Project --- First release

From: Sandra Maria Aluísio (sandra@icmc.usp.br)
Date: Tue Jan 20 2004 - 21:30:02 MET

    Dear colleagues,

    We are pleased to announce the first release of the Lácio-Web webpage, aimed at providing corpora for Brazilian Portuguese and software tools for computational linguistic processing.

    Six corpora will be available by the end of the Lácio-Web Project in May 2004. This first release makes two of them available: a version of Lácio-Ref, for research and for generating subcorpora, and MAC-Morpho, for download. To obtain the first public release, please visit the webpage at

    http://www.nilc.icmc.usp.br/lacioweb

    Further details of the two corpora being released are given below. General information is available on the webpage above.

     

    Lácio-Ref

    This version of the reference corpus has 4,156,816 words. It comprises texts from five genres (news, scientific, prose, poetry and drama), several text types (such as reports, papers, chronicles and letters), various domains (such as education, engineering and politics) and different media (magazines, Internet pages and books). Lácio-Ref is available for research, and the subcorpora generated from it can be downloaded in two formats: one with XML headings that carry bibliographic data, and another with the title, subtitles, authorship and plain text.

    MAC-Morpho

    MAC-Morpho has 1,167,183 words from the newspaper Folha de São Paulo, 1994. It has been tagged with Eckhard Bick's Palavras parser (http://visl.hum.sdu.dk) and mapped to the tagset of the Lácio-Web project. The morphosyntactic tags have been manually revised. MAC-Morpho is available for download in two formats:

    1) for linguistic research, for example with frequency counters and concordancers;

    2) for training taggers, in a form that allows the tagset to be altered. For instance, some subspecification of the tags has been removed and multiword items have been separated. These changes increased the size of the corpus to 1,221,468 words.

    The Lácio-Web Project will also make computational linguistics tools available. This first release includes frequency counters and concordancers, which give users a quick view of the subcorpora they generate. New tools, such as morphosyntactic taggers, will be made available in the future.

    Cordially,

     

    Lácio-Web Team


