[Corpora-List] Project Lácio-Web --- Second Release

From: Sandra Maria Aluísio (sandra@icmc.usp.br)
Date: Tue Jun 29 2004 - 22:16:51 MET DST

    Dear colleagues,

    We are pleased to announce the second release of the Lácio-Web webpage. Lácio-Web is a project aimed at providing corpora for Brazilian Portuguese and software tools for computational linguistic processing.

     

    As a result of the first release, launched on January 20th, two corpora were made available:

    - a version of Lácio-Ref (a reference corpus with 4,156,816 words) comprising five text genres (informative, scientific, prose, poetry and drama), for research and for building subcorpora, and

    - MAC-Morpho, a POS-annotated corpus with 1,167,183 words, from the newspaper Folha de São Paulo, 1994.

     

    For the second release, Lácio-Ref has been enhanced with texts from the following genres: legal, scientific, informative and instructional. The Lácio-Ref Corpus consists of 4,278 files with 8,291,818 words at the time of its second release.

    A parallel corpus, Par-C, has also been made available, with 646 text files in English and 646 in Portuguese from the Revista Pesquisa Fapesp. The total number of words in the parallel corpus is 893,283.

    Apart from these corpora, a tool to build English-Portuguese comparable corpora for the legal genre has also been made available. To support it, a reference corpus of English legal texts (Ref-Ig) has been compiled. It contains 29 texts with a total of 61,149 words and will be enlarged in the future.

     

    All in all, Lácio-Web contains 5,708 files with a total of 10,413,524 words.

     

    The project also makes available several computational linguistics tools, such as frequency counters, concordancers and three POS taggers trained on the MAC-Morpho corpus: MXPOST, TreeTagger and Brill TBL.

     

    These new facilities are available from the project webpage:

    http://www.nilc.icmc.usp.br/lacioweb

     

    Cordially,

     

    Lácio-Web Team

     


