[Corpora-List] OPUS v0.2 is available

From: Jörg Tiedemann (joerg@stp.ling.uu.se)
Date: Sat Jul 12 2003 - 12:06:13 MET DST

    OPUS is an open-source parallel corpus, available from
    http://logos.uio.no/opus/

    Version 0.2 of the corpus contains roughly 30 million tokens
    in 60 languages. OPUS is sentence-aligned (1830 language pairs),
    tokenized, and partly tagged.
     
    The following subcorpora are included:
       OpenOffice.org   ca.  2.5 million words    6 languages
       PHP manuals      ca.  3.2 million words   21 languages
       KDE messages     ca. 20.5 million words   60 languages
       KDE manuals      ca.  3.8 million words   24 languages

    More information can be found on the OPUS home page.

                          ---------------------------
                          Jörg Tiedemann (http://stp.ling.uu.se/~joerg/)
                          Lars Nygaard (http://folk.uio.no/larsnyg/)

    =======================================================================

    The following tools were used (in addition to standard GNU tools):

    * align - sentence aligner (based on Gale & Church, 1993)
    * OpenNLP & Grok, Jason Baldridge and Gann Bierner
      http://grok.sourceforge.net/
    * TnT - Statistical Part-of-Speech Tagging, Thorsten Brants
      http://www.coli.uni-sb.de/~thorsten/tnt/
    * TreeTagger - Decision Tree Tagger, Helmut Schmid
      http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
    * ChaSen - Japanese tokenizer + tagger
      http://chasen.aist-nara.ac.jp/
    * recode - convert between various character encodings
      http://www.iro.umontreal.ca/contrib/recode/HTML/
    * tidy - validate, correct, and pretty-print XML files
      http://www.w3.org/People/Raggett/tidy/
    * Uplug - tokenizer, sentence-splitter, XML-tools
      http://stp.ling.uu.se/plug/


