RE: [Corpora-List] unencumbered corpora

From: Santos Diana (Diana.Santos@sintef.no)
Date: Sun Jan 23 2005 - 18:55:49 MET

Next message: Santos Diana: "RE: [Corpora-List] unencumbered corpora"

Previous message: Przemek Kaszubski: "Re: [Corpora-List] My semantic prosody questionnaire"
Maybe in reply to: Lou Burnard: "[Corpora-List] unencumbered corpora"
Next in thread: Santos Diana: "RE: [Corpora-List] unencumbered corpora"

Dear Lou,
Linguateca has been producing several "unencumbered corpora" of Portuguese since its launch in 2000.

Of course I don't know what a "major European language" is, and would like to have some definition (or exhaustive listing, whatever is easiest for you), but in case there are other people who may be interested in (richly) annotated Portuguese corpora, here is the information:

Thanks to Eckhard Bick's collaboration and his PALAVRAS parser, and to the data providers (the newspapers Público and Folha de São Paulo), we have created:

- Floresta Sintá(c)tica, a treebank of (currently) 168,000 words with both dependency and phrase-structure information, human-revised and available in several formats (there's even XML)
- Floresta Virgem, 2 million words, the above material unrevised
- automatically annotated CETENFolha (34 million words, Brazilian Portuguese)
- automatically annotated CETEMPúblico (180 million words, Portuguese from Portugal)

All these data are available for research and development, by industrial and academic institutions, free of charge. See www.linguateca.pt, choose "Acesso a recursos" and then "This page in English" (I presume...)

Best regards
Diana
============
Diana Santos
www.linguateca.pt
Linguateca, Pólo de Oslo, SINTEF ICT
Pb 124 Blindern, N-0314 Oslo, Norway

________________________________

From: owner-corpora@lists.uib.no on behalf of Lou Burnard
Sent: Fri 21-01-2005 19:55
To: corpora@hd.uib.no
Subject: [Corpora-List] unencumbered corpora

Can anyone point me to any annotated language corpora which are freely
available under something like the GNU General Public Licence? All the ones I
have thought of so far seem to be available only under some kind of
complicated licensing scheme which precludes (e.g.) commercial
exploitation, unrestricted copying, etc. And cost money.

I'd like to have a corpus of a reasonable size (1 million+ words) in any
European language (tho English or French are preferable) with some
kind of word-level annotation, which I can hack about, use in teaching,
and put on a freely-distributable CD, without worrying about copyright
lawyers. There *must* be some somewhere!

It doesn't even have to be in XML -- though it will be when I've
finished with it.

Lou Burnard


This archive was generated by hypermail 2b29 : Sun Jan 23 2005 - 19:01:51 MET