Corpora: Summary: Catalan & Galician corpora

From: Alex Franz (alex@google.com)
Date: Sat May 19 2001 - 01:06:37 MET DST

    Thank you very much to the following people for their extremely
    helpful responses to my query about Catalan and Galician corpora:

    Ines Diz
    Linda Oxnard
    Shlomo Izre'el
    Joan Soler i Bou
    Claus Pusch
    Teresa Cabre
    Mary Taffet
    Lluis de Yzaguirre
    Jorge Vivaldi
    Lidia Lluis
    Araceli Alonso

    Below is a summary of the information that I received.

    --Alex

    ---
    The "Ramón Piñeiro Centre for Humanities Research" is developing a
    Galician corpus.
    

    You can find information at:

    http://www.cirp.es/WXN/wxn/frames/proxectos.html

    ---
    There are three universities which I know are doing corpus research in Catalonia: Universitat de Barcelona, Universitat Pompeu Fabra and Universitat Politècnica de Catalunya.

    You might want to take a look at the multilingual corpus which the UPF have put together (texts in Catalan, Spanish, French, English and German) in specialised areas (law, environment, medicine, economy and IT).

    http://www.iula.upf.es/corpus/corpus.htm

    Take a look at http://www.iula.upf.es/corpus/eticca.htm for the tools and demos which they have put on line.

    At http://nipadio.lsi.upc.es/cgi-bin/demo/demo.pl you will find some demos for corpus tools in Catalan, Spanish and English which the UPC have put on line.

    Also, the UB are working on a number of corpora (including an oral one of colloquial Catalan and a written one of contemporary Catalan). I'm not sure of the exact URL and their server seems to be down at the moment, but the group is called Lincat and they are at www.ub.es.

    Finally, you might be interested to know that there is an automatic language identifier which includes Catalan at http://odur.let.rug.nl/~vannoord/TextCat/Demo/. I have used this with reasonable success to do focused web crawling for Catalan pages.

    ---
    Take a look at http://www.uni-tuebingen.de/romanistik/zfk/oller.html

    ---
    The Institut d'Estudis Catalans (IEC) has developed a corpus of contemporary Catalan of 52 million words. The corpus was conceived specifically as a reference corpus for dictionary-making, and this is in fact how the IEC uses it internally. The corpus is also accessible via the internet at http://pdl.iec.es/. At the same address you can find documentation on the corpus structure. Keep in mind that the corpus browser limits the results to 100 instances per query. One of the first results of this corpus has been the publication of a frequency dictionary of Catalan, containing, on paper and in electronic form, the statistical information obtained from the lemmas of the corpus.

    The IEC also has the PAROLE Catalan corpus, a 21-million-word corpus developed within the PAROLE project, financed by the European Commission.

    ---
    I suppose that you are looking for contemporary language samples and not for historical texts.

    I am more familiar with resources for Catalan, so I will concentrate on this language.

    As far as web-based written texts (for non-linguistic purposes) are concerned, the best platform to start from is <http://www.lincaweb.es> (or maybe it's <http://www.lincaweb.com>). This gives thousands of links to Catalan web sites, official, commercial and private ones.

    Among the resources you might find particularly useful are the online versions of Catalan newspapers and weekly magazines. These are also accessible from <http://www.lincaweb.es>, but here are some direct links (the newspapers come from different parts of Catalan-speaking Spain, most from Catalonia, one from the Valencian Land, one from the Balearic Islands):

    El Periódico: <http://www.elperiodico.com>
    Avui: <http://www.avui.com>
    Diari de Tarragona: <http://www.diaridetarragona.com>
    Diari de Girona: <http://www.diaridegirona.es>
    Diari El Segre: <http://www.diarisegre.com>
    Diari de Balears: <http://www.diaridebalears.com>
    Regió7: <http://www.regio7.com>
    El Temps: <http://www.eltemps.com>

    A good platform from which to start collecting a web corpus may also be the Catalan news portal Partal at <http://www.partal.com> (or, again, it may be: <http://www.partal.es>). They now have two versions, one for Catalonia, the other for the Land of València.
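
    For anyone harvesting pages from portals like these, the crawled text usually has to be filtered by language. Below is a minimal sketch of character n-gram rank-order identification, the Cavnar and Trenkle style of method that TextCat-type identifiers are built on. It is not the TextCat code itself: the function names, the toy training snippets and the penalty value are all assumptions made for illustration, and a usable filter would need far larger training samples and more languages (Spanish in particular).

```python
from collections import Counter

# Toy training text (illustrative assumption; real profiles need much more data).
SAMPLES = {
    "catalan": (
        "Aquesta és una pàgina escrita en català. La llengua catalana es "
        "parla a Catalunya, al País Valencià i a les Illes Balears. Els "
        "diaris publiquen cada dia notícies sobre política, cultura i esports."
    ),
    "english": (
        "This is a page written in English. The English language is spoken "
        "in many countries around the world. The newspapers publish news "
        "every day about politics, culture and sport."
    ),
}


def ngram_profile(text, n_max=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc_profile, lang_profile, miss_penalty=300):
    """Sum of rank differences; n-grams unknown to the language cost miss_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else miss_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for page in (
        "Notícies cada dia a Catalunya i a les Illes Balears",
        "The news every day around the world",
    ):
        print(identify(page, PROFILES), "<-", page)
```

    In a focused crawl, a page would be kept (and its links followed) only when identify() returns the target language for the page text.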

    As far as official texts are concerned, a sample of the Official Bulletins of the Catalan regional government (Diari Oficial de la Generalitat de Catalunya) is available on-line (<http://www.gencat.es>, then choose link _DOGC_). All the issues of this Bulletin, containing mainly legal texts, are also available for purchase on CD-ROM. See <http://www.gencat.es/nov_edit/> (actually, this is the page for new books published by the Government's publication service, but there will be a link back to the catalogue). If I remember correctly, there are also CD-ROMs available with the minutes of the Parliament sessions, but I have not seen these yet, and furthermore these CDs are quite costly.

    Now for academic corpora:

    A couple of years ago, the Institut d'Estudis Catalans published a huge two-volume frequency dictionary for literary and non-literary language (one volume each), both mainly based on written sources. These books come with a CD-ROM, but this contains only the frequency lists in various searchable forms, alas not the text resources the lists are based on. But perhaps these are available or accessible directly at the Institut d'Estudis Catalans. You might go to <http://www.iec.es> and look for a mail link.

    The University of Barcelona (<http://www.ub.es>) is working on an excellent, but hitherto unpublished reference corpus for Catalan, "Corpus de Català Contemporani de la Universitat de Barcelona", including written, semi-spontaneous and spontaneous oral texts. This corpus should be available on CD-ROM or on the web soon. You can read an introductory text to the oral corpus in the online version of the 'Zeitschrift für Katalanistik' at: <http://www.uni-tuebingen.de/romanistik/zfk.html> (then choose issue 13, Article "El COC del CUB"). You might also contact Emili Boix <boix@lincat.ub.es> or Núria Alturo <alturo@lincat.ub.es> who are both working on this corpus.

    The University Pompeu Fabra of Barcelona (<http://www.upf.es>) and the Catalunya Ràdio station are working on a hypermedia compilation of oral Catalan texts the primary goal of which will be the pronunciation training of radio speakers but which should also be made available for research and teaching purposes through the web. For this project, called DOPO, you might contact Oriol Camps of Catalunya Ràdio <ocamps.n@catradio.com> or Lluís de Yzaguirre of UPF <de_yza@upf.es>.

    The Institut d'Estudis Catalans, again, collected oral texts in the 1960s and 70s for a language atlas project, "Atles Lingüístic del Domini Català", which are now being processed electronically. CD-ROM versions of both the recordings and their transcriptions have been announced, but up to now only a selection of transcripts has been published in book form, together with an audio cassette containing the recordings. Please contact Mar Massanell of the Universitat Oberta de Catalunya <mmassanell@campus.uoc.es> to see when / if / where the electronic version of this (dialect) corpus is available.

    I know of some more spoken language corpora published in printed form, sometimes with the recordings on audio tape, but I do not know if this is of interest to you.

    As far as Galician is concerned, it might be useful to ask my colleague Johannes Kabatek at Tübingen university, who has published a corpus of spoken Galician (he certainly will give it to you in electronic form) and who is very well informed about Galician research and corpus projects; his mail is <kabatek@uni-tuebingen.de>.

    ---

    The most important Catalan corpus (60Mwords) is online at URL

    http://www.iec.es:120/

    You will find Catalan newspapers under TACTWEB in

    http://www.iula.upf.es/altres/evt/CECA1.htm

    (5Mwords in 125 single-day files).

    ---
    You will find information about an LSP corpus as well as NLP tools for Catalan at the following URL: http://www.iula.upf.es/corpus/corpusuk.htm This corpus is being compiled at the Institute for Applied Linguistics at the Pompeu Fabra University in Barcelona.

    ---
    Here are some addresses where you can find information about Galician corpora, whether in Galician or in Galician-Portuguese.

    http://www.uvigo.es/webs/h06/weba573/persoal/henr/recurs/bibl1.htm

    CORPUS DE REFERENCIA DO GALEGO ACTUAL http://www.cirp.es/WXD/wxd/prox/prxCorg1996.html

    Corpus documentale latinum Gallaeciae http://www.cirp.es/WXD/wxd/prox/prxCLat1998.html ---


