Re: compiling a corpus (of historical Romance language texts)

Mark Davies (mdavies@rs6000.cmp.ilstu.edu)
Thu, 25 Jan 1996 23:00:54 -0600

At 12:59 PM 1/25/96 +0000, you wrote:
>I would like to receive information regarding technical/software tools
>useful for the compilation of a corpus (these are Italian medieval texts
>to be used for research in diachronic syntax).

I'm sure that others will respond with some good bibliographical references
on how to put together a corpus, but I thought I'd just add my .02, based on
my personal experience.

I've put together a 5,500,000 word corpus of historical Spanish (1200-1900),
and a 1,650,000 word corpus of historical Portuguese (1300-1700), both of
which I believe are the most extensive ones in existence (I've also put
together more than 10,000,000 words of Modern Spanish, 2,00,000 words of
Modern Portuguese, 1,200,000 words of Medieval Spanish Bibles with the
accompanying Hebrew and Latin source texts, and a parallel German / Old
English / (soon Middle English) / Modern English corpus of the Gospel of Luke).

Regarding the historical Spanish and Portuguese texts, I scanned in all of
the Portuguese texts by hand (no fun), as well as all of the post-1500
Spanish texts and some of the pre-1500 Spanish texts (the others are from
the ADMYTE CD-ROM, Vol. 0). I then used _WordCruncher_ to create an
every-word index, and am now able to use WordCruncher to perform proximity
and Boolean searches on the data.
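
For readers without access to WordCruncher, the idea of an every-word index
with Boolean and proximity searching can be illustrated with a short Python
sketch. This is only a toy under stated assumptions, not a description of how
WordCruncher works internally: the function names (build_index, boolean_and,
near), the simple letters-only tokenizer, and the sample sentence are all
invented for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Toy tokenizer (an assumption): lowercase, keep runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())

def build_index(text):
    """Map every word form to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(tokenize(text)):
        index[word].append(pos)
    return index

def boolean_and(index, a, b):
    """Boolean AND: True if both word forms occur somewhere in the text."""
    return a in index and b in index

def near(index, a, b, window=5):
    """Proximity search: (pos_a, pos_b) pairs with b within `window` tokens of a."""
    return [(i, j)
            for i in index.get(a, [])
            for j in index.get(b, [])
            if i != j and abs(i - j) <= window]

# Made-up sample sentence, not taken from any of the corpora described here.
sample = "quiero fazer lo que el rey mando fazer"
idx = build_index(sample)
print(idx["fazer"])                        # positions of every "fazer"
print(boolean_and(idx, "rey", "quiero"))   # do both forms occur?
print(near(idx, "rey", "fazer", window=2)) # "fazer" within 2 tokens of "rey"
```

Because the index stores positions for every word form, both kinds of query
reduce to lookups and simple comparisons rather than rescanning the text.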

It is not a "tagged" corpus (where each word is assigned a part of speech,
etc.), but luckily, in the Romance languages, it's possible to have
WordCruncher create a file containing all of the infinitives or all of the
forms of certain verbs, which can then be used in subsequent studies. Most
of my work has focused on infinitival constructions, which can be easily
identified. Doing research on DET+N constructions or ADV placement, or
something like that, wouldn't be much fun without a tagged corpus.
Unfortunately, since you are going to find so much variation in forms in the
older stages of a language, I'm not sure how well a "tagger" would cope anyway.
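
The point about identifying infinitives by form alone can be sketched in
Python. This is a rough illustration, not the WordCruncher procedure: it
assumes that infinitives can be approximated as word forms ending in -ar,
-er, or -ir, and the function name candidate_infinitives, the stoplist
parameter, and the sample sentence are hypothetical. A pattern like this
also picks up non-verbs with the same endings, so the output is a list of
candidates to be checked by hand.

```python
import re
from collections import Counter

# Assumption: an infinitive is any word form ending in -ar, -er, or -ir.
INFINITIVE = re.compile(r"[^\W\d_]*[aei]r")

def candidate_infinitives(text, stoplist=frozenset()):
    """Count word forms that look like infinitives, skipping known non-verbs."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words
                   if INFINITIVE.fullmatch(w) and w not in stoplist)

# Made-up sample sentence with older spellings, for illustration only.
sample = "quiso dezir que la mujer non podia fazer nin dezir cosa"

print(candidate_infinitives(sample))
# "mujer" is a noun that happens to end in -er; a hand-built stoplist removes it.
print(candidate_infinitives(sample, stoplist={"mujer"}))
```

Matching on the ending rather than on a fixed list of verbs means that
variant spellings such as "dezir" and "fazer" are caught without having to
be listed in advance.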

If you'd like more info, you might want to consult an article I had
published in _La Coronica_ (Spring 1995) that summarizes the steps I went
through in putting the corpus together. Also, please feel free to email me
or to visit my WWW page that summarizes some of the corpora that I've put
together, and some of the articles I've published that are based on these
corpora (http://www.ilstu.edu/~mdavies/corpora.htm).

==================================================================
Mark Davies, Assistant Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975 email:mdavies@ilstu.edu
Fax:309/438-8038 http://www.ilstu.edu/~mdavies/welcome.htm
==================================================================