Re: Corpora: Software for Corpus storing

eric@scs.leeds.ac.uk
Mon, 6 Sep 1999 10:15:19 +0100

Rudolf,
I would say it depends to at least some extent on what use(s) you have in mind.
We collected a c. 40-million-word corpus of News messages for Natural
Language Learning experiments; for this, speed of extraction of
huge numbers of word-patterns was important, and retrieving source info
was not, so we stored it as (a number of) plain ASCII text files with no
markup (other than a newline every c. 80 characters) to conform with basic
Unix textfile format and tools.
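
A minimal sketch of the kind of word-pattern extraction that such a plain-text
layout makes easy, assuming Python; the helper names (iter_words, count_ngrams,
count_ngrams_in_files) and the simple letters-and-apostrophes tokeniser are
illustrative assumptions, not the tools used in the experiments. Because the
newlines are only line-wrapping, word patterns are allowed to span line breaks.

```python
import re
from collections import Counter, deque
from typing import Iterable, Iterator

# Assumed tokeniser: runs of ASCII letters and apostrophes count as words.
WORD_RE = re.compile(r"[A-Za-z']+")


def iter_words(lines: Iterable[str]) -> Iterator[str]:
    """Yield lower-cased words from an iterable of text lines."""
    for line in lines:
        for match in WORD_RE.finditer(line):
            yield match.group().lower()


def count_ngrams(lines: Iterable[str], n: int = 2) -> Counter:
    """Count word n-grams, streaming line by line.

    The sliding window is kept across lines, so an n-gram that is split
    by a line wrap is still counted.
    """
    counts: Counter = Counter()
    window: deque = deque(maxlen=n)
    for word in iter_words(lines):
        window.append(word)
        if len(window) == n:
            counts[tuple(window)] += 1
    return counts


def count_ngrams_in_files(paths: Iterable[str], n: int = 2) -> Counter:
    """Sum n-gram counts over several plain ASCII files (hypothetical paths)."""
    total: Counter = Counter()
    for path in paths:
        with open(path, encoding="ascii", errors="replace") as handle:
            total.update(count_ngrams(handle, n))
    return total
```

Each file is read as a stream, so memory use depends on the number of distinct
patterns rather than on corpus size; the window is reset at each file boundary,
so no pattern spans two files.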

At the other extreme, we collected over a gigabyte of speech recordings
by non-native learners of English, annotated with graphemic, phonemic,
prosodic and "error" markup; this required a special editing tool
(provided by Entropic) and the file format was determined by this tool, to
allow parallel versions of each utterance and its annotations to be stored
in separate parallel files, and be displayed and edited aligned onscreen.

If you have in mind a general-purpose text corpus representative of
a language (or national-regional dialect) then I would suggest
copying the conventions used in the British National Corpus project, or else
the EU PAROLE project - then any software which has been (or will be) developed
for these will also be applicable to your corpus.

Eric

Eric Atwell, Distributed Multimedia Systems MSc Tutor, SOCRATES Coordinator,
and Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html