At the other extreme, we collected over a gigabyte of speech recordings
by non-native learners of English, annotated with graphemic, phonemic,
prosodic and "error" markup. This required a special editing tool
(provided by Entropic), and the file format was determined by that tool,
which stores each utterance and its annotations in separate parallel
files so that they can be displayed and edited aligned onscreen.
If you have in mind a general-purpose text corpus representative of
a language (or a national or regional dialect), then I would suggest
copying the conventions used in the British National Corpus project, or else
the EU PAROLE project; then any software which has been (or will be) developed
for these will also be applicable to your corpus.
Eric
Eric Atwell, Distributed Multimedia Systems MSc Tutor, SOCRATES Coordinator,
and Director, Centre for Computer Analysis of Language And Speech (CCALAS)
School of Computer Studies, University of Leeds, LEEDS LS2 9JT, England
EMAIL: eric@scs.leeds.ac.uk TEL: (44)113-2335430 FAX: (44)113-2335468
WWW: http://www.scs.leeds.ac.uk/scs/public/staff/eric.html