Re: Corpora: Corpora and XML

Ted E. Dunning (ted@hncais.com)
Tue, 28 Sep 1999 18:25:31 -0700 (PDT)

I have not annotated corpora using XML, but I do have experience
encoding other sorts of data in XML. XML will help you in two ways.
First, the same document can be rendered with a number of different
style sheets, so that various aspects of the texts can be made
apparent. Second, parsing XML is soooo much easier than parsing SGML
(you can do it even without the DTD).
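
To illustrate the point about parsing without a DTD, here is a
minimal sketch in Python using the standard library's
xml.etree.ElementTree. The sample sentence and the element and
attribute names (s, w, pos) are invented for the example and do not
come from any particular encoding scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; the s/w/pos names are assumptions.
SAMPLE = (
    '<s id="s1"><w pos="DT">The</w> <w pos="NN">corpus</w> '
    '<w pos="VBZ">grows</w></s>'
)

def tagged_words(xml_text):
    """Return (word, pos) pairs from a well-formed fragment.

    No DTD is consulted: the parser only needs the markup to be
    well-formed.
    """
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]

print(tagged_words(SAMPLE))
```

Because nothing is validated here, a document that breaks the rules
of its DTD but is still well-formed will parse without complaint.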

One downside of XML is that the markup tends to be voluminous. In our
XML-encoded transactions, we find that gzip alone compresses the data
about 10x. Without the XML, the same transactions compress only about
2x, which suggests that the markup accounts for most of the bytes, so
the overhead of XML is considerable.
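
If you want to check this on your own data, a rough sketch along the
following lines compares gzip ratios for the same records with and
without markup. The records, field names and element names below are
made up for illustration, so the ratios it prints say nothing about
our transactions or about your corpus.

```python
import gzip

def gzip_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))

# Hypothetical transaction records: (id, amount, currency).
records = [(i, (i * 7919) % 10007, "USD") for i in range(2000)]

# The same records as bare tab-separated text and as verbose XML.
plain = "".join(f"{i}\t{amt}\t{cur}\n" for i, amt, cur in records).encode()
marked_up = "".join(
    f"<transaction><id>{i}</id><amount>{amt}</amount>"
    f"<currency>{cur}</currency></transaction>\n"
    for i, amt, cur in records
).encode()

print(f"plain: {len(plain)} bytes, ratio {gzip_ratio(plain):.1f}x")
print(f"XML:   {len(marked_up)} bytes, ratio {gzip_ratio(marked_up):.1f}x")
```

Substituting a sample of your own files for the two byte strings
gives a quick estimate of how much of the volume is markup.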

This overhead will probably not mean that much to you in terms of disk
space, since compression is so easy. Where it can hurt is in I/O and
processing time. Many corpus analysis applications are I/O bound, and
increasing the file size by using XML will probably slow them down
considerably. Depending on your application, this may well not matter
to you.

If this slowdown does turn out to be a real issue in your application,
there are ways around the problem. The simplest workaround is to store
the raw text without the annotations and keep the annotations
separately in a compact format that refers back to the raw text. Data
stored this way can readily be converted back to XML as needed,
provided it maintains the logical constraints of the related DTD.
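
As a rough sketch of what such a separate annotation store could look
like, the Python below keeps annotations as (start, end, tag)
character offsets into the raw text and rebuilds inline XML on
demand. The offset format, the tag names and the to_xml helper are
assumptions made for this example. It handles only flat,
non-overlapping spans and does no validation against a DTD.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def to_xml(text, spans, root="s"):
    """Rebuild inline XML from raw text plus (start, end, tag) spans.

    Assumes the spans do not overlap or nest.
    """
    out, pos = [], 0
    for start, end, tag in sorted(spans):
        out.append(escape(text[pos:start]))            # text before the span
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))                     # trailing text
    return f"<{root}>{''.join(out)}</{root}>"

# Raw text and its annotations live in separate structures.
text = "Ted parses XML & SGML"
spans = [(0, 3, "name"), (11, 14, "term"), (17, 21, "term")]

xml_text = to_xml(text, spans)
print(xml_text)

# Round trip: stripping the markup again recovers the raw text.
recovered = "".join(ET.fromstring(xml_text).itertext())
print(recovered == text)
```

An analysis pass that only needs the words can then read the raw text
directly and never touch the annotations at all.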

I can provide more details if necessary.

>>>>> "ide" == Nancy Ide <ide@loria.fr> writes:

ide> Kristina Kjellson wrote:

>> Is there anyone who has experience in annotating corpora using
>> XML?
>>

ide> You should look at the Corpus Encoding Standard (CES) at

ide> http://www.cs.vassar.edu/CES/

ide> The specification is currently in SGML but we are in the
ide> process of changing it to XML.