Re: Corpora: Corpora and XML

Lou Burnard (lou.burnard@computing-services.oxford.ac.uk)
Wed, 29 Sep 1999 12:13:48 +0100 (BST)

|
|On Tue, 28 Sep 1999, Kristina Kjellson wrote:
|
|> Is there anyone who has experience in annotating corpora using XML?

The differences between annotation using SGML and annotation using XML
are fairly trivial, mostly relating to the "surface form" of the
annotation. More specifically:

SGML                                XML

identifiers may be case-sensitive   identifiers are always case-sensitive

end-tags are optional               end-tags are mandatory

attribute values may be quoted      attribute values must be quoted

empty tags are indistinguishable    empty tags use a distinct syntax
from start-tags
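
One way to see these four differences in practice is to feed small fragments to an XML parser and note which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree module; the element names (s, w, pb) and the helper is_well_formed are invented for illustration and are not taken from any particular corpus.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Illustrative fragments; <s>, <w> and <pb> are hypothetical element names.
examples = {
    "quoted attribute, explicit end-tags": '<s><w type="FOO">word</w></s>',
    "unquoted attribute value": '<s><w type=FOO>word</w></s>',
    "omitted end-tag": '<s><w type="FOO">word</s>',
    "start- and end-tag differ in case": '<s><W type="FOO">word</w></s>',
    "empty element with distinct syntax": '<s><w type="FOO">word</w><pb/></s>',
    "empty element as a bare start-tag": '<s><w type="FOO">word</w><pb></s>',
}

for label, fragment in examples.items():
    print(f"{label}: {is_well_formed(fragment)}")
```

Only the first and fifth fragments should be accepted; an SGML parser working with a suitable DTD could accept the others as well.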

So, in the general case, taking an SGML marked up corpus and turning
it into an XML one is far from rocket science, indeed it's barely
bicycle-science.

One possible difficulty in the XML approach is that it adds to the
verbosity of the markup. In the case of the BNC, for example, which
has 100 million words tagged like this

<w FOO>word <w BAR>another word

an XML version would have to look like this

<w type="FOO">word </w><w type="BAR">another word</w>

An overhead of at least 8 bytes for each of 100 million words increases
the size of the corpus by at least 800 MB.
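
The conversion and the arithmetic above can be sketched in a few lines of Python. This is a toy illustration only: the regular expression assumes that every word element looks exactly like the two in the example, and it ignores everything else a real corpus contains (other elements, entity references and so on). The names SGML_WORD and sgml_words_to_xml are made up for this sketch. On this two-word sample the measured difference is 11 bytes per element, which is consistent with the "at least 8" estimate.

```python
import re

# Matches one BNC-style SGML word element as shown above: <w TAG>text
# (everything up to the next "<" is treated as the element's content).
SGML_WORD = re.compile(r"<w ([A-Za-z0-9-]+)>([^<]*)")


def sgml_words_to_xml(text: str) -> str:
    """Quote the tag as a type attribute and add the explicit end-tag."""
    return SGML_WORD.sub(r'<w type="\1">\2</w>', text)


sgml = "<w FOO>word <w BAR>another word"
xml = sgml_words_to_xml(sgml)
print(xml)

# Extra bytes per word element in this sample (ASCII, so characters == bytes).
words = len(SGML_WORD.findall(sgml))
overhead_per_word = (len(xml) - len(sgml)) / words
print(overhead_per_word)

# Scaling the lower bound of 8 bytes per word to 100 million words.
print(8 * 100_000_000 / 1_000_000, "MB")
```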

ON THE OTHER HAND

(a) the extra markup doesn't have to be inserted or maintained manually

(b) the data can be compressed for storage and uncompressed on the fly

(c) most compression algorithms are particularly effective where there
are short, high-frequency tokens, as in this case
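
Points (b) and (c) can be checked roughly with a general-purpose compressor. The sketch below builds the same synthetic word stream in both surface forms and compares raw and zlib-compressed sizes. The vocabulary, the tag set and the function names are invented for the experiment; the data is random filler rather than BNC text, and zlib is just one convenient compressor, not necessarily what any corpus software uses.

```python
import random
import zlib

# Hypothetical vocabulary and tag set, used only to build synthetic data.
WORDS = ["the", "corpus", "of", "annotation", "word", "another", "is",
         "tagged", "a", "markup", "version", "in", "text", "and", "for", "by"]
TAGS = ["FOO", "BAR", "BAZ", "QUX"]


def build_corpora(n_words: int, seed: int = 0) -> tuple[bytes, bytes]:
    """Return the same random word stream in SGML-style and XML-style markup."""
    rng = random.Random(seed)
    sgml_parts, xml_parts = [], []
    for _ in range(n_words):
        word, tag = rng.choice(WORDS), rng.choice(TAGS)
        sgml_parts.append(f"<w {tag}>{word} ")
        xml_parts.append(f'<w type="{tag}">{word} </w>')
    return "".join(sgml_parts).encode("ascii"), "".join(xml_parts).encode("ascii")


def size_report(n_words: int = 50_000) -> dict[str, int]:
    """Raw and zlib-compressed byte counts for both forms."""
    sgml, xml = build_corpora(n_words)
    return {
        "sgml_raw": len(sgml),
        "xml_raw": len(xml),
        "sgml_zlib": len(zlib.compress(sgml, 9)),
        "xml_zlib": len(zlib.compress(xml, 9)),
    }


report = size_report()
for name, size in report.items():
    print(f"{name}: {size} bytes")
```

The figure to look at is how much of the raw difference between the two forms survives compression; on data like this, the repeated end-tags and attribute syntax should cost far less than their raw byte count suggests.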

As an example, consider the BNC Sampler CD (for details see
http://info.ox.ac.uk/bnc/getting/sampler.html) -- of the four software
systems on the CD, one (Qwick) operates against a compressed XML
version of the corpus. See the Qwick home page at
http://www-clg.bham.ac.uk/QWICK/

If you're talking about converting an SGML document type definition to
an XML-conformant one, however, things get a little more complex (but
not much). Some constructs are not allowed in XML (e.g. inclusion
exceptions and some mixed content models), and neither are some
attribute datatypes.
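
As a rough illustration of what such a conversion has to look for, the following Python sketch scans element declarations for two SGML-only features: inclusion or exclusion exceptions, and tag omission indicators. The latter are not mentioned above but are likewise absent from XML DTDs, where end-tags are mandatory. It is a regex heuristic under simplifying assumptions (no comments inside declarations, no parameter entities), not a DTD parser, and it does not check mixed content models or attribute datatypes. The sample DTD and all names are hypothetical.

```python
import re

# One element declaration, e.g. <!ELEMENT p - O (#PCDATA | hi)* +(note)>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([^>]*)>", re.IGNORECASE)
# SGML tag omission indicators ("- -", "- O", "O O") straight after the name.
OMISSION = re.compile(r"^[-Oo]\s+[-Oo]\s")
# Inclusion "+(...)" or exclusion "-(...)" exceptions after the content model.
EXCEPTION = re.compile(r"\s([+-])\(")


def sgml_only_features(dtd: str) -> list[tuple[str, str]]:
    """List (element name, feature) pairs that an XML DTD cannot express."""
    findings = []
    for name, rest in ELEMENT_DECL.findall(dtd):
        if OMISSION.match(rest):
            findings.append((name, "tag omission indicators"))
        for sign in EXCEPTION.findall(rest):
            kind = "inclusion" if sign == "+" else "exclusion"
            findings.append((name, f"{kind} exception"))
    return findings


# Hypothetical SGML DTD fragment, written for this example only.
sample_dtd = """
<!ELEMENT text - - (p+) +(note)>
<!ELEMENT p    - O (#PCDATA | hi)*>
<!ELEMENT hi   (#PCDATA)>
"""

for element, feature in sgml_only_features(sample_dtd):
    print(f"{element}: {feature}")
```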

If your DTD is based on the Text Encoding Initiative, you'll be glad
to know that *any* TEI-conformant DTD can be automatically converted
to an XML version using the Pizza Chef software. For details, see
www.hcu.ox.ac.uk/TEI/pizza.html

HTH

Lou

|
|There is the Corpus Encoding Standard, which uses SGML DTD's to specify
|its formatting. Perhaps you could use these to form XML DTD's (which are
|somewhat simpler).
|
|the CES can be found at:
|http://www.cs.vassar.edu/CES/
|
|Converting an SGML DTD to XML:
|http://www.xml.com/xml/pub/98/07/dtd/
|
|Good luck,
|
|Arjen Poutsma
|
|
|

----------------------------------------------------------------
Lou Burnard http://users.ox.ac.uk/~lou
----------------------------------------------------------------