RE: Corpora: Corpora and XML

Andrew Bredenkamp (andrewb@dfki.de)
Wed, 29 Sep 1999 14:20:57 +0200

Sorry to be picky, but I would not be *so* blase about the ease with which
SGML can be converted to XML. In principle Lou is almost right: there is not
much difference. In practice it depends very much on your SGML DTD.

Admittedly, it is easy enough to automatically change any optional end-tag
to an obligatory one (but then someone has to re-validate all the document
instances!), but there is more to it than this.
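
To make the "easy enough" case concrete, here is a minimal Python sketch for
the flat BNC-style markup quoted further down (<w FOO>word <w BAR>another
word). It assumes that every <w> element ends where the next tag begins and
that the bare token becomes a "type" attribute. The function name is my own,
and a real converter would work from an SGML parser and the DTD rather than
from a regular expression.

```python
import re

# Matches a minimized start-tag such as <w FOO> plus the text up to the next tag.
W_TAG = re.compile(r"<w (\w+)>([^<]*)")


def add_end_tags(sgml: str) -> str:
    """Quote the attribute value and close every <w> element explicitly.

    Assumption: <w> elements are flat (never nested) and each one ends
    where the next tag starts.
    """
    return W_TAG.sub(r'<w type="\1">\2</w>', sgml)


print(add_end_tags("<w FOO>word <w BAR>another word"))
# prints: <w type="FOO">word </w><w type="BAR">another word</w>
```

Anything with nested elements, or with end-tags whose position can only be
inferred from the DTD, breaks the assumption above, which is why the
re-validation step cannot be skipped.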

For instance, there are some changes which are a bit risky to make
automatically (e.g. replacing CDATA with #PCDATA).

Furthermore, there are some things which you might have in your DTD which
are simply not supported in XML, such as inclusions and exclusions, and the
dreaded AND connector ("these elements in any order"). These are the kinds of
thing which made processing SGML such a nightmare and prompted the
development of XML, and it is these characteristics which make the automatic
translation equally fraught with difficulty.
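
As an illustration of why the AND connector is awkward, here is a rough
Python sketch that rewrites an SGML group such as (a & b & c) as an XML
content model built only from sequences and choices. It assumes every member
of the group is required exactly once (no "?", "*" or "+" indicators), and
the function name expand_and_group is hypothetical. The alternatives are
factored on their first element so that the result stays deterministic, as
XML expects of content models.

```python
def expand_and_group(names):
    """Return an XML content model that accepts each name once, in any order.

    Assumption: every element in the AND group is required exactly once.
    """
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, name in enumerate(names):
        # Pick one element to come first, then expand the remaining ones.
        rest = names[:i] + names[i + 1:]
        alternatives.append(f"({name}, {expand_and_group(rest)})")
    return "(" + " | ".join(alternatives) + ")"


print(expand_and_group(["a", "b"]))
# prints: ((a, b) | (b, a))

# The expansion covers n! orderings, so the model grows very quickly.
for n in range(2, 7):
    members = [f"e{k}" for k in range(n)]
    print(n, "elements ->", len(expand_and_group(members)), "characters")
```

Two or three elements are manageable; beyond that the generated model becomes
unreadable, and optional members make it worse still.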

The short answer then is that SGML->XML is only as easy as processing the
original SGML DTD in the first place (this might be intractable :-)), and
even once the DTD is converted, the cost of validating all the legacy
instances against the DTD should not be underestimated either...

Cheers,
Andrew

> -----Original Message-----
> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
> Behalf Of Lou Burnard
> Sent: 29 September 1999 13:14
> To: corpora@hd.uib.no
> Subject: Re: Corpora: Corpora and XML
>
>
> |
> |On Tue, 28 Sep 1999, Kristina Kjellson wrote:
> |
> |> Is there anyone who has experience in annotating corpora using XML?
>
> The differences between annotation using SGML and annotation using XML
> are fairly trivial, mostly relating to the "surface form" of the
> annotation. More specifically:
>
> SGML                                 XML
>
> identifiers may be case-sensitive    identifiers are always case-sensitive
>
> end-tags are optional                end-tags are mandatory
>
> attribute values may be quoted       attribute values must be quoted
>
> empty tags are indistinguishable     empty tags use a distinct syntax
> from start-tags
>
> So, in the general case, taking an SGML marked up corpus and turning
> it into an XML one is far from rocket science, indeed it's barely
> bicycle-science.
>
> One possible difficulty in the XML approach is that it adds to the
> verbosity of the markup. In the case of the BNC, for example, which
> has 100 million words tagged like this
>
> <w FOO>word <w BAR>another word
>
> an XML version would have to look like this
>
> <w type="FOO">word </w><w type="BAR">another word</w>
>
> An overhead of at least 8 bytes for each of 100 million words increases
> the size of the corpus by 800 MB.
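
The arithmetic can be checked with a few lines of Python. This is only a
sketch: it measures the one sample token shown above, so the per-word figure
it gives (11 bytes, consistent with "at least 8") applies to that token and
to single-byte characters, not to the real BNC markup. The variable names
are my own.

```python
# The two forms of a single tagged word, copied from the example above.
sgml_token = '<w FOO>word '
xml_token = '<w type="FOO">word </w>'

WORDS_IN_CORPUS = 100_000_000  # "100 million words"

# Extra bytes needed per word: type=" plus the closing quote plus </w>.
per_word = len(xml_token) - len(sgml_token)
total_bytes = per_word * WORDS_IN_CORPUS

print(per_word, "extra bytes per word")
print(total_bytes / 1_000_000, "MB extra for the whole corpus")
```

With the lower bound of 8 bytes per word the same multiplication gives the
800 MB quoted above.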
>
> ON THE OTHER HAND (a) the extra markup doesn't have to be inserted or
> maintained manually
>
> (b) the data can be compressed for storage and uncompressed on the fly
>
> (c) most compression algorithms are particularly effective where there
> are short and high frequency tokens, as in this case
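
Point (c) can be tried out with Python's standard zlib module. The sketch
below builds a toy corpus in both surface forms and compares the overhead
before and after compression. The tag names and vocabulary are placeholders,
and the sample is far more repetitive than real text, so the numbers it
prints say nothing about the BNC itself; it only shows how such a comparison
could be set up.

```python
import zlib

TAGS = ["FOO", "BAR", "BAZ"]                     # placeholder tag names
WORDS = ["the", "cat", "sat", "on", "a", "mat"]  # toy vocabulary


def build(n_tokens: int, template: str) -> bytes:
    """Concatenate n_tokens tagged words using the given surface form."""
    return "".join(
        template.format(tag=TAGS[i % len(TAGS)], word=WORDS[i % len(WORDS)])
        for i in range(n_tokens)
    ).encode("ascii")


n_tokens = 10_000
sgml = build(n_tokens, "<w {tag}>{word} ")
xml = build(n_tokens, '<w type="{tag}">{word} </w>')

# Overhead of the XML form, uncompressed and after zlib compression.
raw_overhead = len(xml) - len(sgml)
packed_overhead = len(zlib.compress(xml)) - len(zlib.compress(sgml))

print("raw overhead:   ", raw_overhead, "bytes")
print("packed overhead:", packed_overhead, "bytes")
```

On real corpus text the saving will be less dramatic than on this toy data,
but the repeated tag strings are exactly the kind of input that
dictionary-based compressors handle well.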
>
>
> As an example, consider the BNC Sampler CD (for details see
> http://info.ox.ac.uk/bnc/getting/sampler.html) -- of the four software
> systems on the CD, one (Qwick) operates against a compressed XML
> version of the corpus. See the Qwick home page at
> http://www-clg.bham.ac.uk/QWICK/
>
>
> If you're talking about converting an SGML document type definition to
> an XML conformant one, however, things get a little more complex (but
> not much). Some constructs are not allowed (e.g. use of inclusion
> exceptions, some mixed content models) and some attribute datatypes
> are not allowed.
>
> If your DTD is based on the Text Encoding Initiative, you'll be glad
> to know that *any* TEI-conformant DTD can be automatically converted
> to an XML version using the Pizza Chef software. For details, see
> www.hcu.ox.ac.uk/TEI/pizza.html
>
> HTH
>
> Lou
>
> |
> |There is the Corpus Encoding Standard, which uses SGML DTDs to specify
> |its formatting. Perhaps you could use these to form XML DTDs (which are
> |somewhat simpler).
> |
> |The CES can be found at:
> |http://www.cs.vassar.edu/CES/
> |
> |Converting an SGML DTD to XML:
> |http://www.xml.com/xml/pub/98/07/dtd/
> |
> |Good luck,
> |
> |Arjen Poutsma
> |
> |
> |
>
> ----------------------------------------------------------------
> Lou Burnard http://users.ox.ac.uk/~lou
> ----------------------------------------------------------------
>
>
>