RE: Corpora: Corpora and XML

Burnard Towers (lou.burnard@computing-services.oxford.ac.uk)
Wed, 29 Sep 1999 23:41:06 +0100

At the risk of prolonging a discussion which may be of only peripheral
interest to most corpora readers, I think Andrew may be confusing the issue
of how easily an SGML dtd can be converted to an XML one with the issue of
how easily a (valid) SGML document can be converted to a valid XML one. In
my posting I intended to make clear that the latter was simple, not the
former.

ANY valid SGML document is ipso facto a well-formed XML document (as Ted
Dunning almost pointed out), or can readily be turned into one using a
variety of free software (sx from James Clark for example) or the simple
rules of thumb I outlined. Whether it is also a *valid* XML document cannot
be determined without creating an XML dtd to validate it against, and this
is certainly not a readily automatable process. (Unless you are using a TEI
dtd, of course.)
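
To make the well-formed/valid distinction concrete, here is a minimal sketch
in Python, assuming only the standard library. The parser behind
xml.etree.ElementTree is non-validating, so it can report well-formedness
but says nothing about validity against a dtd. The function name
is_well_formed and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML (well-formedness only, no DTD validation)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# SGML-style markup: omitted end-tags and unquoted attribute values
sgml_style = "<s><w type=FOO>word <w type=BAR>another word</s>"
# The same content after normalising the surface syntax for XML
xml_style = '<s><w type="FOO">word </w><w type="BAR">another word</w></s>'

print(is_well_formed(sgml_style))  # False
print(is_well_formed(xml_style))   # True
```

A True result here only means the document is well-formed; checking validity
would still need an XML dtd and a validating parser.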

Apologies for any confusion...

Lou

At 14:20 29/09/99 +0200, you wrote:
>Sorry to be picky, but I would not be *so* blase about the ease with which
>SGML can be converted to XML. In principle Lou is almost right, there is not
>so much difference. In practice it depends very much on your SGML DTD.
>
>Admittedly, it is easy enough to automatically change any optional end-tag
>to an obligatory one (but then someone has to re-validate all the document
>instances!), but there is more to it than this.
>
>For instance, there are some items which are a bit risky to do automatically
>(e.g. replacing CDATA with #PCDATA, etc.).
>
>Furthermore, there are some things which you might have in your DTD which
>are simply not supported in XML, such as inclusions and exclusions, and the
>dreaded AND connector ("these elements in any order"). These are the kind of
>thing which made processing SGML such a nightmare and prompted the
>development of XML, and it is these characteristics which make the automatic
>translation equally fraught with difficulty.
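
As an illustration of why the AND connector is awkward to translate, the
sketch below rewrites a simple AND group as an XML content model by spelling
out every permitted order. It is only a sketch under stated assumptions:
every element in the group is required exactly once (no ? or * indicators),
and the function name expand_and_group is invented here. The alternatives
are factored on their first element so the result stays deterministic, as
XML content models are expected to be.

```python
def expand_and_group(names):
    """Rewrite an SGML AND group of required elements as an XML content model.

    (a & b) means "a and b, each once, in either order". XML has no such
    connector, so each order is written out, grouped by its first element.
    """
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]  # the elements still to be placed
        alternatives.append(f"({first}, {expand_and_group(rest)})")
    return "(" + " | ".join(alternatives) + ")"


print(expand_and_group(["a", "b"]))
# ((a, b) | (b, a))
print(expand_and_group(["a", "b", "c"]))
```

The number of orders grows as the factorial of the group size, so a group of
six elements already needs 720 sequences, which is one reason this step is
hard to automate sensibly.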
>
>The short answer then is that SGML->XML is only as easy as processing the
>original SGML DTD in the first place (this might be intractable :-)), and
>even once the DTD is converted, the cost of validating all the legacy
>instances against the DTD should not be underestimated either...
>
>Cheers,
>Andrew
>
>> -----Original Message-----
>> From: owner-corpora@lists.uib.no [mailto:owner-corpora@lists.uib.no]On
>> Behalf Of Lou Burnard
>> Sent: 29 September 1999 13:14
>> To: corpora@hd.uib.no
>> Subject: Re: Corpora: Corpora and XML
>>
>>
>> |
>> |On Tue, 28 Sep 1999, Kristina Kjellson wrote:
>> |
>> |> Is there anyone who has experience in annotating corpora using XML?
>>
>> The differences between annotation using SGML and annotation using XML
>> are fairly trivial, mostly relating to the "surface form" of the
>> annotation. More specifically:
>>
>> SGML                               XML
>>
>> identifiers may be case-sensitive  identifiers are always case-sensitive
>>
>> end-tags are optional              end-tags are mandatory
>>
>> attribute values may be quoted     attribute values must be quoted
>>
>> empty tags are indistinguishable   empty tags use a distinct syntax
>> from start-tags
>>
>> So, in the general case, taking an SGML marked up corpus and turning
>> it into an XML one is far from rocket science, indeed it's barely
>> bicycle-science.
>>
>> One possible difficulty in the XML approach is that it adds to the
>> verbosity of the markup. In the case of the BNC, for example, which
>> has 100 million words tagged like this
>>
>> <w FOO>word <w BAR>another word
>>
>> an XML version would have to look like this
>>
>> <w type="FOO">word </w><w type="BAR">another word</w>
>>
>> An overhead of at least 8 bytes for each of 100 million words increases
>> the size of the corpus by at least 800 MB.
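
The conversion and the overhead estimate above can be sketched in Python.
This is a rough illustration only: it assumes the simplified <w CODE>word
form shown in the example rather than the real BNC markup, and the names
W_TAG and sgml_words_to_xml are invented here. With the attribute name
"type" the sketch counts 11 extra bytes per word, which is consistent with
the "at least 8 bytes" figure.

```python
import re

# A start-tag such as <w FOO> followed by the text up to the next tag
W_TAG = re.compile(r"<w (\w+)>([^<]*)")


def sgml_words_to_xml(text: str) -> str:
    """Turn each '<w FOO>word ' into '<w type="FOO">word </w>'."""
    return W_TAG.sub(r'<w type="\1">\2</w>', text)


sgml = "<w FOO>word <w BAR>another word"
xml = sgml_words_to_xml(sgml)
print(xml)

# Extra bytes per tagged word, then scaled up to 100 million words
extra_per_word = (len(xml) - len(sgml)) / 2
total_mb = extra_per_word * 100_000_000 / 1_000_000
print(extra_per_word, "bytes per word,", total_mb, "MB in total")
```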
>>
>> ON THE OTHER HAND (a) the extra markup doesn't have to be inserted or
>> maintained manually
>>
>> (b) the data can be compressed for storage and uncompressed on the fly
>>
>> (c) most compression algorithms are particularly effective where there
>> are short and high frequency tokens --as in this case
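
Point (c) can be tried out with a few lines of Python and the standard zlib
module. This is a sketch under a loud assumption: the sample is the same two
tagged words repeated many times, which is far more regular than any real
corpus, so the saving it shows is exaggerated. The function name
compressed_size is invented for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the number of bytes zlib needs to store data."""
    return len(zlib.compress(data))


# Artificial sample: short, high-frequency markup tokens repeated 10,000 times
sample = '<w type="FOO">word </w><w type="BAR">another word</w>' * 10_000
raw = sample.encode("ascii")

print(len(raw), "bytes uncompressed")
print(compressed_size(raw), "bytes compressed")

# Decompression restores the original exactly, so nothing is lost
restored = zlib.decompress(zlib.compress(raw))
print(restored == raw)
```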
>>
>>
>> As an example, consider the BNC Sampler CD (for details see
>> http://info.ox.ac.uk/bnc/getting/sampler.html) -- of the four software
>> systems on the CD, one (Qwick) operates against a compressed XML
>> version of the corpus. See the Qwick home page at
>> http://www-clg.bham.ac.uk/QWICK/
>>
>>
>> If you're talking about converting an SGML document type definition to
>> an XML conformant one, however, things get a little more complex (but
>> not much). Some constructs are not allowed (e.g. use of inclusion
>> exceptions, some mixed content models) and some attribute datatypes
>> are not allowed.
>>
>> If your DTD is based on the Text Encoding Initiative, you'll be glad
>> to know that *any* TEI conformant dtd can be automatically converted
>> to an XML version using the Pizza Chef software. See further
>> www.hcu.ox.ac.uk/TEI/pizza.html
>>
>> HTH
>>
>> Lou
>>
>> |
>> |There is the Corpus Encoding Standard, which uses SGML DTD's to specify
>> |its formatting. Perhaps you could use these to form XML DTD's (which are
>> |somewhat simpler).
>> |
>> |the CES can be found at:
>> |http://www.cs.vassar.edu/CES/
>> |
>> |Converting an SGML DTD to XML:
>> |http://www.xml.com/xml/pub/98/07/dtd/
>> |
>> |Good luck,
>> |
>> |Arjen Poutsma
>> |
>> |
>> |
>>
>> ----------------------------------------------------------------
>> Lou Burnard http://users.ox.ac.uk/~lou
>> ----------------------------------------------------------------