The differences between annotation using SGML and annotation using XML
are fairly trivial, mostly relating to the "surface form" of the
annotation. More specifically:
SGML                                XML
identifiers may be case-sensitive   identifiers are always case-sensitive
end-tags are optional               end-tags are mandatory
attribute values may be quoted      attribute values must be quoted
empty tags are indistinguishable    empty tags use a distinct syntax
from start-tags
So, in the general case, taking an SGML-marked-up corpus and turning
it into an XML one is far from rocket science; indeed, it's barely
bicycle science.
One possible difficulty in the XML approach is that it adds to the
verbosity of the markup. In the case of the BNC, for example, which
has 100 million words tagged like this:
    <w FOO>word <w BAR>another word
an XML version would have to look like this:
    <w type="FOO">word </w><w type="BAR">another word</w>
An overhead of at least 8 bytes for each of 100 million words increases
the size of the corpus by at least 800 MB.
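For what it's worth, the conversion of the word tagging shown above can
be sketched in a few lines of Python. This is only a rough sketch: it
assumes every word element has exactly the shape <w CODE>text, and the
names sgml_words_to_xml, overhead_per_word and the attr parameter are
made up for illustration (the attribute name "type" is just the one used
in the example).

```python
import re

# One word element in the SGML form shown above: <w CODE>text, where the
# text runs up to the next tag or the end of the input. This shape is an
# assumption based on the two-word example, not the full BNC markup.
WORD = re.compile(r"<w ([^\s<>]+)>([^<]*)")

def sgml_words_to_xml(sgml, attr="type"):
    """Quote the bare code as an attribute value and add the end-tag."""
    return WORD.sub(
        lambda m: f'<w {attr}="{m.group(1)}">{m.group(2)}</w>', sgml
    )

def overhead_per_word(sgml, attr="type"):
    """Average number of extra bytes the XML form needs per word element."""
    xml = sgml_words_to_xml(sgml, attr)
    extra = len(xml.encode("utf-8")) - len(sgml.encode("utf-8"))
    return extra / len(WORD.findall(sgml))

sample = "<w FOO>word <w BAR>another word"
print(sgml_words_to_xml(sample))
# <w type="FOO">word </w><w type="BAR">another word</w>
print(overhead_per_word(sample))            # 11.0
print(overhead_per_word(sample, attr="t"))  # 8.0
```

With the attribute name "type" the overhead comes out at 11 bytes per
word; a one-letter attribute name gives the 8-byte minimum.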
ON THE OTHER HAND
(a) the extra markup doesn't have to be inserted or maintained manually
(b) the data can be compressed for storage and uncompressed on the fly
(c) most compression algorithms are particularly effective where there
    are short, high-frequency tokens, as in this case
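Points (b) and (c) are easy to try out with Python's standard zlib
module. The sketch below is a toy experiment, not a measurement of the
BNC: the vocabulary is random letters, the codes are placeholders, and
the attribute name "type" is the one from the example above. It compares
how much the XML tagging adds before and after compression.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so repeated runs give the same text

# Made-up vocabulary and placeholder codes: a stand-in for corpus text,
# NOT real BNC data.
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = ["".join(rng.choices(letters, k=rng.randint(2, 9)))
         for _ in range(2000)]
codes = ["FOO", "BAR", "BAZ", "QUX"]
words = [(rng.choice(codes), rng.choice(vocab)) for _ in range(20_000)]

# The same words in the SGML form and in the XML form.
sgml = "".join(f"<w {c}>{w} " for c, w in words).encode("ascii")
xml = "".join(f'<w type="{c}">{w} </w>' for c, w in words).encode("ascii")

packed_sgml = zlib.compress(sgml, 9)
packed_xml = zlib.compress(xml, 9)

raw_overhead = len(xml) - len(sgml)
packed_overhead = len(packed_xml) - len(packed_sgml)
print("extra bytes, uncompressed:", raw_overhead)
print("extra bytes, compressed:  ", packed_overhead)

# Uncompressing on the fly gives back exactly the XML that went in.
assert zlib.decompress(packed_xml) == xml
```

The extra markup is the same short string over and over, so it should
account for far fewer bytes after compression than before.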
As an example, consider the BNC Sampler CD (for details see
http://info.ox.ac.uk/bnc/getting/sampler.html) -- of the four software
systems on the CD, one (Qwick) operates against a compressed XML
version of the corpus. See the Qwick home page at
http://www-clg.bham.ac.uk/QWICK/
If you're talking about converting an SGML document type definition to
an XML-conformant one, however, things get a little more complex (but
not much). XML does not allow some constructs (e.g. inclusion
exceptions and some mixed content models) or some attribute
datatypes.
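A first pass over a DTD for this kind of thing can be roughed out in
Python. The sketch below is a crude regular-expression lint, not an SGML
parser: it ignores comments, parameter entities and marked sections, it
does not check mixed content models at all, and the sample DTD and the
name xml_problems are invented for illustration. It flags tag omission
indicators, inclusion and exclusion exceptions, the "&" connector, and
the attribute types NAME(S), NUMBER(S) and NUTOKEN(S), none of which
XML has.

```python
import re

# Pull out ELEMENT and ATTLIST declarations. Comments, parameter entities
# and marked sections are ignored: this is a rough lint, not a parser.
DECLARATION = re.compile(r"<!(ELEMENT|ATTLIST)\b(.*?)>", re.S)

# SGML-only constructs looked for inside an ELEMENT declaration.
ELEMENT_CHECKS = [
    ("tag omission indicators", re.compile(r"^\s*\S+\s+[-O]\s+[-O]\s")),
    ("inclusion exception", re.compile(r"\s\+\(")),
    ("exclusion exception", re.compile(r"\s-\(")),
    ("'&' connector", re.compile(r"&")),
]

# Attribute datatypes SGML has and XML lacks (upper-case keywords only).
SGML_ONLY_TYPES = re.compile(r"\b(NAMES?|NUMBERS?|NUTOKENS?)\b")

def xml_problems(dtd):
    """Return (problem, declaration) pairs for SGML-only constructs found."""
    found = []
    for keyword, body in DECLARATION.findall(dtd):
        declaration = f"<!{keyword}{body}>"
        if keyword == "ELEMENT":
            for label, pattern in ELEMENT_CHECKS:
                if pattern.search(body):
                    found.append((label, declaration))
        else:
            for match in SGML_ONLY_TYPES.finditer(body):
                found.append((f"attribute type {match.group(1)}", declaration))
    return found

# Invented three-declaration DTD fragment, for illustration only.
SAMPLE_DTD = """
<!ELEMENT p  - O (#PCDATA | hi)* +(note)>
<!ELEMENT hi - - (#PCDATA)>
<!ATTLIST p n NUMBER #IMPLIED>
"""

for problem, declaration in xml_problems(SAMPLE_DTD):
    print(f"{problem}: {declaration}")
```

Anything it reports still needs a human decision about how to recast the
declaration; the lint only points at where to look.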
If your DTD is based on the Text Encoding Initiative, you'll be glad
to know that *any* TEI-conformant DTD can be automatically converted
to an XML version using the Pizza Chef software. See further
www.hcu.ox.ac.uk/TEI/pizza.html
HTH
Lou
|
|There is the Corpus Encoding Standard, which uses SGML DTD's to specify
|its formatting. Perhaps you could use these to form XML DTD's (which are
|somewhat simpler).
|
|the CES can be found at:
|http://www.cs.vassar.edu/CES/
|
|Converting an SGML DTD to XML:
|http://www.xml.com/xml/pub/98/07/dtd/
|
|Good luck,
|
|Arjen Poutsma
|
|
|
----------------------------------------------------------------
Lou Burnard http://users.ox.ac.uk/~lou
----------------------------------------------------------------