Re: Corpora: Corpora and XML

Ted E. Dunning (ted@hncais.com)
Wed, 29 Sep 1999 11:54:26 -0700 (PDT)

It should be noted that Lou was very careful here to limit what he
said. The differences between using SGML and XML for text annotation
purposes are relatively trivial, as he says. The differences between
SGML and XML in general are vast and have mostly to do with the
massive, unnecessary and unwieldy generality of SGML. The meaning of
the <> characters themselves can be redefined using SGML as can
virtually everything else. I have heard it said that there still is
no parser that handles all of SGML even after all these years. XML,
on the other hand, is much less fluid; virtually all of the truly
complex features of SGML have been removed. The result is that
well-formed XML can always be parsed even without a DTD.
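
As a rough illustration of that last point, here is a minimal sketch in
Python using the standard library's xml.etree.ElementTree. The sample
annotation and its element names (text, s, w) are invented for the
example; nothing here declares or consults a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence: no DOCTYPE, no DTD, invented tag names.
sample = "<text><s n='1'>The <w pos='NN'>cat</w> sat.</s></text>"

# The parser needs no declarations to recover the element structure.
root = ET.fromstring(sample)
tags = [el.tag for el in root.iter()]
print(tags)  # ['text', 's', 'w']


def is_well_formed(doc):
    """Return True if doc parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(sample))            # True
print(is_well_formed("<a><b></a></b>"))  # False: tags are not properly nested
```

The parser only checks that the tags nest properly; whether the
elements make sense for a given annotation scheme is a separate
question that a DTD (or other schema) would answer.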

One change from SGML to XML which does matter a good deal for text
annotation purposes, however, is the loss of concurrent markup
(SGML's CONCUR feature). For instance, in SGML, it would be possible to
annotate a hierarchical physical structure of book/signature/page as
well as a logical structure of chapter/section/paragraph. Clearly,
the physical and logical structures only coincide in some respects.
For instance, chapters will often start on page boundaries, but
paragraphs usually will not. There are workarounds, but XML basically
cannot represent two overlapping hierarchies like these without some
serious cleverness or loss of utility.
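
One well-known workaround, offered here only as a sketch and not as
anything proposed in this thread, is to keep the logical structure as
real elements and reduce the physical structure to empty "milestone"
elements that mark where each page begins. The sample text, the element
names (chapter, p, pb) and the helper pages_per_paragraph are all
assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <pb n='..'/> marks the start of page n.
doc = (
    "<chapter n='1'>"
    "<pb n='1'/>"
    "<p>Starts on page one <pb n='2'/>and ends on page two.</p>"
    "<p>Sits wholly on page two.</p>"
    "</chapter>"
)


def pages_per_paragraph(xml_text):
    """List, for each <p> in document order, the pages it touches.

    Simplifying assumption: a <pb/> is never the very first thing
    inside a paragraph.
    """
    root = ET.fromstring(xml_text)
    current_page = None
    result = []

    def walk(el, open_para):
        nonlocal current_page
        if el.tag == "pb":
            # A milestone: update the page, and note it on any open paragraph.
            current_page = el.get("n")
            if open_para is not None:
                open_para.append(current_page)
            return
        if el.tag == "p":
            # A paragraph begins on whatever page is current.
            open_para = [current_page]
            result.append(open_para)
        for child in el:
            walk(child, open_para)

    walk(root, None)
    return result


print(pages_per_paragraph(doc))  # [['1', '2'], ['2']]

# Marking both hierarchies with real elements makes them overlap,
# and an XML parser rejects the result outright.
overlapping = "<chapter><page><p>text</page><page>more</p></page></chapter>"
try:
    ET.fromstring(overlapping)
except ET.ParseError:
    print("overlapping elements are not well-formed XML")
```

The loss of utility shows up immediately: a page is no longer an
element with content, so a question like "what text is on page two?"
needs custom code of this kind rather than an ordinary query on the
tree.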

>>>>> "lb" == Lou Burnard <lou.burnard@computing-services.oxford.ac.uk> writes:

lb> The differences between annotation using SGML and annotation
lb> using XML are fairly trivial, mostly relating to the "surface
lb> form" of the annotation.