Re: Corpora: Corpora and XML

Ted E. Dunning (ted@hncais.com)
Wed, 29 Sep 1999 16:04:10 -0700 (PDT)

I beg to differ.

I believe that this is a valid SGML document (given an appropriate DTD,
which I don't provide but which can be surmised). I have provided
end-markers in order to be very explicit about the non-nested nature
of concurrent markup.

<book>
<frontspiece>Who and when</frontspiece>
<page n=1>
<p>
This is the first paragraph on the first page.
</p>
<p>
The second paragraph extends
</page>
<page n=2>
onto the second page.
</p>
<p>
The third paragraph is entirely on the second page.
</p>
</page>
<endmatter>Index and stuff goes here</endmatter>
</book>

Here we have concurrent markup made up of two overlapping but
individually well-nested structures (shown here in XML form):

<book><frontspiece/><page/><page/><endmatter/></book>

and

<book><p/><p/><p/></book>

Now, my SGML is a bit rusty (it started out that way, I should hasten
to add), but I am pretty sure that this structure can be represented
in SGML and cannot be directly converted to XML. What you *can* do is
cheat by using empty elements to mark page boundaries. By doing this,
you lose the syntactic guarantees about where pages must start and
end.

<book>
<frontspiece>Who and when</frontspiece>
<pagestart n="1"/>
<p>
This is the first paragraph on the first page.
</p>
<p>
The second paragraph extends
<pageend n="1"/>
<pagestart n="2"/>
onto the second page.
</p>
<p>
The third paragraph is entirely on the second page.
</p>
<pageend n="2"/>
<endmatter>Index and stuff goes here</endmatter>
</book>
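
A rough way to check both claims is to hand the two versions to an XML
parser. The Python sketch below is illustrative only: it uses the
standard library's xml.etree.ElementTree, the helper name
is_well_formed is made up for this example, the documents are trimmed
to the paragraph that straddles the page break, and attribute values
are quoted because XML requires that.

```python
import xml.etree.ElementTree as ET

# Overlapping version: </page> arrives while <p> is still open.
OVERLAPPING = """<book>
<page n="1">
<p>
The second paragraph extends
</page>
<page n="2">
onto the second page.
</p>
</page>
</book>"""

# Milestone version: page boundaries are empty elements.
MILESTONES = """<book>
<pagestart n="1"/>
<p>
The second paragraph extends
<pageend n="1"/>
<pagestart n="2"/>
onto the second page.
</p>
<pageend n="2"/>
</book>"""


def is_well_formed(text):
    """Return True if the XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print("overlapping:", is_well_formed(OVERLAPPING))
print("milestones: ", is_well_formed(MILESTONES))
```

The overlapping version is rejected as soon as </page> shows up inside
the open <p>. The milestone version parses, since to the parser
pagestart and pageend are just empty elements with no relationship to
each other.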

Expressed as a regular grammar, this document structure would look
like this:

book ::= frontspiece? ( pagestart | pageend | p )* endmatter
frontspiece ::= PCDATA
pagestart ::= EMPTY
pageend ::= EMPTY
p ::= ( PCDATA | pagestart | pageend )*
endmatter ::= PCDATA
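
Nothing in this grammar ties a pagestart to its pageend, so any check
that pages open and close sensibly has to move into application code.
A minimal Python sketch of such a check follows. It assumes the
standard library's xml.etree.ElementTree and quoted attribute values;
the function name page_errors and the two sample documents are
invented for illustration.

```python
import xml.etree.ElementTree as ET


def page_errors(root):
    """Check that pagestart/pageend milestones pair up in document order."""
    errors = []
    open_page = None  # value of n for the page currently open, if any
    for elem in root.iter():  # iter() visits elements in document order
        if elem.tag == "pagestart":
            if open_page is not None:
                errors.append(f"page {elem.get('n')} starts inside page {open_page}")
            open_page = elem.get("n")
        elif elem.tag == "pageend":
            if open_page is None:
                errors.append(f"page {elem.get('n')} ends but never started")
            elif elem.get("n") != open_page:
                errors.append(f"page {elem.get('n')} ends inside page {open_page}")
            open_page = None
    if open_page is not None:
        errors.append(f"page {open_page} never ends")
    return errors


# Milestones pair up, even though page 1 ends inside a paragraph.
GOOD = ET.fromstring(
    '<book><pagestart n="1"/><p>one <pageend n="1"/><pagestart n="2"/>two</p>'
    '<pageend n="2"/><endmatter>index</endmatter></book>'
)

# Well-formed and allowed by the grammar above, yet the end marker
# names a different page than the start marker.
BAD = ET.fromstring(
    '<book><pagestart n="1"/><p>one</p><pageend n="2"/>'
    '<endmatter>index</endmatter></book>'
)

print(page_errors(GOOD))
print(page_errors(BAD))
```

With real <page> elements the parser would refuse the second document
outright; with milestones it passes the parser and the mismatch only
turns up if somebody remembers to run a check like this one.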

This sort of structure can also be encoded as a byzantine set of links
between multiple documents, but the less said about that option, the
better. You still lose many of the syntactic guarantees of SGML.

>>>>> "lb" == Burnard Towers <lou.burnard@computing-services.oxford.ac.uk> writes:

lb> ANY valid SGML document is ipso facto a well-formed XML
lb> document (as Ted Dunning almost pointed out), or can readily
lb> be turned into one using a variety of free software (sx from
lb> James Clark for example) or the simple rules of thumb I
lb> outlined.