Let me see if I understand this. In essence this is one of the
warnings found in Drew McDermott's classic "Artificial Intelligence
Meets Natural Stupidity": merely calling something a paragraph doesn't
make it a paragraph. What you are saying is that you can use the SGML
<P> </P> markup in ways which are at variance with the "common-sense"
notion of paragraph. So you could mark up my current favourite
paragraph either reasonably
....
<P>
Bong! the stone hit the dog.
</P>
or unreasonably
<P>Bong</P>
<P>!</P>
<P> the </P>
<P>stone</P>
<P>hit</P>
<P>the</P>
<P>dog</P>
<P>.</P>
the former being preferred. To check for this sort of abuse you would (of
course) need some independent program capable of validating the contents of
the <P> elements. That in turn requires that the people who designed the
annotation scheme have a sufficiently precise notion of what ought to be true
of paragraphs. It may be possible to encode some of this notion in the DTD. It
would be easy to say that paragraphs cannot directly contain anything except
sentences, and that sentences in turn contain words, which would prevent the
abuse shown above. But in many cases the idea in the mind of the corpus
designer is more sophisticated than anything which can comfortably be encoded
in an SGML DTD. In that case you need some other way of expressing the
original intention (plausible candidates are predicate logic, formal
specification languages, clearly written English text which your staff
programmer can turn into executables, Perl programs ...).
I'd be interested in any programs which clearly demonstrate the need to
check something which goes beyond what is conveniently expressible in
DTDs, and in how they choose to do it.
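
To make the kind of check I have in mind concrete, here is a minimal sketch
(in Python rather than Perl). Everything in it is an assumption made for
illustration: the <DOC> wrapper element, the function name
suspicious_paragraphs, and above all the two rules themselves (a paragraph
should contain at least two words and should end in terminal punctuation).
It also reads the markup with an XML parser, which only works because this
fragment happens to be well-formed XML; real SGML would need a real SGML
parser.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: a plausible paragraph ends with one of these characters.
TERMINAL = (".", "!", "?")


def suspicious_paragraphs(doc):
    """Return the text of every <P> element that fails the plausibility rules.

    The rules (at least two words, terminal punctuation at the end) are
    illustrative assumptions, not part of any annotation scheme.
    """
    root = ET.fromstring(doc)
    flagged = []
    for p in root.iter("P"):
        # Collect all character data inside the paragraph, ignoring markup.
        text = "".join(p.itertext()).strip()
        words = re.findall(r"\w+", text)
        if len(words) < 2 or not text.endswith(TERMINAL):
            flagged.append(text)
    return flagged


# The two markings of the example paragraph, wrapped in a hypothetical <DOC>.
reasonable = "<DOC><P>\nBong! the stone hit the dog.\n</P></DOC>"
unreasonable = (
    "<DOC><P>Bong</P><P>!</P><P> the </P><P>stone</P>"
    "<P>hit</P><P>the</P><P>dog</P><P>.</P></DOC>"
)

print(suspicious_paragraphs(reasonable))
print(len(suspicious_paragraphs(unreasonable)))
```

The reasonable marking passes and all eight fragments of the unreasonable
one are flagged. Rules like "ends in terminal punctuation" concern the
character data rather than the element structure, which is why they do not
fit comfortably in a content model.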
Chris
Email: Chris.Brew@edinburgh.ac.uk
Address: Language Technology Group, HCRC,
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4632 Fax: +44 131 650 4587