Re: Corpora: Corpus markup checking programs

Chris Brew (Chris.Brew@edinburgh.ac.uk)
Thu, 6 Nov 1997 11:27:11 GMT

>eiaamme@msmail.lancs.ac.uk said:
>] but rather programs which check, say, whether DTDs have been adhered
>] to, or check that SGML has been properly applied to a document.
>
>It is what we call an SGML parser. The more famous one is nsgmls coming with
>the James Clark package SP (available at http://www.jclark.com/).
>
>But what you mentioned will not check the semantic integrity of corpus
>encoding. For example, you can put anything you want within a <P> (let's say
>a paragraph), even if your data is not a paragraph, and the SGML syntax is
>still correct! As far as I know, there is no tool or software to check that
>level of integrity.

Let me see if I understand this. In essence this is one of the
warnings found in Drew McDermott's classic "Artificial Intelligence
meets Natural Stupidity". Merely calling something a
paragraph doesn't make it a paragraph. What you are saying is that you can
use the SGML <P> </P> markup in ways which are at variance with the
"common-sense" notion of paragraph. So you could mark up my current
favourite paragraph either reasonably

....
<P>
Bong! the stone hit the dog.
</P>

or unreasonably

<P>Bong</P>
<P>!</P>
<P> the </P>
<P>stone</P>
<P>hit</P>
<P>the</P>
<P>dog</P>
<P>.</P>

the former being preferred. In order to check for this sort of abuse, you
would (of course) need some independent program capable of validating the
contents of the <P> elements. This would in turn require that the people who
designed the annotation scheme have a sufficiently precise notion about what
ought to be true about paragraphs. It may be possible to encode some of this
notion into the DTD. It would be easy to say that paragraphs cannot directly
contain anything except sentences, and that sentences in turn contain words,
preventing the abuse shown above. But in many cases the
idea in the mind of the corpus designer is more sophisticated than anything
which can comfortably be encoded in an SGML DTD. In this case you need some
other way of expressing the original intention (plausible candidates are
predicate logic, formal specification languages, clearly written English text
which your staff programmer can turn into executables, Perl programs ...).
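
To make the idea of an "independent program" concrete, here is a rough
sketch in Python of a check that goes beyond what the DTD can express. The
rule it enforces (a paragraph has at least two words and ends in
sentence-final punctuation) is an assumption invented for illustration, not
part of any real annotation scheme, and the names MIN_WORDS and
suspicious_paragraphs are hypothetical. It finds <P> elements with a plain
regular expression rather than a real SGML parser, so it would only cope
with simple, fully tagged input like the examples above.

```python
import re

# Hypothetical rule, invented for this sketch: a paragraph must contain at
# least MIN_WORDS words and must end in sentence-final punctuation.
MIN_WORDS = 2

# Naive matching of <P>...</P>; this is not an SGML parser and assumes
# explicit, non-nested start and end tags.
P_ELEMENT = re.compile(r"<P>(.*?)</P>", re.DOTALL | re.IGNORECASE)
WORD = re.compile(r"[A-Za-z]+")


def suspicious_paragraphs(text):
    """Return the contents of the <P> elements that break the rule above."""
    bad = []
    for match in P_ELEMENT.finditer(text):
        content = match.group(1).strip()
        words = WORD.findall(content)
        if len(words) < MIN_WORDS or not content.endswith((".", "!", "?")):
            bad.append(content)
    return bad


reasonable = "<P>\nBong! the stone hit the dog.\n</P>"
unreasonable = (
    "<P>Bong</P>\n<P>!</P>\n<P> the </P>\n<P>stone</P>\n"
    "<P>hit</P>\n<P>the</P>\n<P>dog</P>\n<P>.</P>"
)

if __name__ == "__main__":
    print(suspicious_paragraphs(reasonable))
    print(suspicious_paragraphs(unreasonable))
```

Under this rule the first markup is accepted and every element of the second
is flagged. A real checker would presumably run on the output of a proper
parser such as nsgmls and take its rules from the corpus designers.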

I'd be interested in any programs which clearly demonstrate the need to
check something which goes beyond what is conveniently expressible in
DTDs, and in how they choose to do it.

Chris

Email: Chris.Brew@edinburgh.ac.uk
Address: Language Technology Group, HCRC,
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4632 Fax: +44 131 650 4587