Corpora: BOUNCE corpora@lists.uib.no: Non-member submission from

PostMaster UiB No (owner-corpora@lists.uib.no)
Thu, 6 Nov 1997 12:27:35 +0100

You have sent a message to the Corpora discussion list.

To reduce the possibility of mail spamming, only messages from members are
automatically sent to the list.

You will get this message either

1) because you are not a member of the list

You can send a message to majordomo@uib.no with this line in the body:

subscribe corpora

if you want to be a member (all subscription requests are inspected
and approved manually, so approval may take from hours to days).

or

2) because your current address is different from the address you had when
you subscribed (however small the difference).

Send a message to corpora-request@hd.uib.no if you want to change your
address (and if possible give any of your old e-mail addresses).

Your current message will be manually inspected and forwarded to the
list (after a delay of some hours or days) if the content is appropriate
for the Corpora list.

Best regards
Knut Hofland, Listadm. Corpora (Knut.Hofland@hd.uib.no)

Your message was:

Received: from stevenson144.cogsci.ed.ac.uk (actually 129.215.144.1)
by noralf.uib.no with SMTP (PP); Thu, 6 Nov 1997 12:27:17 +0100
Received: from [129.215.110.167] (mac-chrisbr [129.215.110.167])
by stevenson.cogsci.ed.ac.uk (8.8.5/8.8.5) with ESMTP id LAA28450;
Thu, 6 Nov 1997 11:27:11 GMT
Date: Thu, 6 Nov 1997 11:27:11 GMT
X-Sender: chrisbr@mail.cogsci.ed.ac.uk
Message-Id: <l03102800b087548a20ec@[129.215.110.167]>
In-Reply-To: <199711060929.KAA19565@chimay.loria.fr>
References: Your message of "Wed,
05 Nov 1997 10:31:28 GMT." <199711051037.LAA13160@nora.hd.uib.no>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
To: Patrice Bonhomme <Patrice.Bonhomme@loria.fr>
From: Chris Brew <Chris.Brew@edinburgh.ac.uk>
Subject: Re: Corpora: Corpus markup checking programs
Cc: corpora@uib.no, chrisbr@cogsci.ed.ac.uk

>eiaamme@msmail.lancs.ac.uk said:
>] but rather programs which check, say, whether DTDs have been adhered
>] to, or check that SGML has been properly applied to a document.
>
>This is what we call an SGML parser. The best-known one is nsgmls, which
>comes with James Clark's SP package (available at http://www.jclark.com/).
>
>But what you mentioned will not check the semantic integrity of the corpus
>encoding. For example, you can put anything you want within a <P> (say,
>a paragraph), even if your data is not a paragraph, and the SGML syntax
>will still be correct! As far as I know, there is no tool or software to
>check that level of integrity.

Let me see if I understand this. In essence this is one of the
warnings found in Drew McDermott's classic "Artificial Intelligence
meets Natural Stupidity". Merely calling something a
paragraph doesn't make it a paragraph. What you are saying is that you can
use the SGML <P> </P> markup in ways which are at variance with the
"common-sense" notion of paragraph. So you could mark up my current
favourite paragraph either reasonably

....
<P>
Bong! the stone hit the dog.
</P>

or unreasonably

<P>Bong</P>
<P>!</P>
<P> the </P>
<P>stone</P>
<P>hit</P>
<P>the</P>
<P>dog</P>
<P>.</P>

the former being preferred. In order to check for this sort of abuse, you
would (of course) need some independent program capable of validating the
contents of the <P> elements. This would in turn require that the people who
designed the annotation scheme have a sufficiently precise notion of what
ought to be true about paragraphs. It may be possible to encode some of this
notion into the DTD. It would be easy to say that paragraphs cannot directly
contain anything except sentences, and that sentences in turn contain words,
preventing the abuse shown above. But in many cases the idea in the mind of
the corpus designer is more sophisticated than anything which can comfortably
be encoded in an SGML DTD. In this case you need some other way of expressing
the original intention (plausible candidates are predicate logic, formal
specification languages, clearly written English text which your staff
programmer can turn into executables, Perl programs ...).
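
To make the distinction concrete, here is a minimal sketch of such an
independent checker, written in Python rather than Perl. It rests on several
assumptions that are mine and not part of any existing scheme: the markup is
fully tagged (explicit end tags, so an XML parser can read it), sentences and
words are marked with hypothetical <S> and <W> elements, and the "intention"
about paragraphs is simply that the last word of a paragraph is sentence-final
punctuation. The first two tests are the kind of thing a DTD content model
could express; the last one is not.

```python
import xml.etree.ElementTree as ET

# Assumed rule, for illustration only: a paragraph ends with one of these.
TERMINATORS = {".", "!", "?"}


def check_paragraphs(markup):
    """Return a list of complaints about the <P> elements in `markup`.

    `markup` is a fragment with explicit end tags; it is wrapped in a
    dummy <DOC> element so that it parses as a single tree.
    """
    root = ET.fromstring("<DOC>" + markup + "</DOC>")
    problems = []
    for n, p in enumerate(root.iter("P"), start=1):
        # Structural test 1: no character data directly inside <P>.
        has_text = bool((p.text or "").strip())
        has_tail = any((child.tail or "").strip() for child in p)
        if has_text or has_tail:
            problems.append(f"P {n}: text directly inside <P>")
        # Structural test 2: only <S> children are allowed in <P>.
        if any(child.tag != "S" for child in p):
            problems.append(f"P {n}: child other than <S>")
        # Test beyond the content model: look at what the words actually are.
        words = [(w.text or "").strip() for w in p.iter("W")]
        if not words:
            problems.append(f"P {n}: no words")
        elif words[-1] not in TERMINATORS:
            problems.append(f"P {n}: does not end with final punctuation")
    return problems


if __name__ == "__main__":
    reasonable = ("<P><S><W>Bong</W><W>!</W><W>the</W><W>stone</W>"
                  "<W>hit</W><W>the</W><W>dog</W><W>.</W></S></P>")
    unreasonable = "<P>Bong</P><P>!</P><P>the</P>"
    truncated = "<P><S><W>the</W><W>stone</W></S></P>"
    for sample in (reasonable, unreasonable, truncated):
        print(check_paragraphs(sample))
```

The third sample is the interesting one: it satisfies the "paragraphs contain
sentences, sentences contain words" structure, yet the checker still flags it,
because the rule it breaks is about the data and not about the nesting.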

I'd be interested in any programs which clearly demonstrate the need to
check something which goes beyond what is conveniently expressible in
DTDs, and in how they choose to do it.

Chris

Email: Chris.Brew@edinburgh.ac.uk
Address: Language Technology Group, HCRC,
2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Telephone: +44 131 650 4632 Fax: +44 131 650 4587