Re: Corpora: Corpus markup checking programs

Arjan Loeffen (Arjan.Loeffen@let.ruu.nl)
Tue, 18 Nov 1997 10:16:33 +0100

Patrice Bonhomme wrote:

"But what you mentioned will not check the semantic integrity of corpus
encoding. For example, you can put anything you want within a <P>
(say, a paragraph), even if your data is not a paragraph, while the
SGML syntax remains correct! As far as I know, there is no tool or
software to check that level of integrity."

Though we may never be able to determine what is inside the <P>...</P>
element on the natural-language level (which is presumably what Patrice
is referring to), we *can* go some way towards determining the validity
of less 'natural' content. For example, SGML gives you no means to
determine whether a <date> contains a date, or whether a <link> contains
anything 'linkable' when it is not an SGML link. HyTime has introduced
'architectural forms' and 'lexical types' for this.
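To make the idea concrete, here is a minimal sketch of such a lexical
check in Python. The element name <date>, the ISO-style pattern, and the
use of an XML parser are all assumptions for illustration only; a real
SGML corpus would need an SGML-aware parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lexical rule: a <date> element must contain an
# ISO-style date. Pattern and element name are invented for this sketch.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def check_dates(doc):
    """Return error messages for <date> elements whose character
    data does not match the lexical rule."""
    errors = []
    for elem in ET.fromstring(doc).iter("date"):
        text = (elem.text or "").strip()
        if not ISO_DATE.match(text):
            errors.append("invalid date content: %r" % text)
    return errors

doc = "<text><date>1997-11-18</date><date>yesterday</date></text>"
print(check_dates(doc))  # -> ["invalid date content: 'yesterday'"]
```

The point is only that the check operates on data content, below the
level at which a DTD can constrain anything.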

HyTime (ISO/IEC 10744) is based on attaching linking semantics to
elements in a document, extending the logic of plain SGML. To determine
that the information in a document is valid in the HyTime sense, it
defines a mechanism for specifying additional constraints on element
content and attributes (and more). What you get is a 'reportable SGML
error' when there is an error due to an invalid SGML construct, and a
'reportable architectural error' (RAE) when the document is valid SGML
but invalid with respect to some architecture imposed on it (e.g.
HyTime).
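The two classes of error can be sketched as two validation passes. The
constraint used here -- every <link> must carry an 'href' attribute --
is an invented stand-in for an architectural rule, and XML parsing
stands in for SGML parsing:

```python
import xml.etree.ElementTree as ET

def report(doc):
    """Classify a problem as a markup error (bad syntax) or as an
    architectural error (valid markup, broken architectural rule),
    in the spirit of HyTime's RAEs."""
    try:
        root = ET.fromstring(doc)
    except ET.ParseError as exc:
        return "markup error: %s" % exc
    for link in root.iter("link"):
        if "href" not in link.attrib:
            return "architectural error: <link> without href"
    return "valid"

print(report("<doc><link></doc>"))             # markup error (unclosed tag)
print(report("<doc><link/></doc>"))            # architectural error
print(report('<doc><link href="#a"/></doc>'))  # valid
```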

Now this clearly means that these constraints are inherent to the
architecture, and that the architecture must be *supported* by the
software you are using. In corpus building you may require HyTime
linking, so you process your documents not (only) with SGML-based
software, but with HyTime software (which is SGML compliant). The
definitions of architectural forms (AFOs) are inherent to HyTime (and
are part of the standard as such).

As the creation of 'enabling architectures' (as the authors wish to call
them) is not restricted to a particular application domain, an attempt
has been made to make them part of the 'global SGML vision'. Strangely,
this attempt has resulted in an annex to the HyTime standard (it should
be a standard in its own right, in my opinion, like some other annexes
to HyTime -- and DSSSL, for that matter). This most probably has only
historical reasons.

In the 1997 edition of the HyTime standard, the definition of AFOs is
given in the annex on 'architectural form definition requirements'
(AFDR), annex A.3. This annex explains how you can assign elements in
your document to 'semantic (behavioral) classes'. In the same edition,
annex A.2 defines lexical types ('lexical type definition requirements',
LTDR). With both annexes implemented in your software you can -- at
least -- add validation on the 'data level' (data content, attribute
values) in addition to the SGML markup level.

NSGMLS does offer a way of associating elements with architectures; it
does not support lexical types (as far as I can see). So some
programming will be required to validate data for lexical form.
(Actually, lexical types are expressed in a language that is declared by
a <!NOTATION declaration, so the rules themselves need not be part of
any SGML/architectural engine -- only the way to *invoke* them would
be.)
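Invoking such externally declared rules could look like the following
sketch: a registry relating a notation name (as it might appear in a
<!NOTATION ...> declaration) to a checker for that notation. The
notation names and rules here are invented for illustration:

```python
import re

# Hypothetical registry: lexical-type name -> validator.
# Neither name is taken from any real notation declaration.
LEXICAL_TYPES = {
    "ISODATE": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
    "URL":     lambda v: v.startswith(("http://", "ftp://")),
}

def validate(value, notation):
    """Invoke the rule declared for a notation. Unknown notations
    pass, just as an SGML parser ignores notations it cannot check."""
    checker = LEXICAL_TYPES.get(notation)
    return True if checker is None else checker(value)

print(validate("1997-11-18", "ISODATE"))   # True
print(validate("18 Nov 1997", "ISODATE"))  # False
```

The engine only needs the dispatch step; the rules themselves live
outside it, which matches the division of labour described above.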

I do not know to what extent AFOs and LTs will be implemented in future
software. As far as I can see, (aspects of) these rules have been
implemented in architectural engines: programs that cover a particular
semantic domain. For HyTime, the work of TechnoTeacher Inc. should be
mentioned; their HyTime support is based on exactly these formal
specifications. I cannot determine whether that work can easily be
extended to cover other architectures as well -- I would guess so; the
TechnoTeacher approach seems very strict, and it is likely that the
software module that handles architecture definition is separate from
the modules that implement the architectures.

Anyway, for any company that needs lexical validation beyond SGML
validation on large corpora, it may be wise to check out their internet
site at http://www.techno.com/ (this is NOT an advertisement!) and
contact the people working there. I have CC'ed this message to the
director -- we will see if he feels the urge to reply.

Unless broad consensus on these definition requirements is reached (i.e.
the ISO/IEC standard is actually picked up by vendors), we may require a
simpler approach. If we can express the lexical rules (lexical form,
normalization, constraints, etc.), organize them in a single 'lexical
driver document', and formally record the relation between an element or
architectural form and these rules, we can bridge the time between now
and a future in which consensus *does* exist on this level (most
probably in an XML context). In that case we would only have to
rearrange the set of lexical specs to fit the new requirements; we would
not have to rewrite the specs themselves -- at least, if we start out
the right way.

So, in corpus building we may decide to start recording these rules now
and use scripts to apply them to our SGML documents, awaiting full
(formal) support of lexical types. OK, we need some programming there,
but that should be no problem; the real cost (time, money) lies in
determining what these rules are, how to express them, and which
elements they apply to. In a way, we have to perform document analysis
on the data level as well as the structure level.
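Such a 'lexical driver' need not be complicated to start with. A minimal
sketch: one central table relating element names to lexical constraints,
applied by a script. The element names and patterns are invented, and an
XML parser again stands in for an SGML one:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lexical driver: element name -> constraint on content.
# Recording the rules in one place means they can later be recast in
# whatever formalism consensus settles on, without rewriting them.
DRIVER = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "page": r"\d+",
}

def apply_driver(doc, driver):
    """Walk the document and report every element whose character
    data violates the rule recorded for it in the driver."""
    errors = []
    for elem in ET.fromstring(doc).iter():
        pattern = driver.get(elem.tag)
        if pattern is None:
            continue
        text = (elem.text or "").strip()
        if not re.fullmatch(pattern, text):
            errors.append("<%s>: %r violates %r" % (elem.tag, text, pattern))
    return errors

doc = "<div><date>1997-11-18</date><page>iv</page></div>"
print(apply_driver(doc, DRIVER))  # one error, for <page>
```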

I would personally recommend that SGML corpus creators introduce this
kind of analysis into corpus design in a formal way.

All this, of course, will most probably not allow you to determine
whether a <P> actually contains a 'paragraph' -- which leads to the
conclusion that this message doesn't answer the question. But it may
help *someone* at least.

Arjan.

--

Arjan Loeffen
Computer & Arts, Faculty of Arts, Utrecht University
Arjan.Loeffen@let.ruu.nl
http://CandL.let.ruu.nl