Re: TEI for POS-tagging? Query language?

C. M. Sperberg-McQueen (U35395@UICVM.bitnet)
Wed, 05 Apr 95 09:24:47 CDT

On Wed, 5 Apr 1995 10:01:20 +0200 Torbjoern Lager said:
>I understand that the Text Encoding Initiative (TEI) has chosen SGML
>for linguistically motivated coding of corpus texts. I need to write a
>short piece describing this approach.
>
>Now I'm not sure I have understood exactly how to use SGML for the
>purpose of (say) part-of-speech coding. I imagine something along the
>following line (I'm not concerned about the actual tagset used, more
>about the general idea):
>
> <pn>John</pn><v>loves</v><pn>Mary</pn><full_stop>.</full_stop>
>
>Is this correct? If this isn't a good example of the use of the TEI
>approach, would someone please provide me with a better example.

A good question, and well put. I don't have time to do it justice, but
I will say, before running to the airport, that this IS one way to do it,
and before the TEI work group on linguistic analysis started its work,
this is more or less what I expected to come out of their work.

What came out, however, was much better (which is not surprising,
since the members of the work group were all better linguists than I
am, and probably just generally smarter). The problem with this
approach is that it ties users of the SGML tag set to a particular
set of parts of speech: you have to believe that there are such
things as verbs, nouns, etc. And you may have noticed, looking around,
that there is no set of parts of speech which commands anything like
consensus in any field concerned. Even in the restricted world of
English-language corpus linguistics, no two annotated corpora use the
same set of parts of speech.

Well, I thought, fine, we'll use a level of indirection. Instead of
tags for parts of speech, we'll use a tag which allows the user to
provide any value they WANT to for parts of speech. Something like

<word pos=pn>John</>
<word pos=v>loves</>
<word pos=pn>Mary</>
<punc type=full_stop>.</>

This is better at ensuring the intellectual independence of the user
and thus the intellectual integrity of the encoding. But it commits
the user to a belief in things called 'parts of speech'. And while
there are a lot of people who do believe in them, there are also a
number -- some very good linguists among them -- who don't. And also
there are a lot of people, linguists and others, interested in
linguistic phenomena other than parts of speech, or phrase structure.

How can a general purpose encoding scheme meet all their diverse needs?

By adding another level of generality. Instead of letting you specify
any value you like for POS and other features while still tying you to
a specific set of features, the TEI allows you to specify ANY FEATURE
you are interested in, and ANY VALUE you like for it: atomic values,
nested sets of features, etc.

The result is an exceptionally flexible system of annotation built
around the notion of feature structures (though not tied to the
linguistic theories of those linguists who use feature structures
prominently; it would be almost as close to the truth to refer to the
TEI annotation mechanisms as data-base-record structures).
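
To make the record analogy concrete, here is a minimal sketch in Python
(not TEI markup, and not taken from the Guidelines) of tokens that each
carry an arbitrary set of features, where a value may be atomic or a
nested set of features. The feature names `pos`, `agr`, `person` and
`number`, and the helper `get_path`, are illustrative assumptions only.

```python
# Illustrative only: plain Python dicts standing in for feature structures.
# All feature names and values here are hypothetical, not a TEI tag set.

# Each token carries an arbitrary set of features; a value may be atomic
# (a string) or itself a nested set of features (another dict).
tokens = [
    {"form": "John", "fs": {"pos": "pn"}},
    {"form": "loves", "fs": {"pos": "v", "agr": {"person": "3", "number": "sg"}}},
    {"form": "Mary", "fs": {"pos": "pn"}},
    {"form": ".", "fs": {"type": "full_stop"}},
]


def get_path(fs, path):
    """Follow a sequence of feature names into a nested feature structure.

    Returns None if any feature along the path is absent.
    """
    current = fs
    for name in path:
        if not isinstance(current, dict) or name not in current:
            return None
        current = current[name]
    return current


if __name__ == "__main__":
    for tok in tokens:
        print(tok["form"],
              get_path(tok["fs"], ["pos"]),
              get_path(tok["fs"], ["agr", "number"]))
```

Nothing in the sketch knows what a "part of speech" is: `pos` is just
one feature name among others, which is the point of the extra level of
generality.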

I don't have the time to transcribe an example here, but examples
of word-class annotation may be found in the chapter on feature
structures in the TEI Guidelines. Get file p3fs.p3x or p3fs.doc
from ftp.ex.ac.uk (in pub/sgml/tei/p3 ...) or ftp-tei.uic.edu and
have a look.

>Also, how would a good example of the coding of phrase structure
>look like? For the sentence "John loves Mary", say?

The same chapter also has examples of phrase-structure annotation.
(But the feature structure mechanism could be used equally well for
dependency grammar or other theories of syntax).
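
As a rough illustration of that neutrality (in Python rather than TEI
markup, and not drawn from the Guidelines), the same kind of nested
record can hold either a phrase-structure or a dependency analysis of
"John loves Mary". The category and relation names (`S`, `NP`, `VP`,
`subj`, `obj`, `root`) and the helper functions are hypothetical.

```python
# Illustrative only: two analyses of "John loves Mary" as nested records.
# Category and relation names are hypothetical, not prescribed by the TEI.

# Phrase structure: each node has a category and either children or a word.
phrase_structure = {
    "cat": "S",
    "children": [
        {"cat": "NP", "children": [{"cat": "pn", "form": "John"}]},
        {"cat": "VP", "children": [
            {"cat": "v", "form": "loves"},
            {"cat": "NP", "children": [{"cat": "pn", "form": "Mary"}]},
        ]},
    ],
}

# Dependency: each word records its head (by index) and its relation to it.
dependency = [
    {"form": "John", "head": 1, "rel": "subj"},
    {"form": "loves", "head": None, "rel": "root"},
    {"form": "Mary", "head": 1, "rel": "obj"},
]


def leaves(node):
    """Return the word forms under a phrase-structure node, left to right."""
    if "form" in node:
        return [node["form"]]
    words = []
    for child in node["children"]:
        words.extend(leaves(child))
    return words


def dependents(deps, head_index):
    """Return (relation, form) pairs for the words attached to a head."""
    return [(d["rel"], d["form"]) for d in deps if d["head"] == head_index]
```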

>Another question comes to my mind: Does the TEI consider it their task
>to design and specify a _query_language_ to match SGML-coded texts, or
>is that a problem left open to the implementors of tools? I mean, how
>would one, for example, specify a search for verbs immediately followed
>by nouns? Or a concordance of adjectives _not_ followed by nouns? As I
>understand it, tools that do useful things with TEI/SGML coded text are
>not yet available. Wouldn't a careful, formal specification of a query
>language speed up the process of developing such tools?

It might well; it's a good assignment for further research. The
kernel of a query language is described in connection with the TEI
'extended pointer' syntax in the chapter on hypertext linking
(that's files p3sa.p3x and p3sa.doc).
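
For a sense of what such queries amount to once the annotation has been
read into a program, here is a small Python sketch of the two searches
asked about. It is not the extended pointer syntax and not a proposed
query language; the (form, tag) pairs, the tag names `v`, `n` and `adj`,
and both function names are assumptions made up for the example.

```python
# Illustrative only: a tiny stand-in for the kind of search asked about.
# The (form, pos) pairs and the tag names "v", "n", "adj" are hypothetical.

def followed_by(tokens, first, second):
    """Yield (index, form) of each token tagged `first` whose immediate
    successor is tagged `second`."""
    for i in range(len(tokens) - 1):
        if tokens[i][1] == first and tokens[i + 1][1] == second:
            yield i, tokens[i][0]


def not_followed_by(tokens, first, second):
    """Yield (index, form) of each token tagged `first` whose immediate
    successor is absent or not tagged `second`."""
    for i, (form, pos) in enumerate(tokens):
        if pos != first:
            continue
        nxt = tokens[i + 1][1] if i + 1 < len(tokens) else None
        if nxt != second:
            yield i, form


sample = [
    ("the", "det"), ("old", "adj"), ("dog", "n"), ("is", "v"),
    ("old", "adj"), ("and", "conj"), ("chases", "v"), ("cats", "n"),
]

if __name__ == "__main__":
    print(list(followed_by(sample, "v", "n")))        # verbs before nouns
    print(list(not_followed_by(sample, "adj", "n")))  # adjectives not before nouns
```

A real query language would also have to say what "immediately
followed" means across markup boundaries, which is part of why a formal
specification would be worth having.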

Thanks for the query; good luck on the short paper!

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago
u35395@uicvm.uic.edu / u35395@uicvm

"Clarity, Precision and Ease of use does not mean Confinement, Verbosity
and Futility." -Jean Pierre Gaspart