Re: [Corpora-List] Is the TEI a waste of time?

From: Sylvain Loiseau (sylvain@toucheraveclesyeux.com)
Date: Tue Jul 01 2003 - 10:58:39 MET DST


    From: "Marco Baroni" <baroni@sslmit.unibo.it>
    > Obviously, this is not the current situation, and in the real world the
    > presence of TEI-encoding can be a (minor) hassle, since many tools you
    > may want to use (pos taggers, morphological analyzers, machine learning
    > packages, databases, command-line programs, your own scripts) are not
    > TEI-compatible, and TEI is not the easiest format to deal with (as
    > compared to, eg, tab-delimited text...)
    >
    > I suppose that the best way for people in favor of TEI to convince
    > others to adopt the standard would be to provide all sorts of cool
    > TEI-conformant tools: programs helping (manual and automated)
    > TEI-encoding, programs that perform all sorts of linguistic and
    > statistical analyses of TEI-encoded data, indexers and fast searching
    > engines for TEI-encoded corpora, TEI-db's, input/output conversion
    > tools...

    I agree with this idea. It is surprising how little software exists for
    TEI corpora. The TEI is a waste of time only if the encoding is
    under-exploited, which is a problem for the researcher, not for the TEI.
    As G. Williams said, a minimal encoding, with a hastily pasted header and
    <p> tags added by word-processor regex, takes only a few minutes. But
    there is no public framework or toolset for processing TEI corpora that
    would make the encoding easy to exploit: a concordancer built on a SAX
    stream, for example. A set of classes that handles calling the parser,
    SAX rewriting, and so on, and lets you simply plug SAX handlers or XSLT
    stylesheets into the pipeline, could be very useful. XML gains ground
    wherever it standardizes both the formats and the software methodologies,
    whereas the TEI remains a pure standard.
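
    To make the idea of a SAX-based concordancer concrete, here is a minimal
    sketch in Python using the standard xml.sax module. It is only an
    illustration, not an existing TEI tool: the names ParagraphCollector and
    concordance are hypothetical, and it assumes a corpus whose running text
    sits in <p> elements without a namespace prefix.

```python
import re
import xml.sax


class ParagraphCollector(xml.sax.ContentHandler):
    """Hypothetical SAX handler: collects the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0    # greater than 0 while inside a <p>
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "p":
            self._depth += 1

    def characters(self, content):
        # Text inside child elements of <p> is kept as well.
        if self._depth:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "p":
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace before storing the paragraph.
                text = " ".join("".join(self._buffer).split())
                self.paragraphs.append(text)
                self._buffer = []


def concordance(xml_text, keyword, width=2):
    """Return (left context, hit, right context) tuples for keyword."""
    handler = ParagraphCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    hits = []
    for paragraph in handler.paragraphs:
        tokens = re.findall(r"\w+", paragraph)  # naive tokenization
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, token, right))
    return hits
```

    Calling concordance(corpus_string, "corpus") on a TEI document held in a
    string returns one tuple per occurrence. The point of the design is that
    the handler is the only TEI-aware part, so other handlers (or an XSLT
    step) could be swapped into the same pipeline.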

    I think the TEI is obviously necessary for the view G. Williams defends (a
    corpus is not a bag of words) and for interoperability, etc. But I agree
    that the TEI is perhaps "out of date" on some points: there is nothing for
    morphosyntactic or morphological encoding, text profiling, etc. The TEI
    is perhaps still not sufficiently adapted to linguistic corpora. This
    is quite obvious from the projects listed on tei-c.org: they are
    mainly philological uses of the TEI.

    Sylvain Loiseau


