RE: [Corpora-List] Is the TEI a waste of time?

From: Oliver Christ (oliver.christ@trados.com)
Date: Tue Jul 01 2003 - 12:27:27 MET DST

    Hi,

    I find this discussion very interesting, but would like to learn more about
    what those who are more familiar with the topic than I am have to say about
    TEI's "competitors", e.g. CES/XCES (http://www.cs.vassar.edu/CES/ and
    http://www.cs.vassar.edu/XCES/).

    Cheers, Oli

    > -----Original Message-----
    > From: Sylvain Loiseau [mailto:sylvain@toucheraveclesyeux.com]
    > Sent: Tuesday, July 01, 2003 10:59 AM
    > To: Marco Baroni; corpora@uib.no
    > Subject: Re: [Corpora-List] Is the TEI a waste of time?
    >
    >
    > From: "Marco Baroni" <baroni@sslmit.unibo.it>
    > > Obviously, this is not the current situation, and in the real world
    > > the presence of TEI-encoding can be a (minor) hassle, since many tools
    > > you may want to use (pos taggers, morphological analyzers, machine
    > > learning packages, databases, command-line programs, your own scripts)
    > > are not TEI-compatible, and TEI is not the easiest format to deal with
    > > (as compared to, e.g., tab-delimited text...)
    > >
    > > I suppose that the best way for people in favor of TEI to convince
    > > others to adopt the standard would be to provide all sorts of cool
    > > TEI-conformant tools: programs helping (manual and automated)
    > > TEI-encoding, programs that perform all sorts of linguistic and
    > > statistical analyses of TEI-encoded data, indexers and fast searching
    > > engines for TEI-encoded corpora, TEI-db's, input/output conversion
    > > tools...
    >
    > I agree with this idea. It is surprising to see how little
    > software there is for TEI corpora. The TEI is a waste of time
    > only if the encoding is under-exploited, which is a problem
    > for the researcher, not for the TEI. As G. Williams said, a
    > minimal encoding, with a hastily pasted header and <p> tags
    > added by word-processor regex, takes only a few minutes.
    > But there is no public framework or set of tools for
    > processing TEI corpora that would make the encoding easy to
    > exploit, such as a concordancer based on a SAX stream. Something
    > like a set of classes for calling the parser, rewriting SAX
    > events, etc., so that one only has to insert SAX handlers or
    > XSLT stylesheets into the pipeline, could be very useful. While
    > XML always gains ground when it normalizes both the standards
    > and the software methodologies, the TEI remains a pure standard.
    >
    > I think the TEI is obviously necessary for the view G.
    > Williams defends (a corpus is not a bag of words) and for
    > interoperability, etc. But I agree that the TEI is perhaps
    > "out of date" on some points: there is nothing for
    > morphosyntactic or morphological encoding, text profiling, etc.
    > The TEI is perhaps still not sufficiently adapted to
    > linguistic corpora. This is quite obvious if we look at the
    > projects listed on tei-c.org: they are mainly philological uses
    > of the TEI.
    >
    > Sylvain Loiseau
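
[Editorial illustration, not part of the original messages.] The
"concordancer based on SAX stream" mentioned above could look roughly
like the following. This is only a minimal sketch using Python's
standard xml.sax module; the names KwicHandler, concordance and SAMPLE
are hypothetical, the sample document is made up and far from a
complete TEI file, and the whitespace tokenization is deliberately
naive. A real tool would have to deal with headers, namespaces and
proper tokenization.

```python
import xml.sax


class KwicHandler(xml.sax.ContentHandler):
    """Collect keyword-in-context hits from the text inside <p> elements."""

    def __init__(self, keyword, width=3):
        super().__init__()
        self.keyword = keyword.lower()
        self.width = width      # number of context tokens on each side
        self.depth = 0          # nesting depth of open <p> elements
        self.chunks = []        # character data of the current paragraph
        self.hits = []          # (left context, token, right context)

    def startElement(self, name, attrs):
        if name == "p":
            self.depth += 1

    def characters(self, content):
        # SAX may deliver text in several pieces, so buffer it.
        if self.depth:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "p":
            self.depth -= 1
            if self.depth == 0:
                self._scan("".join(self.chunks).split())
                self.chunks = []

    def _scan(self, tokens):
        for i, tok in enumerate(tokens):
            # Naive normalization: strip surrounding punctuation, lowercase.
            if tok.strip(".,;:!?\"'()").lower() == self.keyword:
                left = " ".join(tokens[max(0, i - self.width):i])
                right = " ".join(tokens[i + 1:i + 1 + self.width])
                self.hits.append((left, tok, right))


def concordance(xml_text, keyword, width=3):
    """Parse xml_text as a SAX stream and return the KWIC hits."""
    handler = KwicHandler(keyword, width)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.hits


# Made-up sample, loosely shaped like a TEI body.
SAMPLE = """<TEI.2><text><body>
<p>A corpus is not a bag of words.</p>
<p>Each corpus needs a header, and each header describes the corpus.</p>
</body></text></TEI.2>"""

if __name__ == "__main__":
    for left, tok, right in concordance(SAMPLE, "corpus", width=2):
        print(f"{left:>15} [{tok}] {right}")
```

Because the handler only reacts to SAX events, other handlers (or an
XSLT step before parsing) could be chained in front of it, which is the
kind of pluggable pipeline the message asks for.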