Re: [Corpora-List] Is the TEI a waste of time?

From: David Graff (graff@unagi.cis.upenn.edu)
Date: Fri Jun 27 2003 - 12:44:59 MET DST

    geoffrey.williams@wanadoo.fr said:
    > Easy access to vast amounts of downloadable data has meant that a
    > number of "corpus linguists" neither know nor care about the niceties
    > of corpus creation, and the whys and wherefores of selecting and
    > marking up data. Ease of access has become the main criterion,
    > potentially to the detriment of the discipline itself. Easy solutions
    > do not necessarily answer the most pertinent questions.

    I agree wholeheartedly with these points. However, it is possible to
    devote all due attention and care to the "niceties, whys and wherefores"
    without strict adherence to the full details of TEI specifications. That
    is, one can create a quite useful corpus with a relatively simple and
    shallow markup structure, and with much of the information about the
    corpus content provided as separate documentation, tables, or
    stand-off annotations (rather than as in-line markup attached to the
    data).
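
A minimal sketch of the stand-off idea mentioned above, assuming annotations are stored as character offsets into an untouched text. The `Annotation` class, the sample sentence, the labels, and the `spans_with_label` helper are all hypothetical and chosen only for illustration; they do not come from TEI or from any LDC format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: a labelled character span (hypothetical schema)."""
    start: int  # offset of the first character in the span (inclusive)
    end: int    # offset just past the last character (exclusive)
    label: str


# The primary data stays free of in-line tags.
text = "The corpus was collected in 2003."

# Annotations live apart from the text and point into it by offset.
annotations = [
    Annotation(4, 10, "NOUN"),
    Annotation(28, 32, "YEAR"),
]


def spans_with_label(text, annotations, label):
    """Return the substrings of `text` covered by annotations carrying `label`."""
    return [text[a.start:a.end] for a in annotations if a.label == label]


print(spans_with_label(text, annotations, "YEAR"))
```

A task that does not need a given annotation layer can ignore it and read the plain text directly, which is the practical appeal of keeping markup out of the data.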

    I would differentiate between "ease of access" and "ease of use". Yes,
    easy access to downloadable data sets (e.g. pointing "wget -r ..." at
    any number of web sites) can lead to some very messy collections that
    won't answer any question very well (except "How quickly can you fill
    your hard disk?"); and cleaning up this sort of mess to produce useful
    language data is complicated, time-consuming work.

    But when that complicated work is actually done, the end product is most
    useful when it is easy to process, browse, search, summarize, etc. In
    this regard, I tend to prefer markup that supports and simplifies the
    computational uses of the data, and doesn't impose a heavy burden of
    parsing through complex headers or intrusive in-line tags, where much of
    the detail provided by the markup will tend to be irrelevant to any
    given task at hand.

    I have seen and/or been party to both extremes -- heavy markup vs. no
    markup. Regardless of where one chooses to sit on that scale, creating
    a corpus of good quality is still a lot of work. But other things being
    equal, a corpus with little markup can be just as useful as one with
    lots, and will tend to be easier to use.

    -----------
    David Graff               Linguistic Data Consortium
    graff@ldc.upenn.edu       3600 Market St., Suite 810
    voice: (215) 898-0887     University of Pennsylvania
    fax: (215) 573-2175       Philadelphia, PA 19104
                              http://www.ldc.upenn.edu


