Re: Corpora: Santa Barbara Corpus

From: Christopher Cieri (ccieri@ldc.upenn.edu)
Date: Tue Aug 08 2000 - 02:12:21 MET DST


    Lou, Chris,

    Thanks for reminding us of a crucial issue in corpus distribution. Implicit in
    this discussion is, I think, acknowledgement that any single format will be
    appealing to those research communities who have adopted it but not
    necessarily to others. We chose the particular formats used for SBCSAE after
    consulting with the corpus developers, who felt that those formats would be
    most appropriate to the research communities most likely to use the data.
    However, I don't want to imply that this closes the discussion. As you know,
    LDC is very interested in the issues of standards and tools for access to
    shared data and the problems of corpus reuse and reannotation (see
    http://www.itl.nist.gov/iaui/894.01/atlas/,
    http://www.ldc.upenn.edu/sb/isle.html, http://www.talkbank.org/,
    http://www.ldc.upenn.edu/Papers/LREC2000/multiuse.pdf). We welcome suggestions
    on ways to make our corpora more useful and would certainly consider any
    reasonable request from a research community to provide data in an alternate
    format.

    Best wishes,
    Chris

    Lou Burnard wrote:

    > On Mon, 7 Aug 2000, Chris Manning wrote:
    >
    > |On 7 August 2000, Lou Burnard wrote:
    > | > Hmm. So instead of using pre-existing standards which at least have a
    > | > chance of being implemented across different computer platforms, it's
    > | > better to make up an entirely arbitrary set of codes of your own for
    > | > which *everyone* has to write their own software?
    > |
    > |This is a little harsh. The transcription format used has existed and
    > |been developed for many years in the conversational/discourse analysis
    > |community -- and versions of it can be found in books such as Edwards'
    > |Talking Data: Transcription and Coding in Discourse Research or
    > |Schiffrin's Approaches to Discourse.
    > |
    > |At most the LDC could be faulted for leaving the data in such a format
    > |-- one clearly designed more for human observation than easy computer
    > |manipulation -- rather than converting it to a more computer-friendly
    > |standard markup.
    >
    > Fair point, well made. Thanks Chris! Put the harshness down to my
    > general gloom at being confronted with 300 email messages after ten
    > days on a beach in Northern Portugal... But the devil in all digital
    > affairs is in the detail and it's that phrase "versions of it" that gives
    > away why it's a retrograde step for a project with such high visibility,
    > importance, and resourcing to distribute such wonderful data in a way
    > that makes it REALLY DIFFICULT for a computer to analyse it. If we're only
    > concerned to produce data for humans to read, let's print it out on bits
    > of paper.
    >
    > Lou
    >
    > ----------------------------------------------------------------
    > Lou Burnard http://users.ox.ac.uk/~lou
    > ----------------------------------------------------------------

    --
    Christopher Cieri
    Executive Director, Linguistic Data Consortium
    3615 Market Street, Philadelphia, PA 19104-2608 USA
    phone: 215-573-5489, fax: 215-573-2175
    mailto:Christopher.Cieri@ldc.upenn.edu
    http://www.ldc.upenn.edu
    




    This archive was generated by hypermail 2b29 : Tue Aug 08 2000 - 02:03:11 MET DST