[Corpora-List] Corpus Sanitation

From: Geoffrey Sampson (geoffs@cogs.susx.ac.uk)
Date: Fri Nov 29 2002 - 11:18:37 MET


    On Zheng Zhiping's posting, to me there is an important difference between
    "bad language" and individuals' names, or information that could lead to
    identification of individuals. Like Tony McEnery, I don't believe there is
    any real reason to censor the "bad language"; it is important linguistic
    data, and we are all grown-ups. But I do think that before such a resource
    is made public, strenuous efforts should be made to eliminate any possibility
    of users identifying either the individuals who produced the material, or
    any individuals or individual institutions written about. Actually, under
    various national Data Protection laws I suspect it might be illegal not to
    do this, even if the material is simply held at one institution and not
    circulated. But it ought to be done anyway, for reasons that I discuss at
    some length in the "ethics" section of the documentation file accompanying
    my CHRISTINE1 Corpus (available via the Web from my home page,
    www.grsampson.net: follow links to downloadable research resources and then
    CHRISTINE). I discuss there what seems to me to have been inadequate
    practice in this respect in the spoken section of the British National Corpus.
    There are places where really damaging things are said in a quite casual
    way in conversation about people, or organizations, who/which might easily
    be identified by people who know them (and could probably be identified by
    strangers with only minimal detective work). The recorded speakers had no
    motive to worry about this, but I believe corpus linguists have a responsibility
    not to let such casual gossip about identifiable people be turned into
    permanent public records.

    Geoffrey Sampson

    Prof. G.R. Sampson MA PhD MBCS

    Professor of Natural Language Computing
    School of Cognitive & Computing Sciences
    University of Sussex
    Falmer, Brighton BN1 9QH, GB

    e-mail geoffs@cogs.susx.ac.uk (no attachments please)
    tel. +44 1273 678525
    fax +44 1273 671320
    web http://www.grsampson.net


