Re: [Corpora-List] Corpus Sanitation

From: Chris Brew (cbrew@ling.ohio-state.edu)
Date: Fri Nov 29 2002 - 17:20:11 MET

    Another comment on anonymisation. The problem is even worse if one wishes
    to make available (for whatever purpose) the audio or video tapes from
    which transcriptions have been prepared. I believe that considering this
    more challenging case also clarifies issues for text-only corpora. I'll
    assume video, which is the extreme point.

    There are two problems with video. No amount of signal manipulation,
    however small, preserves the full scientific usefulness of the data.
    On the other hand, no reasonable amount of "anonymisation", however large,
    really ensures anonymity.

    The first point is obvious if one contemplates trying to do psycholinguistic
    experiments with the data. It would for example seriously compromise a
    comprehension study if proper names are bleeped out, or even replaced by
    others. No psychology reviewer would ever accept that such data is as
    naturalistic as untreated data.

    The second point is almost as obvious, because humans are adept at inferring
    personal identity from all kinds of things, including voice quality, ear
    shape, gait and so on. Therefore, whatever one does, short of blanking
    everything out, it is difficult to credibly claim that the risk of unintended
    identification has been avoided. Geoffrey Sampson's Christine documentation
    makes the case that such identification is highly undesirable.

    The only way I can see to handle this is to deal with the problem at the
    outset by making completely clear to participants what will happen to their
    data, and obtaining informed consent. If this is not done, the data is
    effectively lost to responsible researchers, and cannot be used except
    at risk of infringing participants' rights. If, as in Sampson's case,
    promises have been made to participants, those promises must be
    honoured. It may or may not be possible to recover data that is both
    useful and distributable under these circumstances.

    The same difficulties are also present for audio (the BNC audio has never
    been distributed, even though it exists). The risk of identification is just
    too great and the consequences of that too severe to be acceptable. Although
    the visual cues to identity are absent, speaking style persists. One may
    feel that the risks are less, but they still exist.

    And, here's the rub, the same arguments apply, albeit more weakly, to
    text-only corpora. While voice quality is now absent, substantial cues to
    personal identity may persist in lexis and other idiosyncrasies, not to
    mention that people are extremely adept at reconstructing material from
    context. Once again, the risks are arguably less than in the other media,
    but they still exist. So, notwithstanding valiant efforts to anonymise in
    such a way that the scientific usefulness of the data is preserved, the
    original decision to promise anonymity comes back to haunt us. I lean to the
    view that there is no difference of principle between the different media.

    Is even Sampson's rigorous approach to anonymisation enough in practice?
    Perhaps, but that depends on a very iffy judgement call. The lesson seems
    to be that great care is needed in collecting informed consent for corpus
    work.

    None of this addresses the additional point made in Sampson's post about
    collateral damage to people and organisations not involved in the recording.
    I could imagine a prosecution against both participants and corpus
    distributors for defamation or slander. That would be bad. Perhaps corpus
    collectors need to indemnify participants against this, or perhaps it
    suffices to warn people that they are (in effect) speaking in a public
    place. Or perhaps we have a duty of care to ensure that our participants do
    not put themselves at risk (doubly likely since many corpora include
    contributions by children). And that leaves aside the much more likely cases
    where nasty stuff in the corpus evokes resentment and unhappiness, but not
    enough to lead to prosecutions.

    Chris

    ==================================================================
    Dr. Chris Brew, Assistant Professor of Computational Linguistics
    Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
    Tel: +614 292 5420 Fax: +614 292 8833
    Web:http://www.ling.ohio-state.edu/~cbrew Email:cbrew@ling.osu.edu
    ==================================================================

    >
    > On Zheng Zhiping's posting, to me there is an important difference between
    > "bad language" and individuals' names, or information that could lead to
    > identification of individuals. Like Tony McEnery, I don't believe there is
    > any real reason to censor the "bad language"; it is important linguistic
    > data, and we are all grown-ups. But I do think that before such a resource
    > is made public, strenuous efforts should be made to eliminate any possibility
    > of users identifying either the individuals who produced the material, or
    > any individuals or individual institutions written about. Actually, under
    > various national Data Protection laws I suspect it might be illegal not to
    > do this, even if the material is simply held at one institution and not
    > circulated. But it ought to be done anyway, for reasons that I discuss at
    > some length in the "ethics" section of the documentation file accompanying
    > my CHRISTINE1 Corpus (available via the Web, from my home page
    > www.grsampson.net: follow links to downloadable research resources and then
    > CHRISTINE). I discuss there what seems to me to have been inadequate
    > practice in this respect in the spoken section of the British National Corpus.
    > There are places where really damaging things are said in a quite casual
    > way in conversation about people, or organizations, who/which might easily
    > be identified by people who know them (and could probably be identified by
    > strangers with only minimal detective work). The recorded speakers had no
    > motive to worry about this, but I believe corpus linguists have a responsibility
    > not to let such casual gossip about identifiable people be turned into
    > permanent public records.
    >
    > Geoffrey Sampson
    >
    >
    > Prof. G.R. Sampson MA PhD MBCS
    >
    > Professor of Natural Language Computing
    > School of Cognitive & Computing Sciences
    > University of Sussex
    > Falmer, Brighton BN1 9QH, GB
    >
    > e-mail geoffs@cogs.susx.ac.uk (no attachments please)
    > tel. +44 1273 678525
    > fax +44 1273 671320
    > web http://www.grsampson.net
