Re: Corpora: Anonymisation

James L. Fidelholtz (jfidel@siu.buap.mx)
Tue, 9 Mar 1999 10:34:57 -0600 (CST)

On Mon, 8 Mar 1999, Frances Rock wrote:

>Dear all
>
>I am currently preparing a short paper on anonymisation of data and
>have three sets of questions about this.
[snip]
>1) Your practical experiences: Have you ever anonymised data (eg for
>inclusion within a corpus) by removing personal names, place names,
>business names etc.?

I'm in the preliminary stages of collecting a Spanish corpus. However,
I don't like anonymization, for a number of reasons. One is that I have
a personal interest in working on proper nouns, and anonymization
obviously distorts any statistics on them. Admittedly, there is a real
problem with material not in the public domain (speech and telephone
transcriptions, email, etc.). Instead of anonymizing, it may be
important to caution users against publishing certain portions of the
corpus (those which an anonymizer would have anonymized), which would
then have to be flagged in some way.
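
To make the flagging idea concrete, here is a minimal sketch in Python,
not a worked-out tool. It assumes the sensitive items are already
available as a list of strings; the names `flag_spans` and `mark_up` and
the `<restricted>` marker are hypothetical, invented for this
illustration. The text itself is left untouched, so counts of proper
nouns are unaffected, and the flags only record where the sensitive
stretches are.

```python
import re
from typing import List, NamedTuple


class Flag(NamedTuple):
    start: int  # offset of the first character of the flagged span
    end: int    # offset just past the last character
    text: str   # the flagged string, kept exactly as it appears


def flag_spans(text: str, terms: List[str]) -> List[Flag]:
    """Locate each listed term in the text without altering the text."""
    if not terms:
        return []
    # Try longer terms first, so a full name wins over a first name alone.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    )
    return [Flag(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def mark_up(text: str, flags: List[Flag],
            open_tag: str = "<restricted>",
            close_tag: str = "</restricted>") -> str:
    """Return a copy of the text with each flagged span wrapped in markers.

    The marker strings are an assumption made for this sketch only.
    """
    pieces = []
    pos = 0
    for f in flags:
        pieces.append(text[pos:f.start])
        pieces.append(open_tag + f.text + close_tag)
        pos = f.end
    pieces.append(text[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Ana Garcia llamó a Puebla ayer."
    found = flag_spans(sample, ["Ana", "Ana Garcia", "Puebla"])
    print(mark_up(sample, found))
```

Keeping the flags as offsets separate from the text means the same
corpus can be distributed in full for analysis, while a publisher can
check the flags before reproducing any passage.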

[snip]

>3) General/Theoretical questions: What is anonymisation? In
>pursuit of this: When, if ever, is anonymisation necessary? What
>exactly should be anonymised? What can be used to replace items which
>'need' to be anonymised? What kinds of information can reasonably be
>preserved without an infringement of individuals' rights? What kinds
>of information need to be preserved to aid effective analysis?

See above comments. I certainly wouldn't anonymize anything in the
public domain already (ie, published stuff).

>Several people have commented that this is a bit of a 'non-issue', I
>am also interested in hearing more about that point of view.

Those of us who are academics would like to think that this is a
non-issue. But if libelous material finds its way into the corpus,
inadvertently or otherwise, anyone (eg, a publishing house) who publishes
the data in any way would then be liable under the laws of the country
in question (esp. the US and UK, but other countries have weird laws: eg,
in the state of Puebla, Mexico, you can successfully be sued for libel
even if what you said is provably true!). So while it is not, in my
opinion, a non-issue, I personally would not worry much about it, except
to protect people's privacy, perhaps by the flagging method I mentioned
above.

Jim

James L. Fidelholtz e-mail: jfidel@siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO