Re: [Corpora-List] Corpus Sanitation

From: Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Date: Sat Nov 30 2002 - 14:02:42 MET

    All,

        As academics, we would like to leave the data entirely uncorrupted,
    so we'd rather not anonymise - but then ethical issues mean, for some
    purposes, we have to.

        Exactly the same applies to 'bad' (taboo, as LDOCE3 marks it)
    language. I have datasets I'd like to give easy access to, for language
    learners. Do I want children/young people accessing my website/CD-ROM
    to encounter taboo language? Will I be exposed to lawsuits from shocked
    parents if I do?

        Like anonymisation, it's hard. Throwing out sentences/texts with
    taboo words or strings is at least straightforward - you can find them
    without exhaustive reading. But, as with anonymisation where there are
    no explicit names, there are taboo texts without taboo words. So if
    your sources aim to cover the breadth of language use and you want to
    be confident you are not disseminating taboo language, an awful lot of
    reading is required. I recently had a conversation with a dictionary
    publisher facing the same predicament: yes, he did an awful lot of reading.
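
        A minimal sketch of the straightforward part, throwing out
    sentences that contain listed taboo words or strings, might look
    like the following. The word list and the function names are
    hypothetical placeholders for illustration, and matching is
    whole-word and case-insensitive by assumption.

```python
import re

# Hypothetical placeholder list; a real project would load its own lexicon.
TABOO_WORDS = {"darn", "heck"}


def build_pattern(words):
    """Compile one case-insensitive, whole-word pattern for all listed words."""
    alternatives = "|".join(re.escape(w) for w in sorted(words))
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def split_by_taboo(sentences, words=TABOO_WORDS):
    """Return (kept, flagged): sentences without and with a listed word."""
    if not words:
        # Nothing to match against, so nothing is flagged.
        return list(sentences), []
    pattern = build_pattern(words)
    kept, flagged = [], []
    for sentence in sentences:
        (flagged if pattern.search(sentence) else kept).append(sentence)
    return kept, flagged


if __name__ == "__main__":
    sample = [
        "What the heck is a corpus?",
        "How do concordancers work?",
    ]
    kept, flagged = split_by_taboo(sample)
    print("kept:", kept)
    print("flagged:", flagged)
```

        A filter like this only catches the listed strings. Taboo texts
    without taboo words pass straight through, so it does not remove
    the need for reading.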

        Adam

    Mcenery, Tony wrote:

    >Dear All,
    >
    >I was interested to read in the recent posting to the list by Zhiping Zheng
    >(see below) that he was uncertain as to whether he should make his corpus
    >publicly available because it contained some 'uncensored words' (Zhiping's
    >point 2). I guess that this means 'bad language' (I assume it does not relate
    >to anonymization issues as they are covered in Zhiping's point 1). If this is
    >about 'bleeping out' words in corpora, shouldn't we encourage Zhiping not to do
    >this? Surely we want corpora to contain uncensored speech? The point, for me,
    >of using corpora is to describe/account for language as it is, rather than
    >language as we wish it to be.
    >
    >Best,
    >
    >Tony
    >
    >----- Original Message -----
    >From: "Zhiping Zheng" <zzheng@umich.edu>
    >To: <corpora@hd.uib.no>
    >Sent: 21 November 2002 22:57
    >Subject: Re: [Corpora-List] Looking for some corpora about why-questions,
    >how-questions, and their answers.
    >
    >
    >
    >>Dear all,
    >>
    >>I got several responses asking if I am planning to make my question
    >>list public. I think I should answer this question to the whole list.
    >>
    >>I am willing to make it public but I am not sure if I should do it
    >>right now. Here are the reasons:
    >>
    >>1. Some questions ask information about specific people, not only
    >>celebrities, but also probably the questioners or other people with
    >>very close relationships to the questioners. This may raise some
    >>privacy issues. I prefer to take out these questions before making the
    >>question corpus public.
    >>
    >>2. Some questions, actually not a small number, contain some
    >>uncensored words. I think these questions are improper to be in a
    >>corpus.
    >>
    >>3. Many questions are not grammatically correct or contain spelling
    >>errors. I personally think this is ok because the questions are from
    >>the real world. I don't know what other researchers think about this.
    >>
    >>4. Different researchers may have different expectations. For example,
    >>the original poster of this thread required why- and how- questions,
    >>other people have asked about statistical information on specific phrase
    >>groups. I would like to know if there are some common requirements
    >>from most or many researchers.
    >>
    >>5. After I do something to the question archive and make it public, I
    >>am thinking of updating the public question corpus from time to time.
    >>This will take more effort and I am not sure if I have enough energy to
    >>do it. I hope someone is willing to join me.
    >>
    >>I am waiting for your inputs. Especially if you are willing to do
    >>something for building the corpus, I am happy to work with you.
    >>
    >>Many thanks.
    >>
    >>Zhiping
    >>
    >>
    >
    >
    >

    -- 
    New! MSc and Short Courses in Lexical Computing and Lexicography
    http://www.itri.brighton.ac.uk/lexicom
    

    ====================================================
    Adam Kilgarriff
    Senior Research Fellow
    ITRI                     t: +44 (0)1273 642919
    University of Brighton   f: +44 (0)1273 642908
    Lewes Road               e: adam@itri.brighton.ac.uk
    Brighton BN2 0BL UK      http://www.itri.brighton.ac.uk/~Adam.Kilgarriff
    ====================================================


