Re: [Corpora-List] Corpus Sanitation - no

From: Christoph Neumann (neumann@nova.co.jp)
Date: Mon Dec 02 2002 - 04:09:08 MET


    As a mostly quiet observer of the Corpora discussions, I was really shocked by the underlying tendencies in the recent threads.

    Mr. Zheng is using "uncensored" with the connotation "bad, undesirable", (inadvertently?) implying that the opposite, "censored", is something desirable. In fact, my immediate reaction to this thread's subject itself was a cold sweat. "Corpus" + "sanitation": the word "sanitation" in the meaning "clean-up of unwanted members that don't fit in with current standards" was last used as a euphemism for the killing of disabled and homosexual people in Nazi Germany ("Volkshygiene").

    I hope that we are never going to be politically, sexually, religiously "correct", but only scientifically correct and adequate.

    Should MT systems, for instance, refuse to translate sentences like "Let's blow up the imperialist WTC of devil America", "I think God/Buddha/Allah is an asshole" or "Are there any nice swinger clubs in this town?"? Will we have "parent-guided" MT sponsored by Disney, or party-conformant IR endorsed by the Chinese CP? Please, no.

    The fact that the lingua franca of the linguistics and NLP community is the language of the English-speaking countries does not imply that our scientific standards are to be adapted to the doubtful ethical "standards" of Anglo-American society, or to any other system of values or beliefs.

    >>>
    >>> 2. Some questions, actually not a small number, contain some
    >>> uncensored words. I think these questions are improper to be in a
    >>> corpus.
    >>

    -- 
    Dr. Christoph Neumann 		neumann@crosslanguage.co.jp
    R&D MT, CrossLanguage KK
    Tokyo, Japan
    http://www.crosslanguage.co.jp/english/index.html
    


