RE: [Corpora-List] Legal aspects of compiling corpora

From: Mark Sanderson (m.sanderson@sheffield.ac.uk)
Date: Tue Jun 17 2003 - 16:51:51 MET DST

    Google does have pages outlining how to have content removed from its
    collections

             http://www.google.com/remove.html

    which towards the bottom mentions removal of images because of Digital
    Millennium Copyright Act problems.

    In searching around I also found this web page which seems to imply that
    people do want images removed

             http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=565

    Now I don't think this happens with text simply because old bits of ASCII
    aren't perceived to have as much value as images tend to have.

    I'm sure one of the reasons why people like TREC and others can negotiate
    copyright release deals to build corpora or test collections is that the
    owners don't perceive their data as having great value, and so they are
    willing to live with the risk of having the material copied illegally once
    they have released it.

    You'll notice there are very few image test collections with interesting
    content because IR people have struggled to find image owners willing to
    let their images go.

    So my feeling is that yes, collecting text may be illegal, but text is in
    general of so little value (compared to other media) that people are
    unlikely to sue you.

    At 08:54 17/06/2003 -0700, Mark Davies wrote:
    >When I was compiling the 100 million word Corpus del Español
    >(www.corpusdelespanol.org), I consulted two professors from the US who
    >are experts on copyright law, as applied to the Internet. I explained to
    >them that in my corpus, at least, end users wouldn't have access to
    >entire paragraphs of text, much less an entire text itself. Both were in
    >agreement that it would be quite unlikely that there would be any
    >copyright problems.
    >
    >What has me intrigued with search engines like Google, however, is their
    >"cached web page" functionality, in which they are in essence reproducing
    >an entire web page -- and all of the web pages of a given site (assuming
    >no use of robots.txt). It seems that this is much more than the limited
    >context that I (and others) make available in our corpora, and yet there
    >has been no legal challenge.
    >
    >On the other hand, both of the professors who I consulted mentioned that
    >it's still a very murky issue with little or no clearly defined legal
    >precedent -- at least in the US.
    >
    >Mark Davies
    >
    >=================================================
    >Mark Davies
    >Assoc. Prof., Spanish Linguistics
    >Illinois State University
    >http://mdavies.for.ilstu.edu/
    >
    >** Corpus design and use // Web-database scripting **
    >** Historical and dialectal Spanish and Portuguese syntax **
    >=================================================

    _________________________________________________________________________
    Mark Sanderson, Room 303 Tel: +44 (0) 114 22 22648
    Department of Information Studies Fax: +44 (0) 114 27 80300
    University of Sheffield, Regent Court, mailto:m.sanderson@shef.ac.uk
    211 Portobello St., Sheffield, S1 4DP, UK http://dis.shef.ac.uk/mark/
    _________________________________________________________________________
    Good judgement comes from experience, experience comes from bad judgement
