Re: [Corpora-List] Legal aspects of compiling corpora

From: Doug Cooper (doug@th.net)
Date: Thu Jun 19 2003 - 10:49:03 MET DST

    Dear Corporeans:

    For the record, here is my attempt at drafting a basic statement
    of professional practice in regard to using text corpora. It
    describes a well-understood (and I hope easily defended)
    subset of corpus applications, in service of setting up -- but
    not asserting the conclusion of -- the following syllogism:

     - some research use of copyrighted texts is protected by law;
     - here are some ways we use copyrighted texts for research;
     - perhaps our research uses are also protected.

      Please take this in the spirit offered -- as an example of what
    a reasonably framed position might look like, and not as an
    ideological stance ;-). I hope it may encourage the articulation
    of alternative statements, and/or discussion of the propriety of
    taking a position at all.

      Also, given its length, for which I apologize, _please_ don't
    automatically include the whole thing in responses.

      The texts of the relevant parts of the US copyright law and
    the Berne Convention, and a few of my own comments, follow.

    -----------------------------------------------
         A Statement on Research Use of Generic Text Corpora

       This statement on professional practice is intended to help
       researchers act in good faith observance of established
       research practice when assembling or using generic text
       corpora that may include copyrighted materials.

       The statement does not claim that such usage is necessarily
       entitled to a "fair use" or "fair practice" exemption; only that
       the methodology it describes is bona fide research practice.
                       ____________________

    A variety of scientific and educational disciplines rely on
    studying, or extracting samples from, large bodies of text --
    corpora. We will refer to these as 'generic text corpora'
    in order to distinguish them from more specific collections
    of particular authors or factual genres (e.g. legal decisions).

      Generic corpora almost invariably include copyrighted texts.
    However, because of the "fair use" or "fair practice" rights
    granted by typical copyright laws (US law and the Berne
    Convention, respectively) the inclusion of copyrighted
    texts in generic corpora does not necessarily entail any
    copyright violation.

      The exact line between copyright protection and fair
    use/practice rights is intentionally vague. Neither a claim
    of copyright, nor of fair use/practice exception, automatically
    trumps the other.

      But while we cannot fix an explicit definition of what research
    applications will always qualify as fair use/practice, we can
    clearly state that certain kinds of use are bona fide research
    practices.

      By definition, generic text corpora are not of interest as
    either literary or factual works. Rather, they are inspected
    for one of two basic reasons:

     - to investigate text properties through statistical analysis;
     - to extract and cite small examples, typically < 100
       contiguous characters, that elucidate word or phrase
       syntax, semantics, or other lexical features.

      In the first case, text is not necessarily returned at all; rather,
    we return overviews of various text properties. If the
    underlying text is revealed, it is only in a purely factual
    manner; e.g. in lists of word or phrase frequency counts.
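
      As a minimal sketch of this first kind of use, the Python
    fragment below reduces a text to a list of word frequency
    counts. The function name, the tokenizing pattern, and the
    sample sentence are illustrative assumptions, not part of any
    particular corpus toolkit.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent lowercased word forms with counts.

    Only (word, count) pairs are returned; the running text itself is not.
    """
    # Assumed tokenization: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


# Hypothetical sample sentence, used only for illustration.
sample = "the cat sat on the mat and the dog sat too"
print(word_frequencies(sample, top_n=2))
```

      The result is an overview of the text's properties; the
    original word order cannot be recovered from it.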

      In the second, the researcher or student is only interested
    in some factual aspect - typically syntax or semantics - of
    this particular arrangement of words; e.g. in the citation:

     'single man in possession of a good =>fortune<= must be in want of a wife'

    it is a human's ability to understand the semantics of
    "fortune" that is of interest, rather than the literary or social
    commentary of the context.
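
      A citation of this kind could be produced along the lines of
    the sketch below, which returns each occurrence of a keyword
    with a bounded amount of surrounding context. The function
    name, the 100-character default, and the marker style are
    assumptions chosen to mirror the example above.

```python
import re


def kwic(text, keyword, max_chars=100):
    """Return keyword-in-context snippets of at most max_chars characters."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        hit = match.group()
        # Reserve 4 characters for the "=>" and "<=" markers, then split
        # the remaining budget evenly between left and right context.
        side = max(0, (max_chars - len(hit) - 4) // 2)
        left = text[max(0, match.start() - side):match.start()]
        right = text[match.end():match.end() + side]
        snippets.append(f"{left}=>{hit}<={right}")
    return snippets


# Opening sentence of Pride and Prejudice (public domain).
sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")
for line in kwic(sample, "fortune", max_chars=60):
    print(line)
```

      The context window is cut by character count, so a snippet may
    begin or end in mid-word; the hard limit on its length is the
    property that matters here.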

      In either case, the contents of a generic text corpus are
    not, and cannot be, read as ordinary texts in the course of
    research use. Moreover, making the contents of a generic
    text corpus available in a manner that _might_ let its contents
    be reconstructed as literary or factual works is not a typical
    research application for generic text corpora.

                           END OF STATEMENT
    ----------------------------------------------------------
    Berne Convention Article 10

    (1) It shall be permissible to make quotations from a work which
    has already been lawfully made available to the public, provided
    that their making is compatible with fair practice, and their
    extent does not exceed that justified by the purpose, including
    quotations from newspaper articles and periodicals in the form of
    press summaries.

    (2) It shall be a matter for legislation in the countries of the
    Union, and for special agreements existing or to be concluded
    between them, to permit the utilization, to the extent justified
    by the purpose, of literary or artistic works by way of illustration
    in publications, broadcasts or sound or visual recordings for
    teaching, provided such utilization is compatible with fair practice.

    (3) Where use is made of works in accordance with the preceding
    paragraphs of this Article, mention shall be made of the
    source, and of the name of the author, if it appears thereon.
    ===========
    US Copyright Law Section 107 Limitations on Exclusive Rights: Fair use

      Notwithstanding the provisions of sections 106 and 106A, the
    fair use of a copyrighted work, including such use by reproduction
    in copies or phonorecords or by any other means specified by that
    section, for purposes such as criticism, comment, news reporting,
    teaching (including multiple copies for classroom use), scholarship,
    or research, is not an infringement of copyright.

      In determining whether the use made of a work in any particular
    case is a fair use the factors to be considered shall include -
     (1) the purpose and character of the use, including whether such use
      is of a commercial nature or is for nonprofit educational purposes;
     (2) the nature of the copyrighted work;
     (3) the amount and substantiality of the portion used in relation
      to the copyrighted work as a whole; and
     (4) the effect of the use upon the potential market for or
      value of the copyrighted work.
    ---------------------------------------------------------
    Comment from Doug Cooper:

    Now, beating around the bush aside, it seems to me that any
    common-sense reading of the relevant sections of the US or
    Berne copyright regulations makes it clear that providing on-line
    access to generic text corpora is protected.

      While the US law is more explicit, a survey of EU laws notes
    that in general, 'fair practice' means copying for personal,
    scientific, educational, or other private use, etc. [Eisenschitz, T.
    and P. Turner. 1997. _Rights and Responsibilities in the Digital
    Age: Problems with Stronger Copyright in an Information Society_.
    Journal of Information Science, 23(3):209-223.]
    NB - I couldn't find the article on-line; however, it appears to
    be the canonical citation.

      The key point under US law is that _all_ the factors listed
    in Section 107 must be taken into account. Moreover, it appears
    to be consistently the case that the _possibility_ of copyright
    violation is also only one factor, and may be outweighed by
    legitimate fair use applications.

      IMHO, the scorecard for an on-line generic text corpus
    used as described above would be:

    1. 'purpose and character' - non-commercial research use
       that in general requires transforming (and cannot supersede)
       the original work.
    2. 'nature of copyrighted work' - mixed.
    3. 'amount and substantiality' - minuscule.
    4. 'effect of use on market' - nil; it cannot supersede the
       work, and the financial rewards offered by the 'inclusion
       in generic text corpora' market are presumably zilch.

      As far as I can tell -- and I have gone to the CNI-CopyRight
    and BookPeople mailing lists seeking alternative views to no
    avail -- simply putting a copyrighted work into a black box
    isn't the issue. Rather, it's the use to which we put that black
    box; i.e. the bits of copyrighted text that can be downloaded.

      In closing, I am concerned by suggestions that it is necessary
    or even advisable to obtain permissions, and possibly pay
    compensation, before using texts in the generic manner
    described above. While this may be a consistent position
    for corpus developers who are also publishers, it may
    unnecessarily discourage researchers in other environments.

      Best,
      Doug Cooper


