Re: [Corpora-List] Legal aspects of compiling corpora

From: Jason Eisner (jason@cs.jhu.edu)
Date: Sat Jun 14 2003 - 21:03:36 MET DST

    Larry Spitz writes:

    > Aside from the legal aspect of collecting text are the legal aspects of
    > collecting scanned images of documents. For those of us who are interested
    > in the analysis of document images obtaining databases of images is quite
    > difficult, particularly generally available databases where the results of
    > individual research can be compared.
    >
    > Since the University of Washington and the University of Nevada, Las Vegas
    > have stopped publishing such databases, I do not know of anyone who is in
    > the process of doing so.

    Larry,

    The ACL Anthology at http://www.aclweb.org/anthology is such a
    database, containing about 44,000 pages so far. It is a fairly
    comprehensive archive of articles from the major computational
    linguistics conferences, journals, and workshops since they began in
    1979. Choose the US mirror to get the most up-to-date version.

    The anthology's editors may wish to jump in and correct me here, but I
    believe that all of the 20th-century papers were scanned in
    physically, as no electronic proceedings were available. The scans
    were done recently and are of high quality. The documents are
    provided as PDF image files that also seem to contain an OCR'd copy of
    the text, allowing the text to be highlighted and searched. The OCR
    has occasional mistakes, particularly on formulas, but generally seems
    excellent.

    > one of the real problems is getting copyright permission on document images.

    The notice on the anthology says:

      COPYRIGHT: These materials are Copyright (C) 1979-2003
      ACL. Permission is granted to make copies for the purposes of
      teaching and research.

    Also note:

      The ACL requests your help to support this effort financially. The
      total cost of digitizing past publications will be approximately
      $50,000. All other activities associated with the project are being
      done with free labor. All the resulting materials will be available
      for free on the web.

    Cheers,
    Jason Eisner
    Johns Hopkins University


