Re: [Corpora-List] Punctuation

From: Nancy Ide (ide@cs.vassar.edu)
Date: Wed Jan 12 2005 - 00:22:07 MET

    The American National Corpus is being represented using an XML format
    in which the original formatting is preserved in attributes, so in
    general you should be able to determine where scare quotes were used.

    The ANC First Release of 11 million words is available from the
    Linguistic Data Consortium (ldc@ldc.upenn.edu) for $75 for research
    use. However, within a couple of months a second release of approx. 20
    million words, which includes the 11 million words of the First
    Release, will be available. The First Release data included in the
    second release will be much "cleaner", with many errors fixed.

    Also, very soon (within a month) Mark Davies' web-based search and
    retrieval software for the BNC will handle the ANC First Release.
    The URL for his software is http://view.byu.edu.

    Nancy Ide

    On Jan 11, 2005, at 11:56 AM, Eric Atwell wrote:

    > Tim,
    > most English corpora since the pioneering Brown and LOB in the 1960s
    > have included punctuation, so any of these might do.
    > The British National Corpus from the 1990s has the advantage of a
    > www-based trial search, you can "try before you buy" at
    > http://sara.natcorp.ox.ac.uk/lookup.html
    >
    > For example I tried search term {'|"}
    > - regular expression finding all occurrences of ' or "
    > (usage depends on original sources so there is no corpus-wide
    > standardised punctuation)
    >
    > I'm not sure how to identify all and only scare quotes via such regular
    > expressions... good luck!
    >
    > Eric Atwell, School of Computing, Leeds University
    >
    >
    > On Tue, 11 Jan 2005, Grant, T. wrote:
    >
    >> I'm looking for a freely accessible English language corpus which
    >> allows analysis of punctuation marks - I'm interested for example in
    >> examining the use of scare quotes.
    >>
    >> Any ideas gratefully received.
    >>
    >> Tim
    >>
    >> ______________________________________
    >> Tim Grant
    >> Forensic Section - School of Psychology
    >> University of Leicester
    >> 106 New Walk
    >> Leicester LE1 7EA
    >> UK
    >>
    >> TG21@leicester.ac.uk
    >> http://www.le.ac.uk/psychology/tg21/
    >>
    >> + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) -
    >> + 44(0)116 252 3994 (Fax)
    >>
    >>
    >>
    >
    > --
    > Eric Atwell, Senior Lecturer, Computer Vision and Language research
    > group,
    > School of Computing, University of Leeds, LEEDS LS2 9JT, England
    > TEL: +44-113-2335430 FAX: +44-113-2335468
    > http://www.comp.leeds.ac.uk/eric
    >
    >
    =======================================================

    Nancy Ide

    Professor of Computer Science
    Vassar College
    Poughkeepsie, NY 12604-0520 USA
    Tel: +1 845 437-5988 Fax: +1 845 437-7498
    ide@cs.vassar.edu

    Chercheur Associe
    Equipe Langue et Dialogue, LORIA/CNRS
    Campus Scientifique - BP 239
    54506 Vandoeuvre-les-Nancy FRANCE
    Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
    ide@loria.fr

    =======================================================
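
    As a rough illustration of the regular-expression approach discussed
    in the quoted message, the sketch below goes one step beyond matching
    every ' or ": it keeps only short quoted spans (one to three words),
    on the assumption that scare quotes usually wrap a word or brief
    phrase while reported speech runs longer. The three-word cutoff, the
    function name and the pattern are illustrative assumptions, not
    anything taken from the BNC or ANC tools, and the heuristic will
    still return short direct quotations and titles.

```python
import re

# Assumed heuristic: a quoted span of one to three words, in straight or
# curly quotes. The quote marks must not be flanked by letters or digits,
# which skips apostrophes such as "don't" and "Davies'".
CANDIDATE = re.compile(
    r'''(?<!\w)(?P<open>["'\u201c\u2018])'''
    r'''(?P<span>\w[\w-]*(?:\s\w[\w-]*){0,2})'''
    r'''(?P<close>["'\u201d\u2019])(?!\w)'''
)

# Which closing mark belongs to which opening mark.
PAIRS = {'"': '"', "'": "'", "\u201c": "\u201d", "\u2018": "\u2019"}


def scare_quote_candidates(text):
    """Return short quoted spans whose closing mark matches the opening mark."""
    hits = []
    for m in CANDIDATE.finditer(text):
        if PAIRS[m.group("open")] == m.group("close"):
            hits.append(m.group("span"))
    return hits


if __name__ == "__main__":
    sample = 'The data will be much "cleaner" and you can "try before you buy".'
    print(scare_quote_candidates(sample))
```

    Anything this returns is only a candidate; deciding whether a given
    span really is a scare quote still needs a human reader or further
    context.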


