Re: [Corpora-List] Punctuation

From: Jane A. Edwards (edwards@ICSI.Berkeley.EDU)
Date: Wed Jan 12 2005 - 02:54:11 MET


    One cautionary note (though perhaps it is obvious):
    the clearest cases for punctuation analysis will be those drawn
    from *written* language corpora (e.g., Brown and LOB).
    Although spoken language corpora contain punctuation marks, these do
    not necessarily follow the conventions of written language, but rather
    are sometimes strategic choices for encoding prosody to some extent
    within the constraints of standard keyboards (i.e., without resorting
    to special characters).

    Also, how many punctuation marks are used in a corpus, and which ones
    in particular, can have a large impact on the "meaning" of any particular
    one of them. (I've written on this elsewhere, if of interest.)
    This point is of course partly related to Eric Atwell's point:
    > (usage depends on original sources so there is no corpus-wide
    > standardised punctuation)
    which is also important.

    I can't resist mentioning two important works for background lit:
    1) Quirk, et al. A Comprehensive grammar of the English language.
         London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
    Everyone knows this source, but I think the sections on punctuation
    are not given nearly the attention they deserve.
    2) For punctuation in historical context, I would also recommend
    the following:
         Parkes, M. B. (Malcolm Beckwith)
           Pause and effect : an introduction to the history of punctuation
             in the West / M.B. Parkes.
           Berkeley : University of California Press, c1993.

    Parkes is often overlooked, but is fascinating, and full of plates
    which go all the way back to ancient texts (Greek and Latin).
    He makes a very strong point to the effect that punctuation has served
    very different functions at different points in time, depending on
    the nature of the audience for which the text was put into writing.
    In Ancient Greece, one important use of writing was to preserve
    spoken language and help students become better orators.
    The claim is made that people didn't read silently until much later.

    Another point perhaps of interest: the amount of punctuation
    in the Bible varied greatly from one era to the next depending on the
    intended readership. When it was a homogeneous readership (native
    speaking monks), there was less punctuation; later on, when it
    was a more heterogeneous readership in far-flung countries, there
    tended to be more punctuation per page.

    -Jane Edwards

            From owner-corpora@lists.uib.no Tue Jan 11 15:48:57 2005
            Cc: "Grant, T." <tg21@leicester.ac.uk>, Nancy Ide <ide@cs.vassar.edu>,
                    Keith Suderman <suderman@cs.vassar.edu>
            From: Nancy Ide <ide@cs.vassar.edu>
            Subject: Re: [Corpora-List] Punctuation
            Date: Tue, 11 Jan 2005 18:22:07 -0500
            To: corpora@lists.uib.no

            The American National Corpus is being represented using an XML format
            in which the original formatting is preserved in attributes, so in
            general you should be able to determine where scare quotes were used.

            The ANC First Release of 11 million words is available from the
            Linguistic Data Consortium (ldc@ldc.upenn.edu) for $75 for research
            use. However, within a couple of months a second release of approx. 20
            million words, which includes the 11 million words of the First
            release, will be available. The 1st release data included in the 2nd
            release will be much "cleaner" and many errors will have been fixed.

            Also, very soon (within a month), Mark Davies' web-based search and
            retrieval software for the BNC will handle the ANC 1st release as well.
            The URL for his software is http://view.byu.edu.

            Nancy Ide

            On Jan 11, 2005, at 11:56 AM, Eric Atwell wrote:

    > Tim,
    > most English corpora since the pioneering Brown and LOB in the 1960s have
    > included punctuation, so any of these might do.
    > The British National Corpus from the 1990s has the advantage of www-based
    > trial search, you can "try before you buy" at
    > http://sara.natcorp.ox.ac.uk/lookup.html
    >
    > For example I tried search term {'|"}
    > - regular expression finding all occurrences of ' or "
    > (usage depends on original sources so there is no corpus-wide
    > standardised punctuation)
    >
    > I'm not sure how to identify all and only scare quotes via such regular
    > expressions... good luck!
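
A rough sketch of how the search above might be narrowed toward scare quotes, assuming plain-text input and Python's standard `re` module rather than the BNC lookup syntax. The heuristic is an assumption, not a method proposed in this thread: a quoted span of one to three words is treated as a candidate, while longer quoted spans are assumed to be reported speech. The names `CANDIDATE` and `candidate_scare_quotes` are hypothetical, and since quoting conventions vary by source, the results would still need checking by hand.

```python
import re

# Heuristic only (an assumption): a candidate scare quote is a quoted span
# of one to three words, delimited by the same quote character on both sides.
CANDIDATE = re.compile(
    r"""(?<![\w'"])                              # not preceded by a word character or a quote
        (?P<q>['"])                              # opening quote mark
        (?P<span>\w[\w-]*(?:\s\w[\w-]*){0,2})    # one to three words
        (?P=q)                                   # same quote mark closes the span
        (?!\w)                                   # not followed by a word character
    """,
    re.VERBOSE,
)


def candidate_scare_quotes(text):
    """Return the short quoted spans found in text, without the quote marks."""
    return [m.group("span") for m in CANDIDATE.finditer(text)]


sample = 'The data will be much "cleaner", he said: "we will go far away today".'
print(candidate_scare_quotes(sample))
```

The lookbehind keeps apostrophes inside words (as in "don't") from being read as opening quotes. It cannot tell a scare quote from a short title or a one-word citation, which is the difficulty noted above.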
    >
    > Eric Atwell, School of Computing, Leeds University
    >
    >
    > On Tue, 11 Jan 2005, Grant, T. wrote:
    >
    >> I'm looking for a freely accessible English language corpus which
    >> allows analysis of punctuation marks - I'm interested for example in
    >> examining the use of scare quotes.
    >>
    >> Any ideas gratefully received.
    >>
    >> Tim
    >>
    >> ______________________________________
    >> Tim Grant
    >> Forensic Section - School of Psychology
    >> University of Leicester
    >> 106 New Walk
    >> Leicester LE1 7EA
    >> UK
    >>
    >> TG21@leicester.ac.uk
    >> http://www.le.ac.uk/psychology/tg21/
    >>
    >> + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) -
    >> + 44(0)116 252 3994 (Fax)
    >>
    >>
    >>
    >
    > --
    > Eric Atwell, Senior Lecturer, Computer Vision and Language research
    > group,
    > School of Computing, University of Leeds, LEEDS LS2 9JT, England
    > TEL: +44-113-2335430 FAX: +44-113-2335468
    > http://www.comp.leeds.ac.uk/eric
    >
    >
            =======================================================

            Nancy Ide

            Professor of Computer Science
            Vassar College
            Poughkeepsie, NY 12604-0520 USA
            Tel: +1 845 437-5988 Fax: +1 845 437-7498
            ide@cs.vassar.edu

            Chercheur Associe
            Equipe Langue et Dialogue, LORIA/CNRS
            Campus Scientifique - BP 239
            54506 Vandoeuvre-les-Nancy FRANCE
            Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
            ide@loria.fr

            =======================================================


