Re: [Corpora-List] Punctuation

From: Jane A. Edwards (edwards@ICSI.Berkeley.EDU)
Date: Fri Jan 14 2005 - 00:54:02 MET


    Hello Keith,

    Oops! The fact you are addressing me personally makes me think
    my posting may have been construed as critical of ANC. Not intended.
    It was only by coincidence that I leapt into the flow after Nancy Ide.

    My intent was to focus on the interpretation of punctuation by users of
    any corpus, and on two references which I found interesting with
    reference to punctuation in general (though not scare quotes per se).

    Thank you very much for your posting, though, as I had not known
    of those great properties of ANC:
    - inclusion of written, spoken, AND written to be spoken;
    - move toward compatibility with Xaira as search tool;
    - preserving the integrity of the data source's punctuation conventions.
    Wonderful!

    > I hope this helps,
    > Keith

    Thanks,

    -Jane

            From suderman@cs.vassar.edu Thu Jan 13 09:59:34 2005
            Date: Thu, 13 Jan 2005 13:02:09 -0500
            From: Keith Suderman <suderman@cs.vassar.edu>
            Subject: Re: [Corpora-List] Punctuation
            X-Sender: suderman@pop.cs.vassar.edu (Unverified)
            To: "Jane A. Edwards" <edwards@ICSI.Berkeley.EDU>, corpora@lists.uib.no,
                    ide@cs.vassar.edu
            Cc: suderman@cs.vassar.edu, tg21@leicester.ac.uk

            Hello Jane,

            All of the texts in the ANC are classified as written, spoken, or written
            to be spoken. The texts are also classified in other ways (domain,
            audience, etc.), so depending on the sophistication of the search tool you
            are using it is possible to query any subset of the corpus you wish. While
            we do not supply our own search tool, any tool that works with an XML or
            TEI compliant corpus should also work with the ANC. In particular, we are
            working closely with the folks at the BNC to ensure that the ANC is
            compatible with their search tool Xaira
            (http://www.oucs.ox.ac.uk/rts/xaira/index.xml).

    > Also, the number of punctuation marks which are used and which ones
    > in particular can have a large impact on the "meaning" of any particular
    > one of them.

            Whenever possible we have left the punctuation as is. For example, various
            texts may represent double quotes with the double quote character, two
            single quotes, two back ticks (``) followed by two single quotes, or some
            other ISO character that looks similar to the double quote character. Our
            goal is to provide the raw data, as we receive it, and let more informed
            people make the "strategic choices".
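
A minimal sketch of how a corpus user might locate the double-quote variants described above without altering the raw text. The variant inventory and the names QUOTE_VARIANTS and find_quote_marks are assumptions made for illustration; they are not part of the ANC or of any ANC tool, and the curly quotes merely stand in for "some other character that looks similar".

```python
import re

# Hypothetical inventory of double-quote encodings: the three named in the
# message, plus curly quotes as one example of a look-alike character.
QUOTE_VARIANTS = [
    ("backticks_open", "``"),
    ("two_single", "''"),
    ("ascii_double", '"'),
    ("curly_open", "\u201c"),
    ("curly_close", "\u201d"),
]

# One named group per variant, so each match reports which encoding it is.
_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{re.escape(tok)})" for name, tok in QUOTE_VARIANTS)
)


def find_quote_marks(text):
    """Return (offset, variant_name, matched_text) tuples; text is not modified."""
    return [(m.start(), m.lastgroup, m.group()) for m in _PATTERN.finditer(text)]


sample = "``raw data'' and \"strategic choices\" and \u201cquoted\u201d"
for offset, name, mark in find_quote_marks(sample):
    print(offset, name, repr(mark))
```

Reporting the variants rather than rewriting them leaves the choice of how to treat each one to the person doing the analysis.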

            Unfortunately, you will find punctuation in some of our spoken texts. This
            is because those texts already had the punctuation when we received the
            files and removing it would just be another manipulation of the data.

            I hope this helps,
            Keith

            At 05:54 PM 1/11/2005 -0800, you wrote:
    >One cautionary note (though perhaps it is obvious):
    >the clearest cases for punctuation analysis will be those drawn
    >from *written* language corpora (e.g., Brown and LOB).
    >Although spoken language corpora contain punctuation marks, these do
    >not necessarily follow the conventions of written language, but rather
    >are sometimes strategic choices for encoding prosody to some extent
    >within the constraints of standard keyboards (i.e., without resorting
    >to special characters).
    >
    >Also, the number of punctuation marks which are used and which ones
    >in particular can have a large impact on the "meaning" of any particular
    >one of them. (I've written on this elsewhere if of interest.)
    >This point is of course partly related to Eric Atwell's point:
    > > (usage depends on original sources so there is no corpus-wide
    > > standardised punctuation)
    >which is also important.
    >
    >I can't resist mentioning two important works for background lit:
    >1) Quirk, et al. A Comprehensive grammar of the English language.
    > London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
    >Everyone knows this source, but I think the sections on punctuation
    >are not given nearly the attention they deserve.
    >2) For punctuation in historical context, I would also recommend
    >the following:
    > Parkes, M. B. (Malcolm Beckwith)
    > Pause and effect : an introduction to the history of punctuation
    > in the West / M.B. Parkes.
    > Berkeley : University of California Press, c1993.
    >
    >Parkes is often overlooked, but is fascinating, and full of plates
    >which go all the way back to ancient texts (Greek and Latin).
    >He makes a very strong point to the effect that punctuation has served
    >very different functions at different points in time, depending on
    >the nature of the audience for which the text was put into writing.
    >In Ancient Greece, one important use of writing was to preserve
    >spoken language and help students become better orators.
    >The claim is made that people didn't read silently until much later.
    >
    >Another point perhaps of interest: the amount of punctuation
    >in the Bible varied greatly from one era to the next depending on the
    >intended readership. When it was a homogeneous readership (native
    >speaking monks), there was less punctuation; later on, when it
    >was a more heterogeneous readership in far-flung countries, there
    >tended to be more punctuation per page.
    >
    >-Jane Edwards

            --------------------------------------------------
            Keith Suderman
            Technical Specialist
            American National Corpus
            suderman@cs.vassar.edu
            http://americannationalcorpus.org


