Re: [Corpora-List] Punctuation

From: Jane A. Edwards (edwards@ICSI.Berkeley.EDU)
Date: Fri Jan 14 2005 - 00:54:02 MET

Next message: Priscilla Rasmussen: "[Corpora-List] UK: Second ACL-SIGSEM Workshop on The Language Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications--Deadline Extension"

Previous message: Santos Diana: "[Corpora-List] HAREM: Call for participation"
Maybe in reply to: Grant, T.: "[Corpora-List] Punctuation"
Next in thread: Keith Suderman: "Re: [Corpora-List] Punctuation"
Reply: Keith Suderman: "Re: [Corpora-List] Punctuation"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello Keith,

Oops! The fact that you are addressing me personally makes me think
my posting may have been construed as critical of ANC. Not intended.
It was only by coincidence that I leapt into the flow after Nancy Ide.

My intent was to focus on the interpretation of punctuation by users of
any corpus, and on two references which I found interesting with
reference to punctuation in general (though not scare quotes per se).

Thank you very much for your posting, though, as I had not known
of those great properties of ANC:
- inclusion of written, spoken, AND written to be spoken;
- move toward compatibility with Xaira as search tool;
- preserving the integrity of the data source's punctuation conventions.
Wonderful!

> I hope this helps,
> Keith

Thanks,

-Jane

        From suderman@cs.vassar.edu Thu Jan 13 09:59:34 2005
        Date: Thu, 13 Jan 2005 13:02:09 -0500
        From: Keith Suderman <suderman@cs.vassar.edu>
        Subject: Re: [Corpora-List] Punctuation
        X-Sender: suderman@pop.cs.vassar.edu (Unverified)
        To: "Jane A. Edwards" <edwards@ICSI.Berkeley.EDU>, corpora@lists.uib.no,
                ide@cs.vassar.edu
        Cc: suderman@cs.vassar.edu, tg21@leicester.ac.uk
        MIME-version: 1.0
        Content-transfer-encoding: 7BIT

Hello Jane,

        All of the texts in the ANC are classified as written, spoken, or written
        to be spoken. The texts are also classified in other ways (domain,
        audience, etc.), so depending on the sophistication of the search tool you
        are using it is possible to query any subset of the corpus you wish. While
        we do not supply our own search tool, any tool that works with an XML or
        TEI compliant corpus should also work with the ANC. In particular, we are
        working closely with the folks at the BNC to ensure that the ANC is
        compatible with their search tool Xaira
        (http://www.oucs.ox.ac.uk/rts/xaira/index.xml).

> Also, the number of punctuation marks which are used and which ones
> in particular can have a large impact on the "meaning" of any particular
> one of them.

        Whenever possible we have left the punctuation as is. For example, various
        texts may represent double quotes with the double quote character, two
        single quotes, two back ticks (``) followed by two single quotes, or some
        other ISO character that looks similar to the double quote character. Our
        goal is to provide the raw data, as we receive it, and let more informed
        people make the "strategic choices".
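
For readers who do want to make such a choice at analysis time, the sketch below shows one way to fold the quote variants listed above into a single character while leaving the raw files untouched. It is only an illustration, not part of the ANC distribution: the function name `normalize_double_quotes` is hypothetical, and the typographic quotes in the pattern are an assumed reading of "some other ISO character that looks similar".

```python
import re

# Variants named in the message: two back ticks and two single quotes.
# The typographic quotes (U+201C, U+201D, U+201E) are an assumption about
# which "similar-looking" characters a source text might contain.
_DOUBLE_QUOTE_VARIANTS = re.compile(r"``|''|[\u201c\u201d\u201e]")


def normalize_double_quotes(text: str) -> str:
    """Return a copy of text with double-quote variants mapped to '"'.

    The input is not modified, so the original punctuation is preserved.
    """
    return _DOUBLE_QUOTE_VARIANTS.sub('"', text)


sample = "He said ``hello'' and ''goodbye''."
print(normalize_double_quotes(sample))  # He said "hello" and "goodbye".
```

Note that this treats every pair of adjacent single quotes as a double quote, which may be wrong for some texts (for example, a plural possessive directly followed by a closing single quote), so the result should be checked against the source conventions.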

        Unfortunately, you will find punctuation in some of our spoken texts. This
        is because those texts already had the punctuation when we received the
        files and removing it would just be another manipulation of the data.

I hope this helps,
Keith

At 05:54 PM 1/11/2005 -0800, you wrote:
>One cautionary note (though perhaps it is obvious):
>the clearest cases for punctuation analysis will be those drawn
>from *written* language corpora (e.g., Brown and LOB).
>Although spoken language corpora contain punctuation marks, these do
>not necessarily follow the conventions of written language, but rather
>are sometimes strategic choices for encoding prosody to some extent
>within the constraints of standard keyboards (i.e., without resorting
>to special characters).
>
>Also, the number of punctuation marks which are used and which ones
>in particular can have a large impact on the "meaning" of any particular
>one of them. (I've written on this elsewhere if of interest.)
>This point is of course partly related to Eric Atwell's point:
> > (usage depends on original sources so there is no corpus-wide
> > standardised punctuation)
>which is also important.
>
>I can't resist mentioning two important works for background lit:
>1) Quirk, et al. A Comprehensive grammar of the English language.
> London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
>Everyone knows this source, but I think the sections on punctuation
>are not given nearly the attention they deserve.
>2) For punctuation in historical context, I would also recommend
>the following:
> Parkes, M. B. (Malcolm Beckwith)
> Pause and effect : an introduction to the history of punctuation
> in the West / M.B. Parkes.
> Berkeley : University of California Press, c1993.
>
>Parkes is often overlooked, but is fascinating, and full of plates
>which go all the way back to ancient texts (Greek and Latin).
>He makes a very strong point to the effect that punctuation has served
>very different functions at different points in time, depending on
>the nature of the audience for which the text was put into writing.
>In Ancient Greece, one important use of writing was to preserve
>spoken language and help students become better orators.
>The claim is made that people didn't read silently until much later.
>
>Another point perhaps of interest: the amount of punctuation
>in the Bible varied greatly from one era to the next depending on the
>intended readership. When it was a homogeneous readership (native
>speaking monks), there was less punctuation; later on, when it
>was a more heterogeneous readership in far-flung countries, there
>tended to be more punctuation per page.
>
>-Jane Edwards

        --------------------------------------------------
        Keith Suderman
        Technical Specialist
        American National Corpus
        suderman@cs.vassar.edu
        http://americannationalcorpus.org

This archive was generated by hypermail 2b29 : Fri Jan 14 2005 - 01:02:46 MET