Re: [Corpora-List] Punctuation

From: Keith Suderman (suderman@cs.vassar.edu)
Date: Thu Jan 13 2005 - 19:02:09 MET

    Hello Jane,

    All of the texts in the ANC are classified as written, spoken, or written
    to be spoken. The texts are also classified in other ways (domain,
    audience, etc.), so depending on the sophistication of the search tool you
    are using, it is possible to query any subset of the corpus you wish. While
    we do not supply our own search tool, any tool that works with an XML- or
    TEI-compliant corpus should also work with the ANC. In particular, we are
    working closely with the folks at the BNC to ensure that the ANC is
    compatible with their search tool, Xaira
    (http://www.oucs.ox.ac.uk/rts/xaira/index.xml).
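As a rough illustration of selecting a subset by classification, here is a minimal Python sketch. The catalogue layout, the `doc` element, and the `mode` and `domain` attribute names are hypothetical, invented for this example; they are not the ANC's actual header schema, and `select_docs` is not part of any ANC or Xaira tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue: element and attribute names are invented for
# illustration and do not reflect the real ANC header format.
SAMPLE = """
<corpus>
  <doc id="d1" mode="written" domain="fiction"/>
  <doc id="d2" mode="spoken" domain="telephone"/>
  <doc id="d3" mode="written-to-be-spoken" domain="speech"/>
  <doc id="d4" mode="written" domain="travel"/>
</corpus>
"""


def select_docs(xml_text, **criteria):
    """Return the ids of <doc> elements whose attributes match every criterion."""
    root = ET.fromstring(xml_text.strip())
    return [
        doc.get("id")
        for doc in root.iter("doc")
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    # Restrict a punctuation study to written texts only.
    print(select_docs(SAMPLE, mode="written"))
    # Combine classifications to narrow the subset further.
    print(select_docs(SAMPLE, mode="spoken", domain="telephone"))
```

Passing the classifications as keyword arguments keeps the filter independent of how many categories a corpus header happens to record.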

    > Also, the number of punctuation marks which are used and which ones
    > in particular can have a large impact on the "meaning" of any particular
    > one of them.

    Whenever possible we have left the punctuation as is. For example, various
    texts may represent double quotes with the double quote character, two
    single quotes, two back ticks (``) followed by two single quotes, or some
    other ISO character that looks similar to the double quote character. Our
    goal is to provide the raw data, as we receive it, and let more informed
    people make the "strategic choices".
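For anyone who does want to collapse these variants before counting punctuation, the following Python sketch shows one possible downstream normalization applied to a copy of the text. The set of look-alike characters is an assumption made here (curly quotes, a low quote, and guillemets); the message does not list them, and `normalize_double_quotes` is a hypothetical helper, not something the ANC supplies.

```python
import re

# Variants named in the message (two back ticks, two single quotes), plus a
# few Unicode look-alikes that are an assumption of this sketch.
_DOUBLE_QUOTE_VARIANTS = re.compile(
    "``|''|[\u201c\u201d\u201e\u00ab\u00bb]"
)


def normalize_double_quotes(text):
    """Map each double-quote variant to the plain ASCII double quote."""
    return _DOUBLE_QUOTE_VARIANTS.sub('"', text)


if __name__ == "__main__":
    raw = "She said ``wait'' and then \u201cgo\u201d."
    print(normalize_double_quotes(raw))
```

A single apostrophe, as in "don't", is left untouched because only a pair of adjacent single quotes is treated as a double quote. Whether that choice is right for a given study is exactly the kind of decision the raw data leaves open.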

    Unfortunately, you will find punctuation in some of our spoken texts. This
    is because those texts already had the punctuation when we received the
    files and removing it would just be another manipulation of the data.

    I hope this helps,
    Keith

    At 05:54 PM 1/11/2005 -0800, you wrote:
    >One cautionary note (though perhaps it is obvious):
    >the clearest cases for punctuation analysis will be those drawn
    >from *written* language corpora (e.g., Brown and LOB).
    >Although spoken language corpora contain punctuation marks, these do
    >not necessarily follow the conventions of written language, but rather
    >are sometimes strategic choices for encoding prosody to some extent
    >within the constraints of standard keyboards (i.e., without resorting
    >to special characters).
    >
    >Also, the number of punctuation marks which are used and which ones
    >in particular can have a large impact on the "meaning" of any particular
    >one of them. (I've written on this elsewhere if of interest.)
    >This point is of course partly related to Eric Atwell's point:
    > > (usage depends on original sources so there is no corpus-wide
    > > standardised punctuation)
    >which is also important.
    >
    >I can't resist mentioning two important works for background lit:
    >1) Quirk, et al. A Comprehensive Grammar of the English Language.
    > London ; New York : Longman, 1985. x, 1779 p. : 26 cm.
    >Everyone knows this source, but I think the sections on punctuation
    >are not given nearly the attention they deserve.
    >2) For punctuation in historical context, I would also recommend
    >the following:
    > Parkes, M. B. (Malcolm Beckwith)
    > Pause and effect : an introduction to the history of punctuation
    > in the West / M.B. Parkes.
    > Berkeley : University of California Press, c1993.
    >
    >Parkes is often overlooked, but is fascinating, and full of plates
    >which go all the way back to ancient texts (Greek and Latin).
    >He makes a very strong point to the effect that punctuation has served
    >very different functions at different points in time, depending on
    >the nature of the audience for which the text was put into writing.
    >In Ancient Greece, one important use of writing was to preserve
    >spoken language and help students become better orators.
    >The claim is made that people didn't read silently until much later.
    >
    >Another point perhaps of interest: the amount of punctuation
    >in the Bible varied greatly from one era to the next depending on the
    >intended readership. When it was a homogeneous readership (native
    >speaking monks), there was less punctuation; later on, when it
    >was a more heterogeneous readership in far-flung countries, there
    >tended to be more punctuation per page.
    >
    >-Jane Edwards

    --------------------------------------------------
    Keith Suderman
    Technical Specialist
    American National Corpus
    suderman@cs.vassar.edu
    http://americannationalcorpus.org


