[Corpora-List] Punctuation follow up

From: Grant, T. (tg21@leicester.ac.uk)
Date: Wed Jan 12 2005 - 11:18:26 MET

    Many thanks to:

    Eric Atwell
    Christopher Brewster
    Gaël Dias
    Jane Edwards
    Nancy Ide
    Raf Salkie
    & Dominic Widdows

    The corpora I've been referred to are:

    BNC http://www.natcorp.ox.ac.uk/
    The Susanne Corpus http://www.grsampson.net/Resources.html info: http://www.grsampson.net/RSue.html
    and the American National Corpus (through its XML coding)

    Several comments also noted that many corpora code punctuation but that retrieving it can be tricky. Methods were generally corpus-specific, but one more general suggestion was to use the Java BreakIterator class.
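
    As a rough illustration of the BreakIterator suggestion, the sketch below uses java.text.BreakIterator's word instance to split a string into segments and then keeps those containing no letter or digit. The class name PunctuationTokens and its two helper methods are hypothetical, and treating "no letters or digits" as "punctuation" is an assumption made only for this example. Exact segment boundaries can vary with locale and Java version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class PunctuationTokens {

    // Split text at the word boundaries reported by BreakIterator,
    // dropping segments that consist only of whitespace.
    static List<String> tokens(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Assumption: a segment with no letter or digit counts as punctuation.
    static List<String> punctuation(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : tokens(text)) {
            if (piece.codePoints().noneMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "He called it a \"bargain\", apparently.";
        System.out.println(tokens(sample));
        System.out.println(punctuation(sample));
    }
}
```

    This works on raw text only. For a corpus that codes punctuation in its own markup, the corpus-specific retrieval methods mentioned above would still be needed.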

    Cautions and comments included watching out for differences between spoken and written English, and between American and British English.

    For readings I was referred to:
    Quirk, et al. A Comprehensive Grammar of the English Language.
         London; New York: Longman, 1985. x, 1779 p.
    &
    Parkes, M. B. (Malcolm Beckwith). Pause and Effect: An Introduction to the History of Punctuation in the West.
         Berkeley: University of California Press, 1993.

    Actually distinguishing scare quotes from all the other uses of ' and " marks is tricky, but I'm catching a high proportion by searching for a single word between a pair of marks, e.g. { " _ " }
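
    One way to approximate the { " _ " } search is a regular expression that looks for exactly one word between a matching pair of marks. The sketch below is an assumption about how such a search could be written, not the query actually used. The class name ScareQuoteCandidates and the choice to allow only ASCII letters and hyphens inside the marks are hypothetical. It handles straight ' and " only, and it will still return some false hits (single-word titles or quoted speech, for example).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ScareQuoteCandidates {

    // (?<![A-Za-z])  the opening mark must not follow a letter (skips the apostrophe in "didn't")
    // (["'])         the opening mark, captured so the closing mark must be the same character
    // ([A-Za-z][A-Za-z-]*)  exactly one word, hyphens allowed
    // \1             the matching closing mark
    // (?![A-Za-z])   the closing mark must not be followed by a letter
    private static final Pattern SINGLE_QUOTED_WORD = Pattern.compile(
            "(?<![A-Za-z])([\"'])([A-Za-z][A-Za-z-]*)\\1(?![A-Za-z])");

    // Return every single word found between a matching pair of marks.
    static List<String> candidates(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SINGLE_QUOTED_WORD.matcher(text);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "The so-called \"expert\" said \"not at all\" and didn't 'help'.";
        System.out.println(candidates(sample));
    }
}
```

    Multi-word quotations such as "not at all" are skipped because the pattern requires the closing mark immediately after the first word.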

    Thank you again

    Tim

    ______________________________________
    Tim Grant
    Forensic Section - School of Psychology
    University of Leicester
    106 New Walk
    Leicester LE1 7EA
    UK

    TG21@leicester.ac.uk
    http://www.le.ac.uk/psychology/tg21/

    + 44(0)116 252 3658 (Direct Line) - + 44(0)116 252 2451 (Secretary) - + 44(0)116 252 3994 (Fax)


