[Corpora-List] Summary: Speech corpora by register

From: L Carmichael (lesley@u.washington.edu)
Date: Sun Jun 01 2003 - 19:33:24 MET DST


    Hi Corpora List,

    Some time ago, I asked the list for leads on (American English) speech
    corpora of different registers of speech (i.e., controlled or labeled for
    context, such as 'teacher talk,' 'doctor talk,' speech directed at
    non-native speakers, lectures, casual speech between friends, etc.). Many
    people wrote to me with recommendations for corpora that would be suitable
    for text analysis. While my goal is actually to find corpora of SOUND
    files, this information is still tremendously helpful (thank you!). I
    finally present you with a summary:

    1. Santa Barbara Corpus of Spoken American English (from LDC)
    2. Switchboard (LDC)
    3. CallHome (LDC)
    4. MICASE (Michigan Corpus of Academic Spoken English) (freely available,
    searchable online - http://www.lsa.umich.edu/eli/micase/micase.htm)
    5. Saarbruecken Corpus of Spoken English (limited genres, mostly jokes)
    6. T2K-SWAL (not publicly available)
    7. Corpus of Spoken Professional English
    (http://www.athel.com/corpdes.html)
    8. Longman Grammar of Spoken and Written English (not publicly available;
    overlaps with British National Corpus and Santa Barbara corpus)
    9. British National Corpus (sound files may be available)
    10. British Academic Spoken English
    11. Dialogue Diversity Corpus (no speech files)
    http://www-rcf.usc.edu/~billmann/diversity
    12. Intonational Variation in English (IViE)
    13. The London-Lund Corpus of Spoken English
    14. The Lancaster/IBM SEC Corpus, The Machine-Readable Corpus of Spoken
    English
    15. The Wellington Corpus of Spoken New Zealand English (WSC)
    16. The Bergen Corpus of London Teenage Language (COLT)
    17. The International Corpus of English - East African component
    18. The Polytechnic of Wales Corpus (children talking)
    (13-18 from ICAME, corpora and manuals available -
    http://www.hit.uib.no/icame/cd/)
    19. CIRCLE Corpus, http://www.pitt.edu/~circle/Archive.htm
    20. TRAINS Dialogue Corpus
    http://www.cs.rochester.edu/research/cisd/resources/trains.html
    21. ICE Singapore English Corpus
    http://www-rcf.usc.edu/~billmann/diversity/ICE-SIN_Manual.PDF
    22. Corpus meta-site http://devoted.to/corpora

    Also, I want to share with you some of the comments I received:

    1. One researcher who is extracting dialogue patterns mentioned that the
    variation in annotation/markup presents problems for such work.
    2. One researcher is seeking corpora of internet chat, so please post to
    the list if you know of any!
    3. It's clear that there are more well-developed resources for British
    English than for American English.
    4. Actual sound files are hard to come by (*please* post if you know of
    any resources for American English speech not listed here!)
    5. The researchers who responded to me were also interested in hearing of
    other spoken American English corpora (please post if you know of others
    not mentioned herein)

    Thank you to all who helped me (David Lee, Bill Mann, Eric Atwell, Eric
    Breck)! Your detailed assistance is sincerely appreciated!

    Lesley Carmichael
    Department of Linguistics
    University of Washington


