[Corpora-List] Summary: Corpus with restricted vocabulary

From: Klebanov Beata (beata@cs.huji.ac.il)
Date: Tue Jan 27 2004 - 10:19:26 MET

    Dear Corpora members,

    This is a summary of the replies I received to my query of 18 Jan:

    >
    > For my research on textual manifestations of common knowledge, I am
    > looking for a corpus of short English texts based on restricted vocabulary
    > (up to ~500 different NP, VP heads), to be used for training machine
    > learning tools sensitive to vocabulary size.

    I would like to thank Brett Reynolds, Eric Atwell, Joel Walters and Andrew
    Harley for providing pointers.

    Here is the summary of replies:

    (1) Andrew Harley <aharley@cambridge.org>
    from Cambridge University Press suggested using
    learner's dictionaries whose definitions are based on a restricted
    vocabulary; for example, the Cambridge learner's dictionary, which can be
    licensed. More info here:
    http://dictionary.cambridge.org/researchers.htm

    He also suggested using ELT readers at different levels that might meet
    the restricted vocabulary requirement. The first level restricts the
    vocabulary to 400 headwords; at this level, there are 6 books of about 30
    pages including pictures. Samples from the readers can be viewed
    here: http://publishing.cambridge.org/ge/elt/readers/26777/
    The readers have not yet been licensed for use as a corpus, but
    Andrew Harley thinks it might be possible if there is demand and if the
    authors agree.

    In a similar spirit, Brett Reynolds <brett@forsyths.ca> suggested the
    Oxford Bookworms series of graded readers; more information can be found
    here: http://www.oup.com/elt/global/catalogue/readers/
    Some short samples are available from the site.
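
    As a rough way to check whether a candidate text stays within a
    vocabulary limit such as the 400 headwords mentioned above, one could
    count its distinct word forms. The sketch below is only an illustration:
    the function names and the sample sentence are hypothetical, and it counts
    surface word types rather than true headwords (no lemmatisation), so it
    will tend to overestimate the headword count.

```python
import re


def word_types(text):
    """Return the set of distinct lowercased word forms in text.

    Assumption: a "word" is a run of ASCII letters, optionally followed by
    an apostrophe and more letters (e.g. "cat's"). This is a crude proxy
    for headwords, since inflected forms are counted separately.
    """
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


def within_vocabulary_limit(text, limit=400):
    """Return True if the text uses at most `limit` distinct word forms."""
    return len(word_types(text)) <= limit


# Hypothetical sample text, not taken from any of the readers.
sample = "The cat sat on the mat. The cat's mat was red."
print(len(word_types(sample)), within_vocabulary_limit(sample))
```

    A headword-based count would need a lemmatiser or the publisher's own
    headword list in place of the regular expression.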

    (2) Joel Walters <waltej@netvision.net.il> has a small corpus of native
    English texts collected for an experimental procedure involving writing
    syntheses/summaries of two source texts. The corpus totals about 20,000
    words, and individual texts range from 50 to 600 words.

    (3) Eric Atwell referred me to Dr Caroline Lyon of the University of
    Hertfordshire <C.M.Lyon@herts.ac.uk>, who used a restricted English corpus
    for her 1994 PhD: http://homepages.feis.herts.ac.uk/~comrcml/Lyon-thesis.ps

    Thanks to all who replied,

    Beata.


