Re: Corpora: Reference

From: clark; alexander (asc@aclark.demon.co.uk)
Date: Tue Feb 13 2001 - 09:31:42 MET

    "Melamed, Dan" wrote:
    >
    > I don't know of any rigorous study on this topic, but the claim would follow
    > from two observations:
    >
    > 1. Any text corpus is but a sample of some (sub)language. As the sample
    > grows, it comes closer and closer to representing the whole population. The
    > WSJ has been around for quite a while, so it's likely to have used all of
    > the words in its (sub)language by now.
    >
    > 2. New words keep entering the (sub)language. 20 new words per month would
    > not be surprising, even if you exclude proper nouns and technospeak.
    >
    > IDM
    >

    I think these observations presuppose that at any given moment a language
    or sub-language has a well-defined, finite set of words in it. I am not
    sure I would agree with this, even if you consider an individual idiolect,
    given the productivity of certain morphological rules (e.g. writing,
    re-writing, re-rewriting, and so on) and other word-formation processes.
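
    As a minimal sketch of the productivity point (Python; the name
    re_forms and the hyphenation scheme are assumptions made purely for
    illustration), a single prefixing rule applied repeatedly already
    yields an unbounded set of distinct forms, so no finite word list can
    contain all of its output:

```python
from itertools import islice


def re_forms(stem):
    """Yield stem, re-stem, re-re-stem, ... without end (hypothetical rule)."""
    word = stem
    while True:
        yield word
        # Apply the productive prefix once more on every step.
        word = "re-" + word


# Take only the first few forms; the generator itself never runs out.
first_four = list(islice(re_forms("writing"), 4))
print(first_four)
```

    Whether speakers actually accept the deeper forms is an empirical
    question; the sketch only shows that the rule itself sets no upper
    bound.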

    More generally, this relates to the various observations about Zipfian
    distributions in the lexicon made by, e.g., Baayen and Gazdar.
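
    One way to examine the question on a real corpus is to count how many
    distinct word types have been seen after each token. If the vocabulary
    were a closed, finite set, this curve would eventually go flat. A
    rough sketch in Python follows; the function name vocabulary_growth
    and the toy sentence are assumptions for illustration, not data from
    any corpus discussed here:

```python
def vocabulary_growth(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy input only; a real test would use a large tokenised corpus.
sample = "the cat saw the dog and the dog saw the cat".split()
curve = vocabulary_growth(sample)
print(curve)
```

    On a toy sentence the curve flattens at once; the interesting case is
    whether it ever does so on a large corpus.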

    -- 
    Alexander Clark  asc@aclark.demon.co.uk  
    Alex.Clark@issco.unige.ch ISSCO / TIM, Ecole de Traduction et
    d'Interprétation,
    University of Geneva, Boulevard du Pont-d'Arve, CH-1211 Genève 4
    Tel: (+41) 022 7058682 Fax: 7058689
    


