Re: Corpora: Reference

From: clark; alexander (asc@aclark.demon.co.uk)
Date: Tue Feb 13 2001 - 09:31:42 MET

Next message: clark; alexander: "Re: Corpora: Reference"

Previous message: Jem Clear: "Corpora: Mark Mitchus"
Maybe in reply to: Mari Olsen: "Corpora: Reference"

"Melamed, Dan" wrote:
>
> I don't know of any rigorous study on this topic, but the claim would follow
> from two observations:
>
> 1. Any text corpus is but a sample of some (sub)language. As the sample
> grows, it comes closer and closer to representing the whole population. The
> WSJ has been around for quite a while, so it's likely to have used all of
> the words in its (sub)language by now.
>
> 2. New words keep entering the (sub)language. 20 new words per month would
> not be surprising, even if you exclude proper nouns and technospeak.
>
> IDM
>

I think these observations presuppose that at any given moment a
language or sub-language has a well-defined, finite set of words in it.
I am not sure I would agree with this, even for an individual idiolect,
given the productivity of certain morphological rules (e.g. writing,
re-writing, re-rewriting, and so on) and other word-formation processes.
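
To make the productivity point concrete, here is a minimal sketch in
Python. The function name prefixed_forms and the hyphenated spelling of
the output are illustrative assumptions only; the point is simply that a
single recursive rule yields a distinct form at every step, so the set
of possible words has no fixed upper bound.

```python
from itertools import islice


def prefixed_forms(stem, prefix="re-"):
    """Yield stem, prefix+stem, prefix+prefix+stem, ... without end.

    Hypothetical helper: it models one productive rule (prefixation)
    applied recursively to its own output.
    """
    form = stem
    while True:
        yield form
        form = prefix + form


# Take only the first few forms; the generator itself never terminates.
first_four = list(islice(prefixed_forms("writing"), 4))
print(first_four)
```

Most of these forms will never turn up in any corpus, but no principled
cut-off separates the ones that are "in" the language from the ones that
are not.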

More generally this relates to the various observations about Zipfian
distributions in the lexicon made by e.g. Baayen, Gazdar and so on.
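
As a rough illustration of why Zipfian distributions matter here, the
sketch below samples tokens from a made-up vocabulary whose
probabilities fall off as 1/rank, and counts how many distinct types
have been seen at a few sample sizes. The vocabulary size, the
checkpoints, the seed and the function name vocabulary_growth are all
assumptions chosen for the example; this is a toy simulation, not a
method taken from the work cited above.

```python
import random


def vocabulary_growth(n_tokens, n_types=50_000,
                      checkpoints=(1_000, 5_000, 20_000), seed=0):
    """Sample tokens from a 1/rank distribution and record the number
    of distinct types seen at each checkpoint.

    All parameters are illustrative assumptions, not corpus estimates.
    """
    rng = random.Random(seed)
    ranks = list(range(1, n_types + 1))
    weights = [1.0 / r for r in ranks]  # unnormalised Zipf-like weights
    tokens = rng.choices(ranks, weights=weights, k=n_tokens)

    seen = set()
    growth = {}
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i in checkpoints:
            growth[i] = len(seen)
    return growth


growth = vocabulary_growth(20_000)
for size, types in sorted(growth.items()):
    print(size, types)
```

Even with a deliberately finite vocabulary, the count of distinct types
is still rising at the largest sample, because so much of the
probability mass sits in rare items. A sample can therefore be very
large and still be far from having "used all of the words".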

-- 
Alexander Clark  asc@aclark.demon.co.uk  
Alex.Clark@issco.unige.ch
ISSCO / TIM, Ecole de Traduction et d'Interprétation,
University of Geneva, Boulevard du Pont-d'Arve, CH-1211 Genève 4
Tel: (+41) 022 7058682 Fax: 7058689

