Re: requests for corpora

Jacques Guy (j.guy@trl.telstra.com.au)
Tue, 25 Feb 1997 10:52:09 -0800

Christopher Bader wrote:

> This list gets lots of requests for corpora.
[...]
> Consequently, some people post their responses to the list, too, wasting
> a lot of band-width, and making it hard to keep track of the answers.
[...]
> Thoughts?

Yes. I have noticed this tendency, and also those requests for
word frequencies.

First, a search on good Web search engines -- AltaVista for
instance -- will usually unearth everything you ever wanted
to know, corpora included, in and about the most obscure
languages. It takes only a bit of patience.

Second, those requests for word frequencies. They are the height
of silliness, for two reasons.

One. To find the word frequencies in a language, any language,
just get a corpus of texts and count the words!
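
To show how little is involved, here is a minimal Python sketch.
The function name word_frequencies and the crude tokenization
(lowercase, runs of letters with optional internal apostrophes)
are assumptions made for illustration; a real count would need
a tokenizer suited to the language at hand.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count word occurrences in a string (hypothetical helper)."""
    # Assumption: a "word" is a run of letters, optionally joined by
    # apostrophes. This is far too crude for many languages.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(words)

# Toy text standing in for a corpus; in practice, read your files here.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)

# Print the three most frequent words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

For a real corpus one would read the text from files and feed it
to the same function; the counting itself does not change.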

Two. Word frequencies, except for the most frequent words
(grammatical particles), will vary from sample to sample.
Until one has decided what constitutes a representative
sample for one's purposes, it is futile to speak of
word frequencies. None of the requests I have seen, here,
or relayed from LINGUIST, or on sci.lang, shows an
awareness of this.
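
A rough way to see the point is to compute relative frequencies
in two samples and put them side by side. The sketch below is
only an illustration: the helper names and the two toy sentences
are invented, and real samples would of course be far larger.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    # Assumption: words are runs of letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(words).items()}

def compare_samples(text_a, text_b):
    """Pair up the relative frequency of every word in both samples."""
    fa = relative_frequencies(text_a)
    fb = relative_frequencies(text_b)
    return {w: (fa.get(w, 0.0), fb.get(w, 0.0))
            for w in sorted(set(fa) | set(fb))}

# Two invented toy samples on different subjects.
sample_a = "the ship left the harbour and the crew sang"
sample_b = "the court heard the appeal and the judge ruled"
table = compare_samples(sample_a, sample_b)

for word, (share_a, share_b) in table.items():
    print(f"{word:10s} {share_a:.2f} {share_b:.2f}")
```

In these toy samples the grammatical words ("the", "and") have
the same share in both, while the content words each appear in
only one of the two.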

To me, those queries show laziness and are a mild annoyance.