RE: Corpora: Reference

From: Melamed, Dan (Dan.Melamed@westgroup.com)
Date: Mon Feb 12 2001 - 18:28:01 MET


    I don't know of any rigorous study on this topic, but the claim would follow
    from two observations:

    1. Any text corpus is but a sample of some (sub)language. As the sample
    grows, it comes closer and closer to representing the whole population. The
    WSJ has been around for quite a while, so it's likely to have used all of
    the words in its (sub)language by now.

    2. New words keep entering the (sub)language. 20 new words per month would
    not be surprising, even if you exclude proper nouns and technospeak.

    IDM

    > -----Original Message-----
    > From: Mari Olsen [mailto:molsen@microsoft.com]
    > Sent: Monday, February 12, 2001 10:51 AM
    > To: corpora@hd.uib.no
    > Cc: John Nave
    > Subject: Corpora: Reference
    >
    >
    > Can anyone provide a reference for a purported study, in which someone
    > analyzed the Wall Street Journal for new words, the number of which tailed
    > off to 20 words per (month? week?) after a certain point? Or is this an NLP
    > urban legend? A colleague recalls Mitch Marcus pointing out that the rate of
    > new word occurrences does not asymptote but rather continues at some small
    > but non-trivial rate, but not whether this is Marcus' own study, an
    > observation, or a reference to another work.
    >
    > Thanks,
    >
    > Mari Olsen
    > Microsoft-Natural Language Group
    >


