Re: [Corpora-List] corpus ------>>>>> thesaurus

From: Rob Koeling (robk@sussex.ac.uk)
Date: Fri Nov 12 2004 - 18:07:35 MET


    Hello Vladimir,

    I am working on creating domain specific thesauruses at the moment.
    Creating these thesauruses is not a goal in itself, but a means to
    create domain specific rankings of word senses. I am working on ranking
    word senses with Diana McCarthy, John Carroll and Julie Weeds. You can
    read more on this in our ACL-2004 paper.

    In this paper we describe an experiment with domain specific sense
    rankings for the Sports and Finance domains. We built one corpus of
    Sports texts and one of Finance texts from the Reuters corpus. We used
    all the documents in the Reuters corpus with a Sports label (topic code
    GSPO) and, I think, about a third of the Finance related texts (topic
    codes ECAT and MCAT).
    The resulting corpora were 9.1 million words and 32.5 million words
    respectively. We created the thesauruses using Lin's method. You can
    find the details of how we created the thesauruses in the paper. I'm not
    sure if we can distribute the resulting thesauruses. I'll have to look
    at the Reuters license.
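
    As a rough illustration of what building a thesaurus with a Lin-style
    similarity measure can look like, here is a minimal sketch. It is not
    the setup from the paper. It assumes the corpus has already been parsed
    into (word, relation, co-word) triples, and the function names, the
    "subj-of" relation label and the toy data are all hypothetical. The
    weighting and similarity formulas follow the measure as it is usually
    stated for Lin (1998).

```python
import math
from collections import defaultdict


def information_weights(triples):
    """Map each word to its features (relation, co_word) with positive
    information I(w, r, w2) = log(c(w,r,w2) * c(*,r,*) / (c(w,r,*) * c(*,r,w2))).

    `triples` is an iterable of (word, relation, co_word) tuples; a triple
    seen n times in the corpus appears n times in the iterable.
    """
    triple_count = defaultdict(int)   # c(w, r, w2)
    word_rel = defaultdict(int)       # c(w, r, *)
    rel_coword = defaultdict(int)     # c(*, r, w2)
    rel_total = defaultdict(int)      # c(*, r, *)
    for w, r, w2 in triples:
        triple_count[(w, r, w2)] += 1
        word_rel[(w, r)] += 1
        rel_coword[(r, w2)] += 1
        rel_total[r] += 1

    weights = defaultdict(dict)
    for (w, r, w2), c in triple_count.items():
        info = math.log(c * rel_total[r] / (word_rel[(w, r)] * rel_coword[(r, w2)]))
        if info > 0:                  # keep only positively associated features
            weights[w][(r, w2)] = info
    return weights


def lin_similarity(weights, w1, w2):
    """Information of shared features divided by the total information of both words."""
    f1 = weights.get(w1, {})
    f2 = weights.get(w2, {})
    shared = f1.keys() & f2.keys()
    numerator = sum(f1[f] + f2[f] for f in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0


def nearest_neighbours(weights, word, k=5):
    """A thesaurus entry: the k words most similar to `word`, best first."""
    scored = [(lin_similarity(weights, word, other), other)
              for other in weights if other != word]
    scored.sort(reverse=True)
    return scored[:k]


# Toy data (hypothetical): two "sports" nouns and two "finance" nouns.
toy_triples = [
    ("striker", "subj-of", "score"), ("forward", "subj-of", "score"),
    ("striker", "subj-of", "shoot"), ("forward", "subj-of", "shoot"),
    ("bank", "subj-of", "lend"), ("lender", "subj-of", "lend"),
    ("bank", "subj-of", "invest"), ("lender", "subj-of", "invest"),
]
toy_weights = information_weights(toy_triples)
toy_entry = nearest_neighbours(toy_weights, "striker", k=3)
```

    With the toy data, the entry for "striker" ranks "forward" first, since
    the two words share all their features, while "bank" and "lender" share
    none and score zero. On a real corpus the triples would come from a
    parser, and a frequency cut-off on words and features is usually needed
    to keep the computation manageable.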

    The articles in the Reuters corpus are hand-tagged, so the resulting
    domain specific corpora should be of high quality. Unfortunately, very
    little hand-annotated data is available. At the moment I am setting up
    an experiment to harvest texts from the web in order to create domain
    specific corpora. We have selected some 40 different domains (from the
    Subject Field Codes hierarchy, see ref. in ACL paper) and created a text
    classifier for these domains. These corpora will be used to create
    domain specific thesauruses. We want to use these thesauruses to create
    specific word sense rankings for all these domains.

    I can't say anything yet about how high the quality of these domain
    corpora will be. I hope to be able to say more about this in a couple of
    months. I don't see any reasons why we wouldn't be able to share these
    thesauruses.

    Best,

      - Rob Koeling

    On Tue, 9 Nov 2004, Rykov V.V. (Moscow) wrote:

    >
    > I would be very grateful to anyone for any info concerning
    > compiling thesaurus from corpus (esp. from corpus of specific domain
    > documents).
    >
    > As example - thesaurus of financial terms compiled from financial
    > documents corpus.
    >
    > Best wishes to all our corpus society !
    >
    > --
    > Regards Vladimir Rykov
    >
    > PhD in Computational Linguistics
    > Personal web-site: rykov.narod.ru
    > mailto: rykov2000@mail.ru
    > Si etiam omnes - ego non
    > English version: www.blkbox.com/~gigawatt/rykov.html
    >



    This archive was generated by hypermail 2b29 : Sun Nov 14 2004 - 22:42:14 MET