Re: [Corpora-List] corpus transformations info - SUMMARY

From: PbIKOB_B.B. (rykov@narod.ru)
Date: Tue May 20 2003 - 08:58:57 MET DST


    Philosophers teach us that there are three sources of knowledge (K): R (reality or nature), M (the human mind) and S (sign structures, here texts).

    So a corpus of texts (CT) is one of the sources of K.

    In semiotic terms, we want to extract and encode the semantics of a CT covering a certain knowledge domain (KD). IMO we meet at least three problems here:

    1. It is hard to do this using sign transformations alone, without outside support from a human or a vocabulary/thesaurus.
    2. It is not a single-stage process. Each stage has its own specifications.
    3. The resulting K strongly depends on the pragmatic goals of the user.

    I think we should take these points into account when discussing the problem of formalising CT into K.
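
    As a rough illustration of the three points above, here is a minimal Python sketch of a staged CT-to-K transformation. It is an assumption of mine, not a method taken from any of the papers listed here: stage 1 uses a single "X such as Y" lexical pattern (in the spirit of Marti Hearst's pattern-based approach) as a pure sign transformation, stage 2 filters the candidates against a hand-supplied vocabulary, and stage 3 picks one of many possible output shapes. The names extract_pairs, filter_pairs and to_thesaurus are hypothetical, and there is no lemmatisation or multi-word term handling.

```python
import re
from collections import defaultdict

# Separator between listed hyponyms: ", ", ", and ", " and ", " or ", ...
_SEP = r"(?:\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+)"

# Stage 1 pattern (illustrative only): "<hypernym> such as <w1>, <w2> and <w3>"
SUCH_AS = re.compile(
    r"(?P<hyper>\w+),?\s+such as\s+(?P<hypos>\w+(?:" + _SEP + r"\w+)*)",
    re.IGNORECASE,
)
SPLIT = re.compile(_SEP, re.IGNORECASE)


def extract_pairs(text):
    """Stage 1: sign transformation only.

    Returns candidate (hyponym, hypernym) pairs, lowercased, in text order.
    """
    pairs = []
    for match in SUCH_AS.finditer(text):
        hyper = match.group("hyper").lower()
        for hypo in SPLIT.split(match.group("hypos")):
            pairs.append((hypo.lower(), hyper))
    return pairs


def filter_pairs(pairs, vocabulary):
    """Stage 2: outside support.

    Keeps only pairs whose hypernym appears in a human-supplied vocabulary.
    """
    return [(hypo, hyper) for hypo, hyper in pairs if hyper in vocabulary]


def to_thesaurus(pairs):
    """Stage 3: one user-dependent output shape.

    Builds a map from each broader term to its sorted narrower terms.
    """
    thesaurus = defaultdict(set)
    for hypo, hyper in pairs:
        thesaurus[hyper].add(hypo)
    return {hyper: sorted(hypos) for hyper, hypos in thesaurus.items()}


if __name__ == "__main__":
    sample = (
        "Metals such as copper, iron and zinc are mined here. "
        "Things such as luck or fate are not."
    )
    candidates = extract_pairs(sample)
    kept = filter_pairs(candidates, vocabulary={"metals"})
    print(to_thesaurus(kept))
```

    A user with a different goal (say, RDF triples rather than a thesaurus) would replace only stage 3, and the vocabulary passed to stage 2 is where the human or thesaurus support from point 1 enters.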

    Here are three papers discussing this problem:

    =================

    My colleague Scott Cederberg and I have worked pretty extensively on this
    problem over the last couple of years. The following 3 papers give a good
    overview:

    Learning taxonomic information directly from corpora:
    http://infomap.stanford.edu/papers/hyponymy.pdf

    Building lexical classes from "seed examples":
    http://infomap.stanford.edu/papers/lexical-graphs.ps

    Enriching an existing taxonomy / lexicon with new terms:
    http://infomap.stanford.edu/papers/enrich-taxonomies.pdf

    These methods build on earlier work, particularly by Marti Hearst, Hinrich
    Schutze, Ellen Riloff and Eugene Charniak.

    Best wishes,
    Dominic

    ===================================

    > I would be grateful for any source of information (link, paper, etc.) on transforming a corpus that covers a knowledge domain or a specified subject into a knowledge structure such as a thesaurus, an ontology or an RDF file.
    >
    >
    >--
    >
    > P bI K O B B.B. MOCKBA
    >
    >Vladimir Rykov, PhD in Computational Linguistics,
    > MOSCOW
    >http://rykov.narod.ru/
    >Engl. http://www.blkbox.com/~gigawatt/rykov.html
    >Tel +7-903-749-19-99
    >




