Re: [Corpora-List] corpus transformations info - SUMMARY

From: PbIKOB_B.B. (rykov@narod.ru)
Date: Tue May 20 2003 - 08:58:57 MET DST

Next message: D Elliott: "[Corpora-List] Parallel texts for machine translation evaluation"

Previous message: Linguistic Data Consortium: "[Corpora-List] New LDC Publications"
In reply to: PbIKOB_B.B.: "[Corpora-List] corpus transformations info"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Philosophers teach us that there are three sources of knowledge (K): R (reality or nature), M (the human mind) and S (sign structures, here texts).

So a corpus of texts (CT) is one of the sources of K.

In semiotic terms we want to extract and encode semantics from a CT covering a certain knowledge domain (KD). IMO we meet at least three problems here:

1. It is hard to do this using sign transformations alone, without outside support from a human or a vocabulary/thesaurus.
2. It is not a single-stage process. Each stage has its own specifications.
3. The resulting K depends strongly on the pragmatic goals of the user.

I think we should take these points into account when discussing the problem of formalising a CT into K.
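
To make the three points concrete, here is a minimal sketch in Python. It is my own illustration and is not taken from any of the papers cited in this summary. The toy corpus, the single "such as" pattern, the hand-made vocabulary and the function names are all assumptions chosen for the example. Stage 1 is a pure sign transformation, stage 2 adds outside vocabulary support, and stage 3 picks an encoding that would in practice depend on the user's goal.

```python
import re

# Hypothetical toy corpus; the sentences are invented for illustration.
CORPUS = [
    "The samples contain metals such as iron and copper.",
    "We studied languages such as Russian and English.",
]

# Stage 1: a pure sign transformation. One lexical pattern of the form
# "<hypernym> such as <hyponym> [and <hyponym>]" (an assumption; real
# systems use many patterns and linguistic preprocessing).
PATTERN = re.compile(r"(\w+) such as (\w+)(?: and (\w+))?")


def extract_pairs(sentences):
    """Return (hyponym, hypernym) candidates found by the pattern."""
    pairs = []
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            hypernym = match.group(1).lower()
            for hyponym in match.groups()[1:]:
                if hyponym:  # the second hyponym is optional
                    pairs.append((hyponym.lower(), hypernym))
    return pairs


# Stage 2: outside support. Keep only pairs whose terms appear in a
# vocabulary supplied by a human or an existing thesaurus.
def filter_pairs(pairs, vocabulary):
    return [(a, b) for a, b in pairs if a in vocabulary and b in vocabulary]


# Stage 3: encoding. Plain "is_a" strings here; another user might want
# a thesaurus entry or RDF triples instead.
def to_triples(pairs):
    return [f"{hypo} is_a {hyper}" for hypo, hyper in pairs]


if __name__ == "__main__":
    vocabulary = {"iron", "copper", "metals"}  # hypothetical domain terms
    candidates = extract_pairs(CORPUS)
    for triple in to_triples(filter_pairs(candidates, vocabulary)):
        print(triple)
```

With the vocabulary above, the pairs about languages are found in stage 1 but dropped in stage 2, which shows how the resulting K changes with the support and goals supplied from outside the text.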

Here are three papers discussing this problem:

=================

My colleague Scott Cederberg and I have worked pretty extensively on this
problem over the last couple of years. The following 3 papers give a good
overview:

Learning taxonomic information directly from corpora:
http://infomap.stanford.edu/papers/hyponymy.pdf

Building lexical classes from "seed examples":
http://infomap.stanford.edu/papers/lexical-graphs.ps

Enriching an existing taxonomy / lexicon with new terms:
http://infomap.stanford.edu/papers/enrich-taxonomies.pdf

These methods build on earlier work, particularly by Marti Hearst, Hinrich
Schütze, Ellen Riloff and Eugene Charniak.

Best wishes,
Dominic

===================================

> I would be grateful for any source of information (link, paper, etc.) on transforming a corpus that covers a knowledge domain or a specified subject into a knowledge structure such as a thesaurus, ontology, RDF file, etc.

P bI K O B B.B. MOCKBA

Vladimir Rykov, PhD in Computational Linguistics,
MOSCOW
http://rykov.narod.ru/
Engl. http://www.blkbox.com/~gigawatt/rykov.html
Tel +7-903-749-19-99



This archive was generated by hypermail 2b29 : Tue May 20 2003 - 09:03:41 MET DST