Corpora: term clustering technique

TURENNE Nicolas (turenne@neurone.com)
Fri, 20 Feb 1998 10:45:39 +0100

Dear Sir,
First, thank you very much to those who answered me.
I wanted to know which mailing lists deal with statistics and corpora.
It seems that only this mailing list gathers messages in this field. One other list may be a candidate:
elsenet-list@let.ruu.nl
People seem to say that everyone uses statistics. I agree that a corpus is a wonderful way
to observe invariants through their frequency; but for me, using frequencies alone is not a complete approach
to data analysis. A complete approach has to embed the use of frequencies in an
analytical methodology.
Someone said that I should specify more precisely the field I am looking for. I am interested in approaches that develop
clustering techniques to group terms together into relevant semantic classes, called concepts.
I know some techniques, such as Kohonen neural networks, descending (divisive) hierarchical classification, partitioning by chi-square
or by the k-means method, co-word analysis, relational analysis, and the Cobweb approach.
I tested these methods against a real-world hierarchy, to classify given terms located in a corpus.
The conceptual classes obtained do not match the ideal real-world classes well: the correlation is less than 40% (see my paper
submitted to the ECML'98 workshop on text mining).
So my specific question is: does anyone know an efficient data analysis method that uses co-occurrence matrices to classify a given list of terms
contained in a given corpus?
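To make the setting concrete, here is a minimal sketch of the kind of processing I mean. It is only an illustration under stated assumptions, not one of the methods tested above: co-occurrence is counted within a sentence, terms are compared by cosine similarity, and classes are formed by average-link agglomerative clustering. The function names, the toy corpus, and the stopword list are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(sentences, terms, stopwords=frozenset()):
    """For each target term, count the words sharing a sentence with it."""
    vectors = {t: Counter() for t in terms}
    for sentence in sentences:
        words = [w for w in sentence.lower().split() if w not in stopwords]
        present = set(words)
        for t in terms:
            if t in present:
                vectors[t].update(w for w in words if w != t)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def average_similarity(vectors, a, b):
    """Mean pairwise cosine similarity between two groups of terms."""
    total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def cluster_terms(vectors, k):
    """Average-link agglomerative clustering, stopped at k classes."""
    clusters = [[t] for t in sorted(vectors)]
    while len(clusters) > k:
        # Merge the two classes that are most similar on average.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(vectors, clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy corpus (hypothetical, for illustration only).
sentences = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the bank raised the interest rate",
    "the loan has a high interest rate",
]
terms = ["cat", "dog", "bank", "loan"]
stopwords = frozenset({"the", "a", "has"})

classes = cluster_terms(cooccurrence_vectors(sentences, terms, stopwords), k=2)
print(classes)
```

On this toy corpus the two classes separate the animal terms from the financial ones, but only because the stopwords are removed first; on a real corpus the choice of context window, weighting, and number of classes are exactly the points where I am looking for advice.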
Thank you.
Best regards,

Nicolas Turenne
------------------------------------------------------------------------------------------------------------------
Neurone Informatique/Neurocim
12A rue de la Faisanderie
67 000 Lingolsheim
tel 03 88 78 71 71

ENSAIS/Univ Louis-Pasteur
LIIA/ERIC
24 Bld de la Victoire
67000 Strasbourg
tel 03 88 14 47 53
http://www-ensais.u-strasbg.fr/ERIC/liia-gtln/people/turenne/welcome.html