Re: [Corpora-List] Looking for sentence similarity corpus

From: devans@cs.columbia.edu
Date: Thu Sep 23 2004 - 16:23:00 MET DST

Next message: lebron letchev: "[Corpora-List] Corpus bank of English Linguistic Software Applications ."

Previous message: Jason Eisner: "[Corpora-List] Call for Proposals: JHU Summer Workshop on Language Engineering"
In reply to: Gilad Mishne: "[Corpora-List] Looking for sentence similarity corpus"

> Dear Corpora members,
>
> I'm looking for a sentence-similarity corpus, i.e., a collection of
> sentences with manually assigned similarities to other sentences. Any
> ideas?
>
> Thanks in advance,
> Gilad
>
>
> --
> Informatics Institute * University of Amsterdam
> Kruislaan 403 * 1098 SJ Amsterdam * The Netherlands
> http://ilps.science.uva.nl * +31 20 525 6731/7561/7490 (fax)

Hello Gilad,

We have a small corpus like that at Columbia, which we used to train
SimFinder. It is a set of 8 clusters of documents, with similar
sentences marked within each cluster. The sentences were marked by two
people, who then adjudicated their markup until they agreed on the
annotation. Sentences are marked as either similar or not similar.
There are 34 articles in total across the 8 clusters. The entire
training set has about 20,000 sentences, of which 480 are marked as
similar.

Let me know if you have any questions; I believe we can release this
data, but I might have to look into it a bit.

Dave


This archive was generated by hypermail 2b29 : Thu Sep 23 2004 - 19:01:11 MET DST