Re: [Corpora-List] Looking for sentence similarity corpus

From: devans@cs.columbia.edu
Date: Thu Sep 23 2004 - 16:23:00 MET DST

    > Dear Corpora members,
    >
    > I'm looking for a sentence-similarity corpus, i.e., a collection of
    > sentences with manually assigned similarities to other sentences. Any
    > ideas?
    >
    > Thanks in advance,
    > Gilad
    >
    >
    > --
    > Informatics Institute * University of Amsterdam
    > Kruislaan 403 * 1098 SJ Amsterdam * The Netherlands
    > http://ilps.science.uva.nl * +31 20 525 6731/7561/7490 (fax)

    Hello Gilad,

      We have a small corpus like that at Columbia that we used to train
    SimFinder. It is a set of 8 clusters of documents, with similar
    sentences marked within the clusters. The sentences were marked by two
    people, and they later adjudicated their markup until the two judges
    agreed on the annotation. Sentences are marked as either similar or not
    similar. There are 34 articles in total across the 8 clusters; the
    entire training set has about 20,000 sentences, of which 480 are marked
    as similar.

      Let me know if you have any questions; I believe we can release this
    data, but I might have to look into it a bit.

    Dave


