(no subject)

root@bham.ac.uk
Thu, 28 Nov 1996 15:55:50 +0000

Received: from noralf.uib.no by bham.ac.uk with SMTP (PP); Sat, 2 Nov 1996 18:16:29 +0000
Received: from uib.no by noralf.uib.no id <26041-0@noralf.uib.no>;
Sat, 2 Nov 1996 19:09:59 +0100
Old-Received: from nora.hd.uib.no by noralf.uib.no with SMTP (PP); Sat, 2 Nov
1996 19:09:45 +0100
Old-Received: from deacon.cogsci.ed.ac.uk (deacon.cogsci.ed.ac.uk
[129.215.144.7]) by nora.hd.uib.no (8.7.5/8.7.3) with SMTP id
TAA25253 for <corpora@hd.uib.no>; Sat, 2 Nov 1996 19:09:39 +0100
(MET)
Old-Received: from dialup-2.cogsci.ed.ac.uk (dialup-2.cogsci.ed.ac.uk
[129.215.144.42]) by deacon.cogsci.ed.ac.uk (8.6.10/8.6.12) with
SMTP id SAA16262; Sat, 2 Nov 1996 18:09:20 GMT
Date: 3 Nov 96 18:07:14 +0000
Subject: Re: similarity vector of a passage
Reply-To: Chris Brew <chrisbr@cogsci.ed.ac.uk>
From: Chris Brew <chrisbr@cogsci.ed.ac.uk>
To: Yaakov Yaari <yyaari@netvision.net.il>
Cc: corpora@hd.uib.no, or@mail.netvision.net.il
X-Mailer: Cyberdog/1.1
MIME-Version: 1.0
Message-Id: <AEA28FDB-A080@129.215.144.42>
Content-Type: multipart/alternative; boundary="Cyberdog-AltBoundary-00009E51"
Content-Transfer-Encoding: 7bit
Sender: owner-corpora@lists.uib.no
Precedence: bulk
Resent-Date: Sat, 2 Nov 1996 19:09:59 +0100
Resent-From: corpora-request@lists.uib.no

--Cyberdog-AltBoundary-00009E51
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

>Could you give me advice on this: I want to form a similarity vector
>(in the Salton sense) for a passage of text, so I can compare it to
>another passage in the same document.
>
>At the moment I am using a simplistic approach where the tf and idf of
>a word are fixed within a document and depend on the document
>collection. I have gathered a collection of articles which make up my
>collection (15K words total).
>
>Any ideas?
>
>Yaakov Yaari

You should certainly read about Marti Hearst's TextTiling approach to
the problem. Look on the cmp-lg server at http://xxx.lanl.gov/ to find
that paper. The tf/idf approach is known to work pretty well for
document retrieval, but it isn't very clear why this is, so it is
difficult to assess how well the method will transfer to the task of
section retrieval. I believe Richard Sutcliffe, in Limerick, did some
work on getting the "right" section of computer manuals using IR
methods.
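
To make the tf/idf comparison concrete, here is a minimal sketch of how
two passages might be compared. It assumes idf is estimated once from the
whole collection and tf is counted inside each passage; the function
names, the crude tokenizer, and the log(N/df) weighting are my own
illustrative choices, not anything taken from your setup or from Salton's
exact formulation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately crude: lowercase, keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())


def idf_table(documents):
    # documents: list of token lists, one per document in the collection.
    # idf(t) = log(N / df(t)), where df(t) counts documents containing t.
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def passage_vector(tokens, idf):
    # Sparse tf*idf vector for one passage; terms unseen in the
    # collection get weight 0.0 (an assumption of this sketch).
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection, purely for illustration.
collection = [
    tokenize("the cat sat on the mat"),
    tokenize("the dog chased the cat"),
    tokenize("the stock market fell"),
]
idf = idf_table(collection)
p1 = passage_vector(tokenize("the cat sat"), idf)
p2 = passage_vector(tokenize("the cat chased"), idf)
p3 = passage_vector(tokenize("stock market"), idf)
print(cosine(p1, p2), cosine(p1, p3))
```

Note that with a collection as small as 15K words the idf estimates
will be noisy, and a word occurring in every document gets weight zero
under this weighting.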

One of the more obvious questions is: how do you decide where passages
begin and end? Do you have a ready-made answer in your application?
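
If you have no ready-made answer, one crude option is to cut the text
into fixed-size blocks of tokens, score each gap between adjacent
blocks by the cosine similarity of their word counts, and place
boundaries at low points. The sketch below is only in the spirit of
block comparison; it is not Hearst's TextTiling algorithm, and the
block size, the threshold, and all function names are assumptions
chosen for illustration.

```python
import math
from collections import Counter


def count_cosine(a, b):
    # Cosine similarity between two Counters of raw term counts.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_scores(tokens, block_size):
    # Split the token stream into consecutive blocks and score each
    # gap between neighbouring blocks.
    blocks = [tokens[i:i + block_size]
              for i in range(0, len(tokens), block_size)]
    counts = [Counter(block) for block in blocks]
    return [count_cosine(counts[i], counts[i + 1])
            for i in range(len(counts) - 1)]


def find_boundaries(scores, threshold):
    # A gap is a boundary if its score is below the threshold and is a
    # local minimum. Returns the index of the block that starts each
    # new passage.
    boundaries = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if score < threshold and score <= left and score <= right:
            boundaries.append(i + 1)
    return boundaries


# Toy token stream: two "topics" of eight tokens each.
tokens = ["cat", "dog"] * 4 + ["stock", "bond"] * 4
scores = gap_scores(tokens, block_size=4)
print(find_boundaries(scores, threshold=0.5))
```

On real text the raw gap scores are jumpy, so some smoothing over
neighbouring gaps would probably be needed before picking minima.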

C

---------------------------------------------------
This message was created and sent using the Cyberdog Mail System
---------------------------------------------------

--Cyberdog-AltBoundary-00009E51--