Re: Seeking word stemmer for French

asmeaton@CompApp.DCU.IE
Mon, 23 Dec 1996 12:29:18 GMT

On 27 November I posted this to this list:
!
! I'm looking for a word stemmer for French ... usual conditions of
! being free and with no strings attached. We have been doing some
! work here at Dublin City University on information retrieval on
! large collections of text (250 Mbytes) as part of TREC. (One of)
! our approaches represents documents and queries by their character
! shape codes and we have found it works surprisingly well for English.
! We would like to see if it works at all for French as part of TREC-6,
! which will have a French language IR task.
!
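
For anyone unfamiliar with character shape coding: the idea is to replace
each character of a word by a coarse "shape class" (ascender, descender,
x-height, capital, digit) and to index the resulting code strings instead
of the words themselves. A rough Python sketch of the idea is below; the
class symbols and the character-to-class mapping are illustrative choices
only, not necessarily the exact coding used in our experiments.

    ASCENDERS  = set("bdfhklt")   # lower-case letters with ascenders
    DESCENDERS = set("gjpqy")     # lower-case letters with descenders

    def shape_code(word):
        """Map a word to its character shape code string."""
        codes = []
        for ch in word:
            if ch.isupper():
                codes.append("A")      # capital letter
            elif ch.isdigit():
                codes.append("0")      # digit
            elif ch in ASCENDERS:
                codes.append("b")      # lower-case with ascender
            elif ch in DESCENDERS:
                codes.append("g")      # lower-case with descender
            elif ch.isalpha():
                codes.append("x")      # plain x-height letter
            else:
                codes.append(ch)       # punctuation etc. left as-is
        return "".join(codes)

    # e.g. shape_code("French") -> "Axxxxb"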

I got one suggestion, and quite a few "please keep me informed ... I'm
looking for one too !" and even more "what is TREC and how can I get
that data".

The suggestion was:

! From: Achim Stein <achim@chianti.philosophie.uni-stuttgart.de>
!
! For French, you might want to try Helmut Schmid's TreeTagger together
! with my French morphology. Input is French text, output is text with
! POS-Tags and Lemmas.
!
! Retrieval + more information from:
!
! http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html
!
! Scripts for tokenization designed for the Tagger are also
! available. If you're interested in this system, contact me.
!
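
I have not tried this combination myself yet, but from Achim's description
the tagger's lemma output could serve directly as a stem for indexing. A
rough Python sketch is below; it assumes a command-line wrapper (here
called "tree-tagger-french") that reads raw text on standard input and
writes one "token<TAB>POS<TAB>lemma" line per token, with unknown lemmas
marked "<unknown>"; see the URL above for the actual invocation and
output format.

    import subprocess

    def lemmatise(text, command="tree-tagger-french"):
        # Run the external tagger; assumed to read raw text on stdin and
        # to emit one "token<TAB>POS<TAB>lemma" line per token.
        result = subprocess.run([command], input=text, capture_output=True,
                                text=True, check=True)
        lemmas = []
        for line in result.stdout.splitlines():
            fields = line.split("\t")
            if len(fields) != 3:
                continue
            token, pos, lemma = fields
            # Fall back to the surface form when the lemma is unknown
            # (assumed here to be marked "<unknown>").
            lemmas.append(token if lemma == "<unknown>" else lemma)
        return lemmas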

For those who want to know about TREC I refer you to
http://potomac.ncsl.nist.gov/TREC/

The call for participation in TREC-6 (during 1997) is attached (apologies
to those of you who have seen this already, but many have not and it is
informative).

- Alan Smeaton
- Dublin City University

CALL FOR PARTICIPATION

TEXT RETRIEVAL CONFERENCE

January 1997 - November 1997


Conducted by:
National Institute of Standards and Technology
(NIST)

Sponsored by:
Defense Advanced Research Projects Agency
Software and Intelligent Systems Technology Office
(DARPA/SISTO)

The Text Retrieval Conference (TREC) workshop series encourages research
in information retrieval from large text applications by providing a
large test collection, uniform scoring procedures, and a forum for
organizations interested in comparing their results. Now in its sixth
year, the conference has become the major experimental effort in the field.
Participants in the first five TREC conferences have examined a wide variety
of retrieval techniques, including methods using automatic thesauri,
sophisticated term weighting, natural language techniques, relevance feedback,
and advanced pattern matching. You are invited to submit a proposal for
participation in TREC-6.

TREC has two main tasks, ad hoc and routing retrieval. The ad hoc task
investigates the performance of systems that search a static set of documents
using new user need statements ("topics"); the routing task investigates
the performance of systems that use standing queries to search new streams
of documents. In addition, TREC has smaller "tracks" that allow participants
to focus on particular subproblems of the retrieval task. Participants will
be expected to work with approximately a million documents (2 gigabytes of
data), retrieving ranked lists of documents in response to the topics.
NIST will distribute the data and will collect and analyze the results.

Dissemination of TREC work and results other than in the (publicly
available) conference proceedings is welcomed, but the conditions of
participation preclude specific advertising claims based on TREC results.
As before, the workshop in November will be open only to participating
groups that submit results and to government sponsors.

Schedule:
Jan. 6, 1997 -- deadline for participation applications
February 1 -- acceptances announced, and permission forms for data
distributed to new participants. The training documents come as
4 CD-ROMS containing about 4 gigabytes of data. In addition, 300
training topics and relevance judgments are available via a
(protected) ftp site.
April 1 -- NIST target date for availability of new documents to be
used for the ad hoc task
May 1 -- list of routing topics distributed
June 1 -- routing queries due at NIST; test data available for routing
distributed to groups after routing queries received by NIST
June 1 -- 50 new test topics for ad hoc test distributed
August 15 -- results from routing and ad hoc tasks due at NIST
September 1 -- results from the tracks due at NIST
October 1 -- main task relevance judgments and individual evaluation
scores due back to participants
Nov. 19-21 -- TREC-6 conference at NIST in Gaithersburg, Md.

Task Description:
Below is a brief summary of the tasks. For more details, and for
samples of the topics and documents, see the online version of the TREC-4
proceedings (http://potomac.ncsl.nist.gov/trec).

Main tasks (ad hoc and routing)
Participants will receive 4 gigabytes of data for use in training their
systems, including development of appropriate algorithms or knowledge bases.
The 300 topics used in the first five TREC workshops and the relevance
judgments for those topics will also be available via ftp. The topics are in
the form of a formatted user need statement. Queries can either be constructed
automatically from this topic description, or can be manually constructed.

Two types of retrieval operations will be tested: a routing operation against
new data, and an ad hoc query operation against archival data. Fifty of the
topics (selected from the 300 topics distributed for training) will be used
in the routing task to create formalized queries to be used for retrieval
against new test data. Fifty new test topics (301-350) will be used as ad hoc
queries against TREC disk 4 and a new disk to be distributed in April.

Results from both types of queries (routing and ad hoc) will be submitted
to NIST as the ranked top 1000 documents retrieved for each query. Scoring
techniques including traditional recall/precision measures will be run for
all systems and individual results will be returned to each participant.
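
To make the scoring concrete, the sketch below (Python) shows the kind of
recall/precision computation that can be applied to one topic's ranked
list, assuming the relevance judgments are available as a set of document
identifiers; the official evaluation uses NIST's own software and reports
several cutoffs and averaged measures.

    def precision_recall(ranked_docs, relevant, cutoff=1000):
        # ranked_docs: document IDs returned for one topic, best first
        #              (at most the top 1000 are submitted to NIST).
        # relevant:    set of document IDs judged relevant for the topic.
        retrieved = ranked_docs[:cutoff]
        hits = sum(1 for doc in retrieved if doc in relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Precision is typically reported at several cutoffs (e.g. after 10,
    # 100, or 1000 documents retrieved) and at standard recall points.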

Track tasks
The goal of the tracks is to investigate areas tangential to the main tasks,
or to investigate areas that are more focussed than the main tasks. A very
brief summary of each of the tracks to be run in TREC-6 is given below.
The exact definitions of the tracks in TREC-6 are still being worked out by
interested participants, and details of each track should be obtained from the
designated contact person.

Chinese -- An ad hoc task where topics and documents are in Chinese.
Twenty-five topics with relevance assessments are available from
TREC-5. The TREC-6 track will use 25 new topics and the same
document set (articles taken from the Xin Hua newswire and
the People's Daily newspaper, about 250 megabytes of text).
Contact person: Ross Wilkinson (ross@cs.rmit.edu.au)

Cross-Language "Pre-Track" -- An ad hoc task in which some documents
are in English, some in German, and others in French. Topics will
each be in all three languages. The focus of the track will be to
retrieve documents that pertain to the topic regardless of language.
Contact person: Peter Schauble (schauble@inf.ethz.ch)

Filtering -- A task similar to the routing task but one in which the system
must make a binary decision as to whether the current document
should be retrieved (as opposed to forming a ranked list).
Contact person: David Hull (hull@grenoble.rxrc.xerox.com)

High Precision User Track -- An ad hoc task in which participants are given
five minutes per topic to produce a retrieved set using any means
desired (e.g., through user interaction, completely automatically).
Contact person: Chris Buckley (chrisb@sabir.com)

Interactive -- A task used to study user interaction with text retrieval
systems. The design of this track will be very similar to
the design proposed for TREC-5.
Contact person: Steve Robertson (ser@is.city.ac.uk)

NLP -- An ad hoc task that investigates the contribution natural language
processing techniques can make to IR systems. For TREC-6, this
track is likely to be influenced by work in TIPSTER phase III.
Contact person: Tomek Strzalkowski (tomek@thuban.crd.ge.com)

Speech Retrieval "Pre-Track" -- The initial offering of a track that will
investigate retrieving spoken documents. This track is being
offered with the support of the speech group at NIST. Speech groups
interested in producing transcripts of news broadcasts,
retrieval groups interested in retrieving specific documents
from the transcripts so produced, and combination groups that
will retrieve documents from the speech itself are all encouraged
to participate.
Contact person: Ellen Voorhees (ellen.voorhees@nist.gov)

Very Large Corpus (VLC) -- An ad hoc task that investigates the ability of
retrieval systems to handle larger amounts of data. Target
corpus size is approximately 20 gigabytes.
Contact person: David Hawking (dave@cs.anu.edu.au).

Groups may participate in any, all, or none of the tracks. Groups are very strongly
encouraged to participate in the main tasks, particularly those that serve
as baselines for the tracks they participate in.

Conference Format:
The conference itself will be used as a forum both for presentation of
results (including failure analyses and system comparisons), and for more
lengthy system presentations describing retrieval techniques used,
experiments run using the data, and other issues of interest to researchers
in information retrieval. As there is a limited amount of time for these
presentations, the program committee will determine which groups are asked to
speak and which groups will present in a poster session. Additionally, some
organizations may not wish to describe their proprietary algorithms, and
these groups may choose to participate in a different manner (see Category C).
To allow a maximum number of participants, the following three categories
have been established.

Category A: Full participation
Participants will be expected to work with the full data set, and to present
full details of system algorithms and various experiments run using the data,
either in a talk or in a poster session.

Category B: Exploratory groups
Because small groups with novel retrieval techniques might like to
participate but may have limited research resources, a category has been set
up to work with only a subset of the data. Category B participants
may work with any amount of the training data they choose, and will
test their systems using the 1/2 gigabyte of Financial Times documents
on disk 4 (and all test topics). Participants in this category will be
expected to follow the same schedule as category A, except with less data.
New participants are encouraged to work in category B unless they have
experience with larger data sets.

Category C: Evaluation only
Participants in this category will be expected to work on the full data set,
submit results for common scoring and tabulation, and present their results in
a poster session. They will not be expected to describe their systems in
detail but will be expected to report on time and effort statistics.

Data (Test Collection):
The training collection (documents, topics, and relevance judgments) is an
extension of the collection (English only) used for the DARPA TIPSTER project.
Parts of the training collection were assembled from Linguistic Data Consortium
text, and a signed User Agreement will be required from all participants.
The documents are an assorted collection of newspapers (including the Wall
Street Journal), newswires, journals, technical abstracts and email newsgroups.
A separate Agreement is needed for the data assembled for TREC-5 (disk 4) and
the new data to be distributed in April. All documents will be typical of
those seen in a real-world situation (i.e. there will not be arcane vocabulary,
but there may be missing pieces of text or typographical errors). The relevance
judgments against which each system's output will be scored will be made by
experienced relevance assessors based on the output of all TREC participants
using a pooled relevance methodology.
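
As a rough illustration of the pooling step: for each topic the top-ranked
documents from every submitted run are merged into a single pool of unique
documents, and only the pooled documents are shown to the assessors. A
minimal Python sketch follows; the pool depth of 100 is an illustrative
value, not necessarily the depth NIST will use.

    def build_pool(runs, depth=100):
        # runs: for a single topic, one ranked list of document IDs per
        #       submitted run (best document first).
        pool = set()
        for ranked_docs in runs:
            pool.update(ranked_docs[:depth])
        return pool    # unique documents to be judged by the assessors

    # Documents that no run ranks near the top fall outside the pool and
    # are treated as not relevant when the runs are scored.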

Response format and submission details:
Organizations wishing to participate in TREC-6 should respond to this
call for participation by submitting a summary of their text retrieval
approach, not to exceed two pages in length. The summary should include the
strengths and significance of their approach to text retrieval, and highlight
differences between their approach and other retrieval approaches. Groups that
have participated in TREC-5 need to provide only two paragraphs, one
describing their methods in TREC-5 and a second describing their plans for
TREC-6.

In addition to the system summary, each organization should indicate in
which category they wish to participate (category A, B, or C). Groups new
to TREC should briefly describe their ability to handle this large amount
of data. Please also specify which main tasks and which tracks your group
plans to participate in, and the person to whom correspondence should be
directed. A full postal address, telephone number, and an email address
are needed. EMAIL IS THE ONLY METHOD OF COMMUNICATION in TREC. The proposal
should be in ASCII so that it can easily be distributed to the program
committee --- detailed diagrams are not necessary.

All responses should be submitted by Jan. 6, 1997 to Ellen Voorhees,
TREC project leader at
ellen.voorhees@nist.gov
Any questions about conference participation, response format, etc. should
be sent to the same address.

Selection of participants:
All participants must be able to demonstrate their ability to work with the
data collection (either the full collection or the subset). The program
committee will be looking for as wide a range of text retrieval approaches as
possible, and will select the best representatives of these approaches as
speakers at the conference.

Program Committee
Donna Harman, NIST, chair
Nick Belkin, Rutgers University
Chris Buckley, Sabir Research, Inc.
Jamie Callan, University of Massachusetts, Amherst
Susan Dumais, Bellcore
Darryl Howard, U.S. Department of Defense
David Hull, Rank Xerox Research Center
David Lewis, AT&T Research
John Prange, U.S. Department of Defense
Steve Robertson, City University, UK
Peter Schauble, Swiss Federal Institute of Technology
Alan Smeaton, Dublin City University, Ireland
Karen Sparck Jones, Cambridge University, UK
Richard Tong, Sageware, Inc.
Howard Turtle, West Publishing
Ellen Voorhees, NIST
Ross Wilkinson, MDS at RMIT