Corpora: CFP: FIRST AUTOMATIC TEXT SUMMARIZATION CONFERENCE (SUMMAC)

Leo Obrst (obrst@mitre.org)
Mon, 12 Jan 1998 15:09:28 -0500

CALL FOR PARTICIPATION

**** Short Suspense ****

FIRST AUTOMATIC TEXT SUMMARIZATION CONFERENCE (SUMMAC)

Sponsored by:
The TIPSTER Text Program of the
Defense Advanced Research Projects Agency
Information Technology Office
(DARPA/ITO)

The high level of interest in automatic text summarization is evident
in the proliferation of research and commercial product development.
This Summarization Conference, conducted under the auspices of the
DARPA TIPSTER Text Program, will provide an independent forum in which
to investigate the appropriateness of automatically generated summaries
to specific tasks using shared data and evaluation methods. The
tasks have been defined to model real-world activities and are not
geared towards any particular summarization technique or technology.

1. GENERAL INFORMATION

The goals of the first evaluation are to
- provide researchers, potential sponsors, and customers with a
quantitative means to assess the strengths and weaknesses of the
technologies,
- gain a better understanding of the issues involved in building and
evaluating summarization systems, and
- guide the direction of the research toward the requirements of
real-world tasks.

The tasks selected address the following types of summaries:

Task                     Summary type
---------------------------------------------------
categorization           Generic, indicative
adhoc                    Query-based, indicative
question-and-answer      Query-based, informative

For this evaluation, these summary types are defined as follows:
Generic summaries capture the main theme(s) in a document. Query-based
summaries capture a specific theme indicated by the query or topic of
interest. Indicative summaries provide some overview of the content of
the full text, but are not intended to replace it. Informative
summaries capture the relevant details of the full text and serve as an
adequate substitute.

Each organization may participate in the task(s) most appropriate for
its approach.

Schedule:
1/12/98 - Call for participation issued
1/15/98 - Training data available (upon receipt of statement of interest
and participation agreement) *
2/1/98 - Deadline for participation
2/10/98 - Test data available
2/16/98 - Summaries due back
5/4/98 - Results available, reported in conjunction with TIPSTER workshop,
to which all participants will be invited.

* Use of the LDC and TREC (TIPSTER) data requires signed license
agreements for both LDC and NIST. Details are given below.
Organizations interested in participating who have not already signed
such agreements should attend to this immediately.

The workshop will consist primarily of presentations and discussions
of innovative techniques, system design, and test results. Attendance
at the conference is limited to evaluation participants and to guests
invited by the DARPA TIPSTER Text Program. Any papers and test results
will be included in the TIPSTER workshop proceedings.

The evaluation will consist of three tasks: categorization, simulated
adhoc retrieval, and question-and-answer (Q&A).

The data for the evaluation will be generated from the TREC/TIPSTER
collections, disks 1-5, all sources and TREC topics. Disks 4 and 5 are
available from NIST upon signing a license agreement available from
http://www.trec.nist.gov. Disks 1-3 are available from the LDC with a
membership. (See http://www.ldc.upenn.edu/ldc/index.html.)

2. INDICATIVE SUMMARIES: CATEGORIZATION AND ADHOC TASKS

2.1 Categorization task:

The goal of the categorization task is to evaluate generic summaries to
determine if the key concept in a given document is captured in the
summary.

The categories will be divided into sets of topics, each set related at
a broad level, e.g. business and sports, with five topics in each set. Each
topic will have approximately 100 documents.

Only the set of documents to be summarized will be provided to the
evaluation participants; the topics will not be provided.
Summarization systems developed by the participants will automatically
build a generic summary of each document.

The system must treat each test text as an individual document,
isolated from the others in all respects; it may not bring to bear any
knowledge amassed from the test corpus during the evaluation. The
assessor will read a summary and categorize it into one of the five
related topic areas in a set, or as 'non-relevant', which can be
considered a sixth category.

2.2 Adhoc task:

The goal of the adhoc task is to evaluate user-directed summaries to
determine if each summary effectively captures the information sought
by the user, as stated in the query or topic that retrieved the full
text document.

There will be approximately 20 topics and 50 documents
for each topic. Both the topics and corresponding document sets will be
provided to the participants. Summarization systems developed by the
participants will automatically build a summary using the topic as the
indication of user interest.

The assessor will review a topic, then read each summary
and judge whether or not it is relevant to the topic at hand.

2.3 Format

All submissions will be ASCII text, following a specified DTD (see
http://www.tipster.org). The summary will include tags for the document, a
participant identifier, a document identifier, a title, a summary, and
a query number (adhoc task only). The data may be presented in a
format of the participant's choosing, within the range of readable
ASCII text. No additional formatting is allowed (e.g. highlighting,
bold-facing, underlining).
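
The exact markup is defined by the DTD posted at http://www.tipster.org;
the tag names used below (DOC, PARTICIPANT, DOCNO, TITLE, QNUM, SUMMARY)
are placeholders standing in for the fields listed above, not the actual
element names. A minimal Python sketch of wrapping one summary in such
tags:

    # Illustrative only: consult the DTD at http://www.tipster.org for
    # the real element names and structure.
    def format_submission(participant_id, doc_id, title, summary,
                          query_num=None):
        """Wrap one summary in SGML-style tags for the fields listed above."""
        lines = ["<DOC>",
                 f"<PARTICIPANT> {participant_id} </PARTICIPANT>",
                 f"<DOCNO> {doc_id} </DOCNO>",
                 f"<TITLE> {title} </TITLE>"]
        if query_num is not None:                 # adhoc task only
            lines.append(f"<QNUM> {query_num} </QNUM>")
        lines.append(f"<SUMMARY> {summary} </SUMMARY>")
        lines.append("</DOC>")
        return "\n".join(lines)

    print(format_submission("SITE01", "WSJ870101-0001", "Example title",
                            "Plain ASCII summary text.", query_num=251))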

2.4 Evaluation Criteria

The categorization and adhoc tasks focus on the acceptability of a
summary for a given task, under the assumption that there is no single
'correct' summary. The main purpose is to determine whether the
assessor would make the same decision with the summary as with the
full text, and how long that decision takes. For each task, we will
record the time required to make each decision and the decision
itself. Each assessor's decisions will then be compared to the TREC
decisions. Analysis of the results will include consideration of the
effect of summary length on the time taken to make the relevance
decision, as well as its effect on decision accuracy.

2.4.1 Quantitative measures:

Categorization/Relevance Decisions
Compare the precision and recall accuracy of the assessors on both the
summaries and the full text to the TREC assessors' judgments. Scores
will be reported in a variety of formats, including:

- straight recall/precision (R/P) measures for all summaries against
ground truth (resources permitting, summarization assessors will create
the ground truth; otherwise, TREC assessments will be used)
- R/P for fixed 10% summaries
- R/P for best summaries
- R/P for best summaries combined with compression ratio
- R/P for individual document sources
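
As an illustration of the straight R/P measures above, a minimal sketch
assuming binary relevant/non-relevant decisions keyed by document
identifier (the function and variable names are ours, not part of any
evaluation software):

    def precision_recall(decisions, ground_truth):
        """Compare an assessor's relevance decisions to the ground truth.

        Both arguments map document id -> True (relevant) or False
        (non-relevant).  Returns (precision, recall)."""
        judged_rel = {d for d, rel in decisions.items() if rel}
        truly_rel  = {d for d, rel in ground_truth.items() if rel}
        hits = len(judged_rel & truly_rel)
        precision = hits / len(judged_rel) if judged_rel else 0.0
        recall    = hits / len(truly_rel) if truly_rel else 0.0
        return precision, recall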

Time Required
The time required to make a relevance or categorization decision using
a summary will be recorded and compared with the time required to make
the same decision for that document using the full text.

Summary Length
Each participant may submit up to two summaries for each document.

1) A summary with a maximum cutoff length of 10% of the original
document length. Any summary exceeding that limit will be truncated.
The 10% limit is based on the number of characters between the <TEXT>
and </TEXT> tags, excluding whitespace. If the limit falls in the
middle of a word, completion to the end of the word is permitted. A
length program will be made available (late January) for this purpose;
a sketch of the computation follows this list.

2) The 'best' summary of the document, as determined by each
participant's system.
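
The official length program will be supplied by the organizers; the
sketch below only illustrates the rule as stated in 1), and the way a
summary's own length is counted here (plain character count, finishing
the word in progress) is an assumption:

    import re

    def summary_length_limit(full_doc_sgml, fraction=0.10):
        """10% of the non-whitespace characters between <TEXT> and </TEXT>."""
        match = re.search(r"<TEXT>(.*?)</TEXT>", full_doc_sgml, re.DOTALL)
        body = match.group(1) if match else full_doc_sgml
        non_whitespace = len(re.sub(r"\s", "", body))
        return int(non_whitespace * fraction)

    def truncate_summary(summary, limit):
        """Cut at the limit, but allow the word in progress to be completed."""
        if len(summary) <= limit:
            return summary
        end = limit
        while end < len(summary) and not summary[end].isspace():
            end += 1
        return summary[:end]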

2.4.2 Qualitative measures:

User Preference
Assessors will be asked to evaluate each document for desired length
(shorter, just right, or longer), readability (poor, acceptable,
excellent), and certainty of decision (uncertain, fairly certain, very
certain). The information will be compared to the precision and recall
for the document sets.

3. INFORMATIVE SUMMARIES: Q&A TASK

The Question and Answer (Q&A) test is task-oriented, in that the summaries
are imagined to represent the intermediate stage of report writing that an
analyst would go through. The Q&A test does not provide an assessment of
ultimate utility for the report-writer, but rather an assessment of
potential utility at an intermediate step in the report-writing process.

The test is designed to assess the quality of summaries on the basis of the
number of correct answers they provide to a set of questions that reflect
the obligatory aspects of a topic. (The obligatory aspects are those that
must be satisfied for a document to be judged relevant to the topic.) The
same set of questions is used for all documents; the questions do not vary
from one document to the next. The challenge for the systems is to
understand the topic in relation to each document and to produce as
short a summary as possible that covers all obligatory aspects of the
topic.

The results from this evaluation task will be identified as
experimental, since the task is in the early design stages.

3.1 Corpus: topics, documents, and questions

A modified version of the data used in a pilot study will serve as the
training corpus.

The topics and documents in the test corpus will be taken from the adhoc
task corpus, described above. There will be three topics. Participants
will submit summaries for all documents in each topic set. Only the
topic-relevant documents will be judged.

A set of questions will be prepared for each topic. The questions will
pertain to obligatory aspects of the topic; there will be approximately
five questions per topic.

No information concerning the test corpus (topics, questions,
documents) will be divulged in advance of 2/10/98, when the test data
will be made available. The test data will include the selected topics
from the adhoc task and the corresponding sets of documents.

3.2 Evaluation criteria and scoring

Assessment is based on alignment of text strings in a summary with strings
in an answer key, which consists of passages in the full text that have
been identified as providing correct answers to the questions. The same
person who wrote the set of questions for a topic will create the answer
key for that topic. (The training corpus includes an answer key.)

Assessment of the informativeness of the summaries will be based on the
correspondence between the sentences of each summary and the sentences of
the full text that were identified as providing answers to the questions.
The quantitative measure to be used is termed Answer Recall, which is based
on the scoring categories of Correct, Partially Correct, and Missing.
Answer Recall scores will be computed across the document set for each
question separately as well as for all questions together.
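
This call does not spell out how the three scoring categories are
weighted; the sketch below assumes full credit for Correct, half credit
for Partially Correct, and none for Missing, which is only one plausible
reading:

    def answer_recall(scores, partial_credit=0.5):
        """Answer Recall for one summary.

        scores: one entry per question, each 'correct', 'partial', or
        'missing'.  The 0.5 weight for partially correct answers is an
        assumption, not taken from the call."""
        credit = {"correct": 1.0, "partial": partial_credit, "missing": 0.0}
        return sum(credit[s] for s in scores) / len(scores) if scores else 0.0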

As a means of determining the tradeoff between summary content and
summary length, overall Answer Recall will be compared to the average
compression of the response summaries. To serve as a target measure of
the degree of compression achievable on the test, evaluators will
prepare for each document a model summary that yields 100% recall. The
model summaries will consist of full sentences, and will include the
minimum number of sentences that are required to answer all questions,
plus any sentences that the evaluators feel are necessary to make the
question-answering content coherent. (The training corpus includes
model summaries, prepared by one of the evaluators.)
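
The call likewise does not pin down how compression is computed; the
sketch below takes it as the ratio of summary length to full-text
length, averaged over the response summaries:

    def average_compression(pairs):
        """pairs: (summary_text, full_document_text) tuples.

        Compression is taken here as summary length over full-text
        length; the exact definition is not specified in the call."""
        ratios = [len(s) / len(d) for s, d in pairs if len(d) > 0]
        return sum(ratios) / len(ratios) if ratios else 0.0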

Alignment and scoring will be carried out manually. In scoring, it will be
important to maintain consistency of scoring criteria across summaries and
systems. (It is anticipated that the same person who prepared the data for
a given topic will do the scoring for that topic. Thus, it is expected
that there will be three different people involved in scoring on the Q&A
task. A small inter-evaluator test is planned that will gauge the
difference in scoring stringency across topics.)

3.3 Submission of summaries

Participating sites are limited to submitting one set of summaries per
topic. There is no restriction on the length of the summaries; however, it
is anticipated that sites will want to aim for an average length that does
not exceed 30% of the length of the full documents.

Summaries do not have to consist of literal sentence extracts from the full
documents; allowances in alignment and scoring will be made for summaries
that diverge from the literal input.

Summaries are to be submitted in the form of SGML documents, in accordance
with instructions outlined in 2.3, above.

4. INSTRUCTIONS FOR RESPONDING TO THE CALL FOR PARTICIPATION

Organizations within and outside the U.S. are invited to respond to
this call for participation. By the time of the actual testing phase
of the evaluation, systems must be able to accept texts without manual
preprocessing, process them without human intervention, and output
summaries in the expected format.

Due to the short suspense for this CFP, it is expected that
organizations that have already developed summarization systems will
participate in this evaluation, which is still in the initial
stages.

Organizations wishing to participate in the evaluation and workshop
must respond by 1 February 1998 by submitting a short statement of
interest via email to tfirmin@romulus.ncsc.mil and a signed copy of
the participation agreement via fax or surface mail.

The statement of interest should include the following:

a. a brief description of your approach to summarization
b. the summarization task(s) in which you will participate:
   adhoc (10% submission required, 'best' optional)
   categorization (10% submission required, 'best' optional)
   question-and-answer (1 'best' submission allowed)
c. the primary point of contact, including name, surface and e-mail
   addresses, and phone and fax numbers
d. whether your site has access to all TREC data (disks 1-5)

The participation agreement can be accessed from the TIPSTER web page
(http://www.tipster.org). A signed copy should be sent by fax to
Therese Firmin, 301-688-9070, or by surface mail to Therese Firmin,
Dept of Defense, 9800 Savage Rd, Ft. Meade MD 20755-6000.

If interest in participating in the evaluation is higher than
anticipated, the number of participants will be limited based on the
information provided in the statement of interest.

If you have general questions or questions concerning the
categorization or adhoc task, please address them to
tfirmin@romulus.ncsc.mil (with cc to mjchrza@romulus.ncsc.mil).
Questions concerning the Q&A task may be addressed to sundheim@nosc.mil
(with cc to obrst@mitre.org).

SUMMARIZATION COMMITTEE
Michael Chrzanowski, Department of Defense
Therese Firmin, Department of Defense
Lynette Hirschman, MITRE
David House, MITRE
Inderjeet Mani, MITRE
Leo Obrst, MITRE
Sara Shelton, Department of Defense
Beth Sundheim, SSC
Sandra Wagner, Department of Defense