Corpora: ARCADE: some answers

Jean Veronis (Jean.Veronis@lpl.univ-aix.fr)
Sun, 15 Mar 1998 16:24:08 +0100

Thanks to Dan Melamed for these very relevant and very stimulating
questions. I will try to answer some, but many points are still open for
discussion. Actually, we would like to adopt the same model as in the first
phase of ARCADE, i.e., that the methodology, protocol, metrics, etc. are
discussed and agreed among participants. We will build a discussion list
for (tentative) participants, on which this interesting discussion can be
continued.

>Since "participants cannot withdraw during the competition and accept
>the publication of the results," it is important to specify a priori
>exactly what the rules of the competition are.

This statement looks a bit scary. Its main purpose is to discourage
pseudo-participants who would register only to get the corpus and then
disappear. It is important that the participants' commitment be firm at the
time they receive the corpus. For the time being, all that is needed is a
statement of interest in order to start the discussion list. Above all, we
would like this competition to be relaxed and friendly, and scientifically
interesting, as the first one was!

>I am interested in
>participating, but I am concerned that vague rules will promote
>comparison of apples with oranges, and decrease the value of the
>exercise (which I think will be quite high otherwise).

I fully agree. However, as I said above, we would like the agreement to
arise from discussion among the participants themselves.

>Could you
>please clarify (at least) the following points?
>
>
>1. How automatic must participating systems be? Is a system allowed
>to ask a human for help on difficult cases?

I assume that there is no reason to exclude semi-automatic systems.
However, we will probably need to take this into account during the
evaluation, for the sake of fairness. It would be good if the participants
could describe the extent of the manual intervention, and perhaps provide
two sets of results, with and without human help.

>2. What resources, in addition to the test bitext, are systems allowed
>to exploit? Other corpora? Dictionaries? POS-taggers? Obviously,
>the more resources, the easier the task. It would not be fair to
>compare systems that use different resources.

All resources will be allowed, otherwise we would have too few systems of
each type. The systems are going to be tested as black boxes, and this will
enable an a posteriori comparison of the usefulness of the various types of
resources. Results are not going to be published in raw form, but along
with descriptions of the systems, comments, etc. We plan a discussion phase
in which the participants will be able to comment on, contest, or explain
the results.

>3. To what degree must the systems be language-independent? E.g. is
>it reasonable to rely on cognates? Is the point to find the best
>system for French/English or the best system for arbitrary language
>pairs? (If the answer is "both", then it might be best to formalize
>two separate tracks.)

In the phase planned until September, the only thing we can reasonably do
is to restrict the task, somewhat arbitrarily, to "French-English
alignment", regardless of the additional capabilities of systems. If some
systems are language-independent, this can be presented as a plus, which
may explain (possibly) more modest results on a given language pair. Of
course it would be interesting to open a many-language alignment track, but
we can probably leave it for a later round.

>4. What are the evaluation metrics? Only exact match with the gold
>standard? Or does a system get partial credit for being "close"?
>Either way, which objective functions are of interest --- precision,
>recall, Dice, F-measure, 11-point average precision, or what?

In the first phase (which was devoted to sentence alignment only), we used
precision/recall/F-measure. These measures were computed in four different
ways (see our report on the Web), and one of the interesting outcomes of
the competition was actually to compare and better understand the
properties and qualities of the various metrics. It is clear that at the
word level other metrics could be interesting, such as the Dice similarity
coefficient. I personally would have no objection to computing any metric
proposed by the participants (provided it is not too labour-intensive). In
any case, it is likely that a ranking of systems along different axes is
more interesting than a ranking based on a single measure. We may therefore
have several winners depending on the point of view.
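
To make the discussion concrete, here is a minimal sketch (in Python, and
by no means the official ARCADE scorer) of how such measures can be
computed when an alignment is represented as a set of links between source
and target segments; this representation is itself an assumption made only
for illustration.

# Minimal sketch, not the official ARCADE scorer: an alignment is
# represented here as a set of (source_id, target_id) links.

def precision_recall_f(system, reference):
    """Precision, recall and F-measure of system links w.r.t. the reference."""
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def dice(system, reference):
    """Dice similarity coefficient between the two link sets."""
    if not system and not reference:
        return 1.0
    return 2 * len(system & reference) / (len(system) + len(reference))

reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3")}
system = {("s1", "t1"), ("s2", "t3"), ("s3", "t3")}
print(precision_recall_f(system, reference))  # approx. (0.67, 0.67, 0.67)
print(dice(system, reference))                # approx. 0.67

(Note, in passing, that when computed this way over the same link sets the
F-measure and the Dice coefficient coincide.)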

>5. In the sentence category competition, are systems expected to
>recognize inversions or only monotonic alignments? Inversions are
>surprisingly frequent.

Some metrics are not sensitive to inversions, but some are; therefore,
systems that recognize inversions will get a better score with those
metrics.
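
As a small illustration (still using the link-set view assumed in the
sketch above, which is not necessarily the representation ARCADE will
adopt): a reference containing an inversion, scored against a purely
monotonic output.

# The reference swaps s2 and s3 in the target; the system output is monotonic.
reference = {("s1", "t1"), ("s2", "t3"), ("s3", "t2")}
monotonic = {("s1", "t1"), ("s2", "t2"), ("s3", "t3")}
correct = len(monotonic & reference)
print(correct / len(monotonic), correct / len(reference))  # approx. 0.33 0.33
# A metric over unordered link sets charges nothing extra for crossings, so a
# system that does output the crossed links recovers the full score; metrics
# that assume a monotonic segmentation treat such regions differently.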

>6. In the word category competition, what are "words?" Who decides
>the tokenization and on what basis? Note that it is not enough to say
>that we care about only the 60 selected French words --- it is also
>necessary to specify how English words should be tokenized and counted
>in the "correct" translations.

The alignment will be French to English, i.e., systems will have to spot
the translations of French words in the English text (we tried to simplify
the task as much as possible; a bidirectional evaluation could be
undertaken in the future, if this one is successful). The French words
which will be submitted to systems are not multi-token units (again, in
order to simplify the task). However, their translations can be multi-token.
This point is related to the one below.
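
Just to make the setting concrete, one could imagine an answer record along
the following lines; this is purely hypothetical (the field names and the
actual submission format remain to be agreed among the participants).

# Hypothetical answer record for one occurrence of a test word; the format
# and field names are invented for illustration only.
answer = {
    "french_word": "informatique",       # a single-token French test word
    "french_sentence_id": "f0231",       # occurrence being aligned
    "english_sentence_id": "e0229",      # sentence containing the translation
    "english_tokens": ["computer", "science"],  # translation may be multi-token
}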

>7. In the word category competition, on what basis will the "correct"
>translations be determined? E.g. what are the rules for matching one
>of the 60 selected words when it appears as part of an idiomatic
>expression, or when its translation is an idiom, or both?

The problem is very difficult. I am thinking, for example, of cases where a
word is not directly translated, but replaced by a different syntactic
construction. What we plan to do (but again, suggestions are welcome) is to
have the reference corpus aligned by humans, and rely on their judgement
for what is aligned to what. If we are able to obtain several human
alignments, this would give a good basis for comparison -- systems should
not be expected to agree more with the reference than the humans agree
among themselves! In any case, the reference will be distributed after the
competition, and during the discussion phase the participants will be able
to contest any dubious alignments.
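
A minimal sketch of this "humans as an upper bound" idea, assuming (purely
for illustration) that agreement is measured with a Dice coefficient over
word-alignment links; the actual agreement measure remains to be discussed.

from itertools import combinations

def dice(a, b):
    """Dice coefficient between two sets of word-alignment links."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Two toy human reference alignments and one system output for a sentence.
human_1 = {("le", "the"), ("marché", "market"), ("unique", "single")}
human_2 = {("le", "the"), ("marché", "market")}
system = {("marché", "market"), ("unique", "single")}

pairs = list(combinations([human_1, human_2], 2))
ceiling = sum(dice(a, b) for a, b in pairs) / len(pairs)       # inter-annotator agreement
score = sum(dice(system, h) for h in (human_1, human_2)) / 2   # system vs. humans
print(ceiling, score)  # a system should not be expected to exceed the ceiling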