Evaluating taggers

Steven Abney (abney@sfs.nphil.uni-tuebingen.de)
Tue, 18 Jul 95 13:19:26 +0200

John Nerbonne writes:
    The point is not that there are theory-neutral systems of bracketing,
    but rather that all systems agree on some of the bracketings in
    practically all cases, e.g., NP and PP constituents (also subordinate
    finite clauses). We clearly can evaluate grammars/parsers on their
    accuracy here.

I'm less sanguine about that. A few years ago there were a couple of
meetings at UPenn on grammar/parser evaluation that I took part in. A
proposed evaluation metric came out of those meetings, based on the
idea that most parsing schemes agree on at least the major phrases,
like NP and PP, and that you could avoid penalizing parsers for
inserting extra phrases or omitting minor ones by penalizing them only
for incompatible nodes (`crossing brackets') vis-a-vis a stripped-down
version of a `standard' parsed corpus like the Penn Treebank.

The hope was that by watering down the evaluation criteria enough, you
could get a fair comparison across grammatical formalisms, without
forcing everyone to translate their output to a single grammatical
scheme.

But a parser I was using at the time scored extremely poorly by this
metric, because it marked the non-recursive kernels of phrases
(`chunks'), but did not attempt attachment. As a result, there were
crossing brackets all over the place, e.g. compare:

chunks:   [*PP in [*NP a letter]] [PP to [NP Bill]]
standard: [PP in [NP a [NBAR letter [PP to [NP Bill]]]]]

in which the starred PP and NP cross with the NBAR in the standard,
and the starred PP crosses with the NP in the standard.
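
To be concrete about what gets counted: here is a rough sketch of the
crossing check, with constituents reduced to word spans. The
representation and helper names are mine, purely for illustration --
the actual proposal involved more machinery than this.

    # Two spans cross if they overlap but neither contains the other.
    # Spans are half-open (start, end) word indices.
    def crosses(a, b):
        (a1, a2), (b1, b2) = a, b
        return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

    def crossing_count(candidate, standard):
        # candidate constituents that cross some standard constituent
        return sum(any(crosses(c, s) for s in standard) for c in candidate)

    # "in a letter to Bill", words numbered 0..4
    chunks   = [(0, 3), (1, 3), (3, 5), (4, 5)]          # *PP, *NP, PP, NP
    standard = [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)]  # PP, NP, NBAR, PP, NP
    print(crossing_count(chunks, standard))              # -> 2

Two of the four chunks get flagged as crossing, even though the chunk
analysis isn't wrong about anything -- it just stops short of
attachment.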

The point is that, even though at some level the chunks and the
standard are saying the "same thing", they're not saying it in the same
way. The only alternative seems to be translating to a common scheme
-- but then we're not doing a scheme-free comparison.

Translating to a common scheme wouldn't be so bad if it weren't too
onerous. Unfortunately, I think the usefulness of the evaluation is
directly related to the painfulness of translating to the common
scheme. We could extract just the noun phrases out of the Penn
Treebank, or out of Susanne, and get precision and recall scores for
our parser relative to that standard. But: (1) that only tells us how
well we're doing at noun phrases, not the rest of the grammar, and (2)
there will be plenty of places where the parser disagrees with the
standard, not because the parser is wrong, but because it's doing
things differently from the way the standard does them.
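
The arithmetic itself is simple enough. Here is a minimal sketch of
that precision/recall computation, assuming the noun phrases have
already been pulled out of both the parser output and the standard as
word spans (the spans below are invented, just to show the counting):

    # Precision/recall of exactly matching NP spans against a standard.
    def precision_recall(parser_spans, standard_spans):
        parser, standard = set(parser_spans), set(standard_spans)
        hits = len(parser & standard)                   # exact-match spans
        precision = hits / len(parser) if parser else 0.0
        recall = hits / len(standard) if standard else 0.0
        return precision, recall

    parser_nps   = {(0, 2), (4, 6), (7, 9)}             # invented example
    standard_nps = {(0, 2), (4, 7), (7, 9)}
    print(precision_recall(parser_nps, standard_nps))   # (2/3, 2/3)

And every mismatch counts against the parser, whether the parser is
genuinely wrong or just bracketing differently -- which is exactly
problem (2).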

Here's a little experiment Don Hindle and I did. We wrote down a very
brief stylebook for a highly simplified parsing scheme. (We were just
marking the beginnings and heads of major phrases -- NP, PP, VP, S --
so as to completely avoid PP-attachment questions and the like.) Then
we each parsed a bit of the Brown corpus, and compared the results.
There were a bunch of disagreements, and in virtually every case, our
sense was that we agreed about what was really going on, but we had
chosen a different way of encoding it, since the stylebook hadn't
explicitly said how to encode it. To give a few concrete
examples: is 'according' in 'according to Republicans' a preposition
or a participle? Is 'only' in 'only a relative handful' inside or
outside the noun phrase? Is '3-C' the head of 'apartment 3-C', or is
'apartment' the head?

In each of these cases, Don and I had made different choices. We
weren't particularly committed to the choice we made, but we had to
make *some* choice. So we arbitrated our disagreements, added some
more paragraphs to the stylebook, and tried again. After several
iterations of this, we were pretty convinced that it wouldn't stop
until we had a stylebook the size of the Penn Treebank's or Susanne's.

In short: (1) scheme-free evaluation only means evaluation with
respect to a watered-down scheme, (2) the more watered-down a scheme,
the less powerful the evaluation, (3) evaluating a parser (or tagger
or whatever) against a scheme it wasn't explicitly designed for is
pretty useless unless you can quantify the error of evaluation due to
the scheme differences, and the error due to scheme differences is
almost surely not negligible.

Best regards,
Steve Abney