On the future of the Penn Treebank Project

Mitch Marcus (mitch@linc.cis.upenn.edu)
Mon, 26 Feb 1996 12:36:27 -0500 (EST)

The Penn Treebank project is now at a crossroads, and we greatly need
input and discussion from the empirical NLP community as to whether
the project should continue, and if so, in what directions.

Since its inception in 1989, the Treebank has produced 2 CD-ROMs of
annotated material, available through the Linguistic Data Consortium.
The first CD-ROM contained 2.8 million tokens hand-parsed in Treebank
I style (primarily Brown Corpus and Wall Street Journal), and 4.8
million tokens tagged for POS. The second CD-ROM contains 1.2
million tokens of Wall Street Journal, redone in Treebank II style (a
MUCH richer grammatical annotation, with much tighter quality control
- for those of you who've only seen the first CD-ROM), plus the
entire contents of the first CD-ROM.

We are now annotating natural phone conversations from the Switchboard
corpus, and have completed 1.6 million words annotated for part of
speech and for speech disfluencies, with 750K words annotated for
grammatical structure in the Treebank II style. It now appears that
we won't be doing much more annotation of this material, and will
release a CD-ROM by summer.

There are important indications that the project has now created
enough materials to keep the research community occupied for a while.
interest in commissioning the preparation of specific new materials
for the project to continue. From what I know, there appears to be
only limited need for new materials; it doesn't appear that this need
is sufficient to sustain the project.

SO, (a) IS there a need for more materials at this point?
(b) If so, what materials?
(c) If so, who do you propose should pay for them?

To make my position clear, the Treebank Project is here to support the
needs of the research community. If we've fulfilled the need for
materials of this kind (there's MUCH structure in the Treebank II
materials that no one has yet tried to exploit, for example), then
we've done our job and can go on to other things. On the other hand,
I'd be delighted if there are needs to be met.

Mitch Marcus