Corpora: Summary: acquisition of subcategorization frames

Philip Resnik (resnik@umiacs.umd.edu)
Tue, 27 Jan 1998 15:36:31 -0500 (EST)

In December I posted a query seeking pointers to work on acquisition
of subcategorization frames for languages other than English. I'd
like to thank the following people for their replies:

Sabine Buchholz (s.buchholz@kub.nl)
Glenn Carroll (glenn@IMS.Uni-Stuttgart.DE)
Eugene Charniak (ec@cs.brown.edu)
Oliver Christ (oli@trados.com)
Ann Copestake (aac@csli.Stanford.EDU)
Judith Eckle-Kohler (eckle@IMS.Uni-Stuttgart.DE)
Dimitrios Kokkinakis (svedk@svenska.gu.se)
Takehito Utsuro (utsuro@is.aist-nara.ac.jp)
Diana van der Ende (dvdende@worldonline.nl)
Eric Ringger (ringger@cs.rochester.edu)

A summary follows.

Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone: (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax: (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu

----------------------------------------------------------------

- Work by Judith Eckle-Kohler <eckle@IMS.Uni-Stuttgart.DE, also
eckle@csli.stanford.edu> on acquiring subcat information for German
from corpora. Approach: a regular-expression based partial parser to
search for cues that (almost) unambiguously signal certain frames.
Uses IMS Corpus Query Toolbox and predefined search patterns.

Judith Eckle, Ulrich Heid: ``Extracting raw material for a German
subcategorization lexicon from newspaper text''. In: Proceedings of
the 4th International Conference on Computational Lexicography,
COMPLEX '96. Budapest, 1996.

Paper available at: http://www2.ims.uni-stuttgart.de/~eckle/

Dr. Eckle-Kohler writes: "I haven't used any statistical filters;
instead, I have checked the extraction results manually. The main
reason for that was to get exact precision measures of the extraction
procedures: the evaluation against MRDs is mostly not possible because
of their incompletenesses. I have acquired subcat-frames for verbs,
nouns and adjectives; the verb-lexicon extracted from corpora
currently contains 6028 readings (i.e. frames) for 3727
verbs. Currently my extraction procedures for verbs cover 104
subcat-frames. The two papers which are available on my Home-Page
(the latest one being just a draft-version, but showing the
acquisition procedures more clearly, I think) might give you an idea
of the approach I chose."
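
To make the cue-based idea concrete, here is a minimal sketch in Python.
It is not Dr. Eckle-Kohler's implementation and does not use the IMS
Corpus Query Toolbox: the input format (word/TAG tokens with STTS-style
tag names), the two patterns, and the frame labels are all assumptions
made up for illustration. The point is only that a regular expression
over tagged text can pick out contexts that signal a frame almost
unambiguously, and that the hits can then be counted per verb.

```python
import re
from collections import Counter

# Hypothetical cue patterns over clauses written as "word/TAG word/TAG ...".
# Tag names (VVFIN, KOUS, PRF, "$,") follow the STTS style for illustration;
# the patterns and frame labels are invented, not taken from the cited work.
CUES = [
    # finite verb, comma, "dass": cue for a frame with a dass-clause
    ("dass-clause", re.compile(r"(\S+)/VVFIN ,/\$, dass/KOUS")),
    # finite verb directly followed by "sich": cue for a reflexive frame
    ("reflexive", re.compile(r"(\S+)/VVFIN sich/PRF")),
]


def extract_frames(tagged_clauses):
    """Count (verb form, frame) pairs signalled by the cue patterns.

    Verb forms are only lowercased, not lemmatized, to keep the sketch short.
    """
    counts = Counter()
    for clause in tagged_clauses:
        for frame, pattern in CUES:
            for match in pattern.finditer(clause):
                counts[(match.group(1).lower(), frame)] += 1
    return counts


if __name__ == "__main__":
    clauses = [
        "Er/PPER behauptet/VVFIN ,/$, dass/KOUS sie/PPER kommt/VVFIN",
        "Sie/PPER freut/VVFIN sich/PRF",
    ]
    for (verb, frame), n in sorted(extract_frames(clauses).items()):
        print(verb, frame, n)
```

In keeping with the manual checking described above, the output of such
patterns would be treated as raw material to be inspected by hand rather
than as a finished lexicon.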

- Work by Glenn Carroll, Mats Rooth, Marc Light, et al. on automatic frame
acquisition for German. Approach is acquisition of a stochastic
subcategorization lexicon while doing parameter estimation
for a stochastic lexicalized CFG.

- Draft paper by Takehito Utsuro: "Linguistic knowledge extraction",
in "A Handbook of Natural Language Processing"
(http://www.mri.mq.edu.au/nlu/nlphandbook/index.html), with a draft of
a section on "Extraction of Subcategorization Frames". Includes
pointers to works on the acquisition of subcategorization frames for
English, Japanese, and German. The papers referred to for Japanese are at
http://cactus.aist-nara.ac.jp/staff/utsuro/publication-e.html.

- Dimitrios Kokkinakis is writing a PhD thesis concentrating on
acquisition of subcategorization frames, working with Swedish
material. His paper, "Corpus-Based Argument Identification using a
Statistically Enriched MRD", appeared at "a workshop organized in
Toulouse in Aug 96" for which apparently Kluwer will publish
proceedings in spring 98. ("Statistically enriched" refers to mutual
information.) He writes that he "can now, for instance, find
automatically over 95% of continuous and discont. phrasal verbs in
texts, a phenomenon that is very frequent in Swedish and has been a
bottleneck for further processing of the texts."
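
For readers unfamiliar with the statistic, the following Python sketch
computes pointwise mutual information for (verb, particle) pairs. It is
an illustration only: the summary above does not say how mutual
information enters Kokkinakis's system, so the choice of verb-particle
pairs as the unit, the function name pmi_scores, and the toy data are
all assumptions.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Pointwise mutual information, in bits, for each observed pair.

    pairs: iterable of (verb, particle) co-occurrence observations.
    PMI(v, p) = log2( P(v, p) / (P(v) * P(p)) ), with all probabilities
    estimated as relative frequencies over the observations.
    """
    pairs = list(pairs)
    n = len(pairs)
    pair_counts = Counter(pairs)
    verb_counts = Counter(v for v, _ in pairs)
    particle_counts = Counter(p for _, p in pairs)
    return {
        (v, p): math.log2(c * n / (verb_counts[v] * particle_counts[p]))
        for (v, p), c in pair_counts.items()
    }


if __name__ == "__main__":
    # Invented toy observations, not corpus data.
    observations = [("v1", "p1"), ("v1", "p1"), ("v2", "p2"), ("v2", "p2")]
    for pair, score in sorted(pmi_scores(observations).items()):
        print(pair, round(score, 3))
```

A pair that co-occurs more often than its parts would by chance gets a
positive score; a score near zero suggests the two are independent.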

- Diana van der Ende writes that the Celex database in Nijmegen, The
Netherlands, includes at least a database of subcategorisation frames
for Dutch. German is currently being worked on, and they may already
have subcategorisation frames for German. For relevant information,
see: http://www.kun.nl/celex/.

- Erika F. de Lima, "Acquiring German Prepositional Subcategorization
Frames from Corpora" appeared at the 5th Workshop on Very Large
Corpora (WVLC-5) in Beijing (http://www.lexis-nexis.com/WVLC-5/final.html).

- Sabine Buchholz adds pointers to: the SPARKLE project (for English,
French, German and Italian) <http://www.ilc.pi.cnr.it/sparkle/sparkle.html>;
papers by Thorsten Brants using a statistical tagger, including
"Tagging Grammatical Functions" (EMNLP-97) and others
<http://www.coli.uni-sb.de/~thorsten/>; the B7 Project, "Partial
Parsing and the Acquisition of Lexical Syntax and Semantics" (subcat
frames and selectional restrictions for English and German)
<http://www.sfs.nphil.uni-tuebingen.de/~abney/b7home.html>; Salah
Aït-Mokhtar & Jean-Pierre Chanod, "Subject and Object Dependency
Extraction Using Finite-State Transducers," ACL'97 Workshop on
Information Extraction and the Building of Lexical Semantic Resources
for NLP Applications, Madrid, July 7th-12th 1997
<http://www.rxrc.xerox.com/publis/mltt/mlttart.html> (for French);
work on Portuguese by Nuno Miguel Cavalheiro Marques
<http://www-ia.di.fct.unl.pt/~nmm/>; and her own work on English,
doing subcat acquisition by means of machine learning.