Discourse analysis from corpora

Chris Brew (chrisbr@cogsci.ed.ac.uk)
Tue, 18 Apr 95 17:16:30 +0100

> Hi everyone,
> My friend and I are new to corpora research and have to do a paper on
> this. The question: we were wondering whether there is anything done on
> spoken/written discourse analysis based on an existing (public) corpora.
> So far, we've found none.
> Could somebody enlighten us, please? Thank you.
>
> Best,
> Naomi Fujita & Apisak Pupipat
> Doctoral students in applied linguistics
> Teachers College, Columbia University, NYC

Reply forwarded from Jean Carletta (Jean.Carletta@edinburgh.ac.uk)

Currently many discourse researchers are starting to look
at applying analyses to corpora, but to my knowledge there
is little work published so far because this style of work
is so new and because it is also very time-consuming. The very
best source of information is the working notes from the AAAI Spring
Symposium on Empirical Methods in Discourse Interpretation
and Generation; it was held in March 1995 at Stanford University,
and it was organised by Lyn Walker (walker@merl.com) and Joanna
Moore (jmoore@cs.pitt.edu). Working note distribution itself
is limited to participants, but I believe that there should be
a summary coming out shortly as an AAAI tech report. They intend
for a list of available corpora to be one part of the report, so
if you are asking in the hopes of finding one to work on, this
may be your best resource. The corpora which are available tend
to be distributed by the Linguistic Data Consortium
(ldc@unagi.cis.upenn.edu, or better yet, the catalogue
is on ftp://ftp.cis.upenn.edu/pub/ldc).

Given that, I'll briefly describe what work I know on dialogue
structure and monologue or text structure. I don't know if dialogue
counts as discourse in your definition, but it does in many and this
is where much of the effort has been concentrated. I assume in this
that you are familiar with basic dialogue and discourse analysis but
haven't seen it in large-scale use on corpora.

The Human Communication Research Centre (Universities of Edinburgh and
Glasgow) has the best developed ways of marking up dialogue structure
on a corpus of human-human task-oriented dialogue. They have
developed three levels of analysis: transactions (subdialogues
corresponding to the steps in the participants' plan for completing
the task), conversational games (a hierarchical representation of
discourse goals, much like adjacency pairs or dialogue games), and
conversational moves (individual initiations and responses which make
up the game structure). Their analysis is described in HCRC tech
reports 31 and 65, available from
http://www.cogsci.ed.ac.uk/~ftp/pub/HCRC-papers/index.html
or by anonymous ftp from
scott.cogsci.ed.ac.uk[129.215.144.3]:pub/HCRC-papers
The corpus which they use (described in Anderson et al., ref below)
is available on CD-ROM, although the dialogue structure markup
is not currently publicly available. A much better description of
the coding distinctions is currently in preparation and should
be available by late May; if you're interested I can put you on the
distribution list.
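
To make the three levels concrete, here is a minimal sketch of how
transactions, games and moves might nest in a program. It is only an
illustration under my own assumptions, not the HCRC markup format: the
class names, the two move labels, and the example utterances are all
invented, and the real coding scheme uses finer distinctions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Move:
    """One initiation or response by a single speaker."""
    speaker: str
    kind: str  # "initiation" or "response" here; hypothetical, coarse labels
    text: str


@dataclass
class Game:
    """Moves serving one discourse goal, possibly with embedded games."""
    goal: str
    moves: List[Move] = field(default_factory=list)
    subgames: List["Game"] = field(default_factory=list)

    def all_moves(self) -> List[Move]:
        # This game's own moves first, then those of embedded games.
        # Note this is structural order, not the order of speaking.
        found = list(self.moves)
        for sub in self.subgames:
            found.extend(sub.all_moves())
        return found


@dataclass
class Transaction:
    """A subdialogue corresponding to one step of the task plan."""
    step: str
    games: List[Game] = field(default_factory=list)

    def all_moves(self) -> List[Move]:
        return [m for g in self.games for m in g.all_moves()]


# Invented example: one transaction, one game, one embedded game.
clarify = Game(
    "check the landmark",
    [Move("B", "initiation", "The old mill?"),
     Move("A", "response", "Yes, the old mill.")],
)
route = Game(
    "describe the first leg",
    [Move("A", "initiation", "Go left past the mill."),
     Move("B", "response", "Okay.")],
    [clarify],
)
transaction = Transaction("get from the start to the mill", [route])
```

Calling transaction.all_moves() walks the whole hierarchy, which is the
kind of traversal needed to count or compare moves across dialogues.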

There are several other groups also marking dialogue structure at the
moment. One attempt is described briefly in Alexandersson et al., with
more details in a German tech report referred to in the paper. Other
attempts involve Jan Wiebe (wiebe@cs.nmsu.edu; just starting to design
a coding), Sherri Condon (slc6859@usl.edu; coding description
available from her), Nils Dahlback (nilda@ida.liu.se), and David Traum
(traum@cs.rochester.edu; coding the TRAINS corpus of dialogues). I
am not sure which of these corpora are publicly available so far.

I think that work on marking monologue/text structure is less
established. Megan Moser and Joanna Moore (moser@isp.pitt.edu) are
beginning to mark up a structure on explanations which combines
elements of Grosz and Sidner's discourse theory and rhetorical
structure theory (Mann and Thompson). Elizabeth Liddy (liddy@syr.edu)
has been marking up the Wall Street Journal corpus (available from
LDC, I think). There is also recent discussion about whether or not
it is possible to segment discourse replicably according to various
theories (Passonneau and Litman, Greene and Cappella; also unpublished
work by Lisa Stifelman, lisa@media.mit.edu). I suspect that if there
hasn't been more work on actually marking discourse structure on
corpora, this is why; segmentation is a more basic issue which
is still being addressed.
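
As a rough illustration of what "replicable" segmentation can mean in
practice, the sketch below compares two coders' boundary decisions
using Cohen's kappa, a chance-corrected agreement statistic. This is my
own choice of measure for the example and is not necessarily what the
papers cited above used; the function name and the sample data are
invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(coder_a: Sequence[int], coder_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two coders over the same items."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Proportion of items on which the two coders gave the same label.
    observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented data: 1 = "segment boundary after this utterance", 0 = no boundary.
coder_a = [1, 0, 0, 1, 0, 0, 1, 0]
coder_b = [1, 0, 0, 0, 0, 0, 1, 0]
kappa = cohen_kappa(coder_a, coder_b)
```

Raw percentage agreement looks high whenever boundaries are rare, since
two coders will mostly agree on the non-boundaries; correcting for
chance is one way to avoid being misled by that.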

@inproceedings{EACL95:VERBMOBIL,
author = "Jan Alexandersson and Elisabeth Maier and Norbert Reithinger",
year = 1995,
title = "A Robust and Efficient Three-Layered Dialogue Component for a Speech-to-Speech Translation System",
booktitle = "Proceedings of the Seventh European Meeting of the ACL",
place = "Dublin, Ireland",
pages = {188-193}
}

@article{HCRC,
author = "Anne H. Anderson and Miles Bader and Ellen Gurman Bard and Elizabeth Boyle and Gwyneth Doherty and Simon Garrod and Stephen Isard and Jacqueline Kowtko and Jan McAllister and Jim Miller and Catherine Sotillo and Henry Thompson and Regina Weinert",
title = "The {HCRC} {Map} {Task} {Corpus}",
year = "1991",
volume = 34,
number = 4,
pages = {351-366},
journal = "Language and Speech"
}

@inproceedings{ACL93:Passonneau&Litman,
author = "Rebecca J. Passonneau and Diane J. Litman",
year = 1993,
title = "Intention-based segmentation: human reliability and correlation with linguistic cues",
booktitle = "Proceedings of the 31st Annual Meeting of the ACL",
place = "Columbus, Ohio",
month = "June",
pages = {148-155}
}

@article{Greene&Cappella,
author = "John O. Greene and Joseph N. Cappella",
year = 1986,
journal = "Language and Speech",
volume = 29,
pages = {141-157},
number = 2,
title = "Cognition and Talk: The relationship of semantic units to temporal patterns of fluency in spontaneous speech"
}

------- End of Forwarded Message