Corpora: Linking audio and text: SUMMARY

Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Mon, 8 Sep 1997 18:00:51 +0100

LINKING AUDIO AND TEXT - SUMMARY OF RESPONSES
=============================================

A couple of months back, I sent out the following to the corpora
list. Below I attach a summary of responses (and an email list of
contributors).

Thanks to Vicky Waite for help in the compilation.

Adam Kilgarriff

> Does anyone know of software for linking audio material and
> transcripts of it, so that the transcript can be used for searching
> and then a hyperlink gets you to the appropriate spot in the
> audio to hear what was being said? The material is mostly
> conversation, and I'd be very interested to hear about both automatic
> methods, and `workbenches' for helping people insert links swiftly and
> accurately.

SUMMARY OF RESPONSES
====================

CHRIS BREW from the Language Technology Group, HCRC, at Edinburgh
University recommends their website
http://www.cogsci.ed.ac.uk/hcrc/wgs/dialogue/corpus_interface
Their CGI scripts may also be available, and they have a tool for
fast, efficient time-stamping.

RICHARD CAULDWELL, Univ of Birmingham: "I have been experimenting for
a number of years with linking transcriptions and audio recordings.
The way I have done it is by having the audio on CD-Audio, and then
using hyperlinks which enable me to click on a transcription to play
the relevant part of the CD-Audio. My initial work (1990-1996) was on
an Apple platform using the Voyager Audiostack (used by others to
author the Beethoven Ninth Symphony and other multimedia titles). I
have written this up in two places: 'Of Streams and Bricks: new ways
of presenting the spoken language to learners', Speak Out (Newsletter
of the IATEFL Phonology Special Interest Group), No. 10, August 1992,
pp. 29-34; and 'Direct Encounters with fast speech on CD-Audio to
teach Listening', System, Vol. 24, No. 4, pp. 521-528. I have this
year changed over to a PC platform, and I am just about to create (I
hope) Multimedia Toolbook applications to do the same there."

JOHN DU BOIS: "At UC Santa Barbara, we have developed workbench
software for aligning conversational transcriptions with the
corresponding digitized sound. It is called SoundWriter, and works
with Windows 3.1 or Windows 95. It is not yet available for general
distribution, however. The Corpus of Spoken American English may be
distributed with a version of this software." See also the chapter
in 'English Corpus Linguistics', eds. Karin Aijmer and Bengt
Altenberg, Lund University Press, 1991.

ENTROPIC: A couple of people pointed to 'Aligner' software produced
by Entropic:
http://www.entropic.com/aligner.html

HELMUT FELDWEG from Tuebingen University suggests the CHILDES
software, and in particular its sonic CHAT facility, which was
designed for precisely this purpose.

DJOERD HIEMSTRA: "In 1998 a project named 'Olive' will have its
kick-off. The objective of Olive is to index video material using
both the soundtrack and scripts. Aligning the scripts with the
actual soundtrack using speech recognition tools will be one of the
techniques envisaged. Olive will be sponsored by the Language
Engineering sector of the European Union's Telematics Applications
Programme."

DOUG OARD says that the Informedia project at CMU is doing precisely
this using closed-caption text and speech recognition. They align the
closed captions with the (noisy) output of the speech recogniser, and
then work backwards from the recogniser output, via its time tags, to
the audio it came from. You can find their web page at
http://www.informedia.cs.cmu.edu/
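
The alignment step itself is easy to sketch. Below is a minimal
illustration in Python (a sketch of ours, not Informedia's code; all
names and data are hypothetical): align the caption words with the
time-tagged recogniser output using a standard sequence matcher, then
read a time off each matched word.

    import difflib

    def align_captions(caption_words, asr_words, asr_times):
        """Pair closed-caption words with (noisy) recogniser output.

        caption_words -- words from the closed captions
        asr_words     -- words from the speech recogniser
        asr_times     -- start time (seconds) of each recognised word

        Returns a dict mapping caption word index -> time in the
        audio. Unmatched caption words get no time directly; their
        times can be interpolated from matched neighbours.
        """
        matcher = difflib.SequenceMatcher(
            a=[w.lower() for w in caption_words],
            b=[w.lower() for w in asr_words])
        times = {}
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                times[block.a + k] = asr_times[block.b + k]
        return times

    # Hypothetical data: the recogniser mishears "symphony".
    captions = "the ninth symphony begins quietly".split()
    asr = "the ninth sympathy begins quietly".split()
    tags = [0.0, 0.4, 0.9, 1.6, 2.1]
    print(align_captions(captions, asr, tags))
    # -> {0: 0.0, 1: 0.4, 3: 1.6, 4: 2.1}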

PETER ROACH at Reading University produced a paper with Simon Arnfield
on how they did this task for the MARSEC corpus, which was included in
"Spoken English on Computer" (eds G. Leech, G. Myers and J. Thomas,
Longman 1995).

JORDI ROBERT-RIBES, CSIRO, NSW, Australia, said: "We have recently
worked on a prototype for ALTA (Automatic Linking of Transcript and
Audio). It has given good results on some TV programs. We use it to
generate links between the transcript and the video/audio of digital
collections. Therefore we do not need to run in real time, which
allows us some flexibility. A paper about this work was accepted for
Eurospeech'97 (22-25 Sept), where it will be presented orally."

TONY ROBINSON of SoftSound has been working on the problem and has
been aligning audio with text. A student has developed an interface
in which you can click on a word and hear the audio (with its
surrounding context if that is what you want).
http://www.SoftSound.demon.co.uk/

TONY ROSE of the Canon Research Centre Europe pointed to "Structuring
voice records using keyword labels", N. J. Haddock, Proc. CHI'96,
Vancouver. The approach is slightly different in that the transcript
does not exist in advance. Instead, keywords are detected in the
audio stream and transcribed by the recognition process; these are
then used as visible labels to index the various segments of interest
in the audio stream.

ALAN SMEATON from the MMIR group at Dublin City University has been
working in related areas on a few projects.
* Virtual Lectures, for which they have 15 hours of lecture
material and a manual transcription. They index the transcription by
breaking it into overlapping windows and weighting each term
according to its position within the window. Thus if you search for
"OPTIMISATION" you want to retrieve window 1 rather than window 2,
because when window 1 is played the word is heard with context on
both sides:

Window 1 |--------------------OPTIMISATION--------------------|
Window 2 |--OPTIMISATION----------------------------------|

They also allow access via a back-of-the-book index and via a
traditional table of contents. The VL project has a demo if you have
RealAudio installed:
http://lorca.compapp.dcu.ie/~asmeaton
(follow the link to "My Research")
* Related to the VL project, they are taking part in TREC-6,
which has a spoken-document track of 50 hours of radio/TV news. There
are two sets of transcripts: exact, corrected ones, and noisy ones as
output by speech recognition. They use the noisy transcripts, turn
them into trigrams of phones, do likewise with the queries, retrieve,
and then return segments of the audio broadcasts.
* They are using both these pieces of work to index radio news
broadcasts from RTE (Irish state radio) by trigrams of phones using
an HMM speech recogniser, so that a query result launches playback of
a segment of a broadcast. In this case they have no transcripts to
retrieve against, only a stream of phonemes, which they break into
overlapping windows in the same way as for the virtual lectures (a
sketch of this pipeline follows the list).
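
As a rough illustration of this windowing-plus-trigrams pipeline,
here is a sketch in Python (ours, not DCU's code; the window size and
step are arbitrary):

    def trigrams(phones):
        """The set of consecutive phone trigrams in a sequence."""
        return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}

    def windows(stream, size=30, step=15):
        """Overlapping windows over a phone stream (step < size)."""
        last = max(len(stream) - size, 0)
        return [(i, stream[i:i + size])
                for i in range(0, last + 1, step)]

    def score(window, query_tris):
        """Sum positional weights of matching trigrams: a hit near
        the middle of a window counts for more, so the winning window
        plays the match with context on both sides (the Virtual
        Lectures weighting idea)."""
        centre = max((len(window) - 3) / 2.0, 1.0)
        total = 0.0
        for i in range(len(window) - 2):
            if tuple(window[i:i + 3]) in query_tris:
                total += 1.0 - abs(i - centre) / centre
        return total

    def search(stream, query, size=30, step=15):
        """Rank window offsets in the phone stream against a query."""
        q = trigrams(query)
        ranked = sorted(((score(w, q), off)
                         for off, w in windows(stream, size, step)),
                        reverse=True)
        return [off for s, off in ranked if s > 0]

The winning offset can then be turned into a playback position, for
instance with the linear mapping described next.
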
As to the problem of a toolbox to create the links, they match
linearly: e.g. if a spoken segment is 60 seconds long and the
transcript is 120 words, then a launch from the 41st word starts 20
seconds into the segment.
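
In code, that linear estimate is essentially one line (a sketch; the
worked numbers are the ones above):

    def launch_time(word_index, n_words, duration_secs):
        """Start of the word_index-th word (1-based), assuming the
        words are spread evenly across the segment."""
        return (word_index - 1) / float(n_words) * duration_secs

    print(launch_time(41, 120, 60.0))  # -> 20.0 seconds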

HONGYIN TAO recommended Waves Plus for Unix systems.

SYUN TUTIYA from a project team at Chiba University in Japan refers to
the HCRC Map Task Corpus on CD-ROM from Edinburgh University, which
they have replicated at Chiba University for Japanese. They point out
a basic problem to do with the format of sound files: both they and
the Edinburgh team use an interleaved stereo sound format, but others
tend to think that one file per segment is a good idea. The two
approaches require different tools, or one tool with different
options.

RALF VOLLMANN of Karl-Franzens University, Graz, suggests the S_Tools
workstation for the analysis of sound, from the acoustics research
laboratory of the Austrian Academy of Sciences, Vienna:
http://www.kfs.oeaw.ac.at
It allows sound files of any length to be stored and used, and within
each sound file it is possible to make a virtual segmentation of any
part.

MARC WEEBER from Groningen University in the Netherlands says he has
worked with a combination of the CHILDES editor and the UNIX program
XWAVES. This allows you to 'point' to a part of the transcript and
hear the corresponding sound, which you can also see as a waveform.
"The linking of sound and text takes a lot of time, though."

Thanks to (with email addresses):

Eric Atwell eric@scs.leeds.ac.uk
Bengt Altenberg bengt.altenberg@englund.lu.se
Nancy Belmore belmore@vax2.concordia.ca
Chris Brew Chris.Brew@edinburgh.ac.uk
Richard Cauldwell cauldwrt@novell1.bham.ac.uk
Alex Collier alex@rdues.liv.ac.uk
George Demetriou george@scs.leeds.ac.uk
John Du Bois dubois@ccsg.tau.ac.il
Helmut Feldweg feldweg@sfs.nphil.uni-tuebingen.de
Djoerd Hiemstra hiemstra@cs.utwente.nl
David Lee d.lee@lancaster.ac.uk
Doug Oard oard@glue.umd.edu
Peter Roach p.j.roach@reading.ac.uk
Jordi Robert-Ribes Jordi.Robert-Ribes@cmis.csiro.au
Tony Robinson ajr@softsound.com
Tony Rose tgr@cre.canon.co.uk
Alan Smeaton asmeaton@compapp.dcu.ie
Hongyin Tao chstaohy@leonis.nus.sg
Syun Tutiya tutiya@kenon.ipc.chiba-u.ac.jp
Ralf Vollmann ralf.vollmann@kfunigraz.ac.at
Marc Weeber m.weeber@farm.rug.nl
Lynn Wilcox wilcox@pal.xerox.com
Job van Zuijlen zuijlenj@verdi.iisd.sra.com

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ email: Adam.Kilgarriff@itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%