Re: [Corpora-List] Query on Linking Text & Sound Files

From: Jean Veronis (Jean.Veronis@mailup.univ-mrs.fr)
Date: Sat Oct 19 2002 - 19:14:28 MET DST


    My team has reasonably good experience with text-sound alignment: we have
    aligned more than 500,000 words of transcripts at this point. The
    technique we use was developed by a student of mine in her PhD thesis
    (in French):

    Campione, E. (2001). Etiquetage prosodique semi-automatique de corpus oraux
    : algorithmes et méthodologie. Thèse de doctorat. Aix-en-Provence:
    Université de Provence [online :
    http://www.up.univ-mrs.fr/delic/theses/resume-campione.html]

    Our strategy is to align as we transcribe, using the Transcriber tool
    already mentioned by Khalid Choukri on this list
    (http://www.etca.fr/CTA/gip/Projets/Transcriber/). The same approach can
    be applied to pre-existing transcripts as well, although that is a bit
    less practical.

    The strategy relies on pre-segmenting the sound files with a pause
    detector. Pause detection is fairly reliable (90-95% precision and recall,
    depending on language and type of speech; more results on p. 200 of the
    thesis). It produces segments of a few seconds, which is an ideal span
    for transcribing audio files, since it matches fairly well what the
    transcriber can memorise at a time. We actually found that with this
    technique the transcription time did not increase compared with our old
    technique using a simple tape recorder, and the alignment came as a
    bonus. In addition, the result is more precise than with the old
    transcription methods, because the transcriber can replay the exact
    segment at will. That was rather impractical with tape recorders and
    made transcribers reluctant to listen to the same segment several times.
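
    To give an idea of what pause-based pre-segmentation involves, here is a
    minimal Python sketch. It is not the detector from the thesis: it assumes
    a plain energy threshold on short frames, and the function names, the
    threshold, the frame length and the minimum pause duration are all
    illustrative assumptions that would need tuning on real recordings.

```python
import math


def detect_pauses(samples, rate, frame_ms=10, threshold=0.02, min_pause_ms=200):
    """Return (start_s, end_s) pairs for runs of low-energy frames.

    Assumption: `samples` is a sequence of floats in [-1, 1] and a frame
    counts as silent when its RMS energy is below `threshold`.
    """
    frame_len = max(1, int(rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    pauses = []
    run_start = None  # index of the first frame of the current silent run

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        silent = rms < threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            # Keep the run only if it is long enough to count as a pause.
            if (i - run_start) * frame_ms >= min_pause_ms:
                pauses.append((run_start * frame_len / rate, i * frame_len / rate))
            run_start = None

    # Handle a silent run that lasts until the end of the signal.
    if run_start is not None and (n_frames - run_start) * frame_ms >= min_pause_ms:
        pauses.append((run_start * frame_len / rate, n_frames * frame_len / rate))
    return pauses


def segments_between(pauses, total_s):
    """Return the speech segments, i.e. the stretches between the pauses."""
    segments = []
    cursor = 0.0
    for start, end in pauses:
        if start > cursor:
            segments.append((cursor, start))
        cursor = end
    if cursor < total_s:
        segments.append((cursor, total_s))
    return segments
```

    Each (start, end) pair returned by segments_between is a candidate
    segment to transcribe and replay, so the time codes come for free once
    the text has been typed in for that segment.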

    Hope this helps.
    --jv

    Jean Véronis
    http://www.up.univ-mrs.fr/veronis/


