[Corpora-List] misalignments in Boston University Radio Speech Corpus

From: Sabine Buchholz (sabine.buchholz@crl.toshiba.co.uk)
Date: Wed May 26 2004 - 17:46:20 MET DST


    Dear list members,

    I am supervising a student who works with the .wrd, .brk and .pos files of
    the Boston University Radio Speech Corpus. Although in theory all these file
    types should contain the same number of words/lines for any given file name,
    in practice there are many differences. For example, "school-based" or
    "they're" is treated as one word in one file and as two words in another.
    I guess that everybody who has worked with these files will have noticed
    this at some point, and I wonder how other people have dealt with it. Does
    anybody have a script to correct at least the easy cases? Or are there newer
    versions of the corpus where this has been corrected?
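
    To illustrate what an "easy case" could look like, here is a minimal
    Python sketch. It is not an existing script for this corpus. It assumes
    the words of two annotation files have already been read into plain
    lists (parsing the actual .wrd, .brk and .pos layouts is left out), and
    the names norm and align are hypothetical. It greedily groups tokens
    until both sides spell the same string, so a hyphenated word or a
    contraction split in one file is matched to the single token in the
    other; anything harder raises an error for manual inspection.

```python
import re


def norm(tok):
    """Lowercase a token and drop everything except letters and digits,
    so that 'school-based' compares equal to 'school' + 'based'."""
    return re.sub(r"[^a-z0-9]", "", tok.lower())


def align(a, b):
    """Group tokens of lists a and b so each pair of groups spells the
    same normalized string.

    Returns a list of (group_from_a, group_from_b) tuples.
    Raises ValueError when the two sides cannot be reconciled by simple
    concatenation (a "hard" case that needs manual inspection).
    """
    pairs = []
    i = j = 0
    while i < len(a) and j < len(b):
        ga, gb = [a[i]], [b[j]]
        sa, sb = norm(a[i]), norm(b[j])
        i += 1
        j += 1
        # Extend whichever side is currently a proper prefix of the other.
        while sa != sb:
            if len(sa) < len(sb) and sb.startswith(sa) and i < len(a):
                ga.append(a[i])
                sa += norm(a[i])
                i += 1
            elif len(sb) < len(sa) and sa.startswith(sb) and j < len(b):
                gb.append(b[j])
                sb += norm(b[j])
                j += 1
            else:
                raise ValueError(f"cannot align {ga!r} with {gb!r}")
        pairs.append((ga, gb))
    if i < len(a) or j < len(b):
        raise ValueError("leftover tokens after alignment")
    return pairs


# Hypothetical word lists, standing in for two files of the same utterance.
words_one = ["the", "school-based", "program", "they're", "here"]
words_two = ["the", "school", "based", "program", "they", "'re", "here"]

for group_one, group_two in align(words_one, words_two):
    print(group_one, group_two)
```

    Each resulting pair could then be used to merge or split the matching
    lines in the other file types, on the assumption that those files carry
    one word per line.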

    Thank you very much for any information,
    Sabine Buchholz



