Re: Corpora: corpora variety/summary

From: Bill Fisher (william.fisher@nist.gov)
Date: Fri Sep 01 2000 - 15:03:23 MET DST


    Vladimir -

       This is not exactly to the point of your query,
    but as a part of a long-standing effort here at NIST to
    understand the factors that affect computer speech recognition
    accuracy, I've done some preliminary work in calculating
    what I call the "diversity" of a test-set
    corpus: how varied the corpus looks when seen
    through the eyes of an n-gram language model of the type
    almost universally used in speech recognition. It's supposed
    to be like test-set perplexity, except you don't use an external
    language model. I repeat a sort of jack-knifing experiment
    a number of times and average the resulting perplexities: randomly
    choose x% of the utterances and build a language model from
    them, then compute the perplexity of the remaining (100-x)%
    under that model. Ceteris paribus, the test-set corpus with lower
    diversity should be easier to recognize. If you or anyone
    else knows of a publication by someone already doing this,
    I'd appreciate being told about it.
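
    A minimal sketch of the procedure described above, in Python.
    The bigram order, add-k smoothing, whitespace tokenization,
    the 50% default split, and all function names are assumptions
    made for illustration; none of them are specified in the
    description, and a real experiment would likely use a standard
    language-modeling toolkit instead.

```python
import math
import random
from collections import Counter


def train_bigram(utterances):
    """Count bigrams and their left contexts, padding with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    for utt in utterances:
        tokens = ["<s>"] + utt.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def perplexity(model, utterances, vocab_size, k=1.0):
    """Per-token perplexity of the utterances under an add-k bigram model."""
    context_counts, bigram_counts = model
    log_prob = 0.0
    n_tokens = 0
    for utt in utterances:
        tokens = ["<s>"] + utt.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, cur)] + k) / (
                context_counts[prev] + k * vocab_size
            )
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


def diversity(utterances, train_fraction=0.5, trials=20, seed=0):
    """Average held-out perplexity over repeated random splits.

    Needs at least two utterances so both parts of a split are non-empty.
    """
    rng = random.Random(seed)
    # Assumption: the vocabulary is taken from the whole corpus (plus the
    # end marker), so words unseen in training still get smoothed mass.
    vocab = {w for utt in utterances for w in utt.split()} | {"</s>"}
    n_train = round(train_fraction * len(utterances))
    n_train = max(1, min(len(utterances) - 1, n_train))
    total = 0.0
    for _ in range(trials):
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        train, held_out = shuffled[:n_train], shuffled[n_train:]
        model = train_bigram(train)
        total += perplexity(model, held_out, len(vocab))
    return total / trials
```

    Calling diversity(corpus) on a list of utterance strings returns
    the averaged perplexity; under these assumptions a corpus of
    near-identical utterances scores lower than one whose utterances
    share no words.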

     - Bill F.


