Re: Corpora: corpora variety/summary

From: Bill Fisher (william.fisher@nist.gov)
Date: Fri Sep 01 2000 - 15:03:23 MET DST

Next message: David Lee: "Re: Corpora: register and genre"

Previous message: Jem Clear: "Corpora: Leaving Collins Cobuild"
In reply to: Vladimir Rykov, PhD in Computational Linguistics, MOCKBA: "Corpora: corpora variety/summary"

Vladimir -

This is not exactly to the point of your query,
but as a part of a long-standing effort here at NIST to
understand the factors that affect computer speech recognition
accuracy, I've done some preliminary work in calculating
what I call the <ital> diversity </ital> of a test-set
corpus, which is how varied the corpus is when seen
through the eyes of an n-gram language model of the type
almost universally used in speech recognition. It's supposed
to be like test-set perplexity, except you don't use an external
language model. I repeat a sort of jack-knifing experiment
a number of times, averaging the perplexity result: randomly
choose x% of the utterances and build a language model from
them, then compute the test-set perplexity of the other (100-x)%
of them. Ceteris paribus, the test-set corpus with lower
diversity should be easier to recognize. If you or anyone
else knows of a publication by someone already doing this,
I'd appreciate being told about it.
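
As a rough illustration, the procedure above could be sketched as
follows. This is only a minimal sketch under stated assumptions, not
the code used at NIST: the bigram order, the add-one smoothing, the
50% split, the 20 repetitions, and the names bigram_perplexity and
diversity are all illustrative choices that the description above
does not specify.

```python
import math
import random
from collections import Counter


def bigram_perplexity(train, test):
    """Perplexity of `test` under an add-one-smoothed bigram model of `train`.

    Each utterance is a list of word tokens. Bigram order and add-one
    smoothing are assumptions made for this sketch.
    """
    # Assumption: the vocabulary covers both halves, so every held-out
    # word gets a nonzero probability.
    vocab = {w for utt in train + test for w in utt} | {"</s>"}
    vocab_size = len(vocab)

    history_counts, bigram_counts = Counter(), Counter()
    for utt in train:
        words = ["<s>"] + utt + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[prev, cur] += 1

    log_prob, n_predicted = 0.0, 0
    for utt in test:
        words = ["<s>"] + utt + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            p = (bigram_counts[prev, cur] + 1) / (history_counts[prev] + vocab_size)
            log_prob += math.log(p)
            n_predicted += 1
    return math.exp(-log_prob / n_predicted)


def diversity(utterances, train_fraction=0.5, trials=20, seed=0):
    """Average held-out perplexity over repeated random splits of one corpus."""
    rng = random.Random(seed)
    k = round(train_fraction * len(utterances))
    if not 0 < k < len(utterances):
        raise ValueError("both the training part and the held-out part must be non-empty")
    results = []
    for _ in range(trials):
        shuffled = list(utterances)
        rng.shuffle(shuffled)
        # Build the model from the first k utterances, score the rest.
        results.append(bigram_perplexity(shuffled[:k], shuffled[k:]))
    return sum(results) / len(results)
```

Calling diversity(corpus) on a list of tokenized utterances returns
one number per corpus; under the assumptions above, a corpus of
near-identical utterances should score lower than one in which every
utterance uses different words.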

- Bill F.

This archive was generated by hypermail 2b29 : Fri Sep 01 2000 - 15:06:14 MET DST