Re: Certainty of a unique permutation in a corpus

E S Atwell (eric@scs.leeds.ac.uk)
Mon, 2 Oct 1995 13:37:47 +0100

I was hoping to see some other replies before sticking my neck out, but...

my intuition is that there is NO standard way of calculating the absolute
probability that your hypothesis is correct. Scientific hypotheses in
general can be empirically DISproven, but not proven.

However, there may be a partial, pragmatic solution.
Although earlier work on machine learning of English syntax (Gold
etc) worked on the principle that you HAD to have negative as well as
positive examples (i.e. the ML system had to be explicitly told that abcd is
fine but acbd is illegal; no amount of positive examples could let it infer
that acbd was not allowed), this assumed the Chomskyan model of a grammar
as defining a clear-cut set of well-formed sentences (and excluding
ill-formed sentences). More recent research on stochastic grammars has
largely abandoned this in favour of models which can assign a "grammaticality
measure" to ANY string, so none are completely illegal. Your corpus
evidence may provide evidence that, say, abcd has probability 0.9,
whereas acbd has probability 0.000005. You are free to interpret this
as you wish, for example to say that every sequence below a threshold
probability is "illegal". Unfortunately many studies have shown that
letter/word/tag/parse-constituent frequency distributions are not Normal
but Zipfian: you would expect to find many sequences with low frequency
in the corpus (most 4-tuples would have frequency = 1), so it is very unsafe
to assume a sequence with frequency = 0 could not occur in a larger sample
of the same language.
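
To make the last point concrete, here is a rough sketch in Python (the
function names and the toy corpus are invented for illustration, not taken
from any particular toolkit). It counts 4-tuples, reports what share of the
distinct 4-tuples occur exactly once, and applies the Good-Turing estimate
N1/N for the total probability of unseen 4-tuples. Good-Turing is my choice
of estimator for the example, not something the argument above depends on.

```python
from collections import Counter


def ngram_counts(tokens, n=4):
    """Count every contiguous n-tuple in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def singleton_share(counts):
    """Fraction of distinct n-tuples that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen n-tuples.

    N1 / N, where N1 is the number of n-tuples seen exactly once and
    N is the total number of n-tuple tokens observed.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / total


# Hypothetical toy "corpus" of single-letter tokens.
toy = "a b c d a b c d a c b e a b c d f g".split()
counts = ngram_counts(toy)

print("abcd seen:", counts[("a", "b", "c", "d")])
print("acbd seen:", counts[("a", "c", "b", "d")])  # never observed
print("share of 4-tuples seen once:", singleton_share(counts))
print("estimated unseen mass:", unseen_mass(counts))
```

With a sample this small most 4-tuples are singletons, so the estimated
unseen mass is large, and a count of zero for acbd says very little about
whether it could turn up in a bigger sample.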

I hope someone comes up with a better reply than this!

____________________________________________________________

Eric Steven Atwell
Centre for Computer Analysis of Language And Speech (CCALAS)
KBS+SALT Coordinator, HEFCs-JISC New Technologies Initiative
Artificial Intelligence Division, School of Computer Studies
The University of Leeds, LEEDS LS2 9JT, Yorkshire, England
TEL:0113-2335761 FAX:0113-2335468 EMAIL:eric@scs.leeds.ac.uk
____________________________________________________________