I have a query regarding the D measure (Index of Dispersion). I understand
that it is based on the distribution of a word-type's frequency over n
sub-corpora (categories). D = 0.000 when all occurrences of the
word-type are found in a single category, regardless of frequency. D = 1.000
when the word-type's occurrences are distributed over the n categories
exactly in proportion to the total number of tokens in those categories.
Do the categories or sub-corpora have to be equal in token size? My
previous sentence would suggest not, but in most studies I have heard about,
the sub-corpora have always been equal.
The formula is given as:
D = [log(sum p_i) - (sum p_i log p_i) / (sum p_i)] / log n
where
n = the number of categories
i = the category number, 1, 2, 3, ..., n
p_i = the probability of a token in the ith category, with p_i log p_i = 0
for p_i = 0. All sums run over i = 1, ..., n.
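
To make the question concrete, here is a minimal Python sketch of the formula as I read it. It rests on an assumption of mine that is not stated in the definition: that the probability for category i is the word-type's relative frequency within that category (its occurrences divided by the category's token count). The function name and argument names are just illustrative.

```python
import math

def dispersion_d(word_counts, category_sizes):
    """Sketch of the D measure under the assumption p_i = f_i / s_i.

    word_counts[i]    -- occurrences of the word-type in category i (f_i)
    category_sizes[i] -- total tokens in category i (s_i)
    """
    n = len(word_counts)
    if n < 2 or n != len(category_sizes):
        raise ValueError("need at least two categories with matching sizes")

    # Assumed reading: probability of the word-type per token in each category.
    p = [f / s for f, s in zip(word_counts, category_sizes)]
    total = sum(p)
    if total == 0:
        raise ValueError("the word-type does not occur in any category")

    # Convention from the definition: p_i log p_i = 0 when p_i = 0.
    plogp = sum(x * math.log(x) for x in p if x > 0)

    return (math.log(total) - plogp / total) / math.log(n)

# All occurrences in one category: D comes out close to 0.
print(dispersion_d([10, 0, 0], [1000, 1000, 1000]))

# Unequal category sizes, occurrences proportional to size: D comes out close to 1.
print(dispersion_d([10, 20, 40], [1000, 2000, 4000]))
```

Under this reading the two limiting cases described above come out as expected even when the categories differ in size, which is what prompts my question.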
Thanks in advance,
James Purchase
Language Centre
HKUST