equal corpora sizes??

James Purchase, Language Centre, Tel:358 7862 (LCGUIBI@usthk.ust.hk)
Thu, 18 Apr 1996 19:39:20 +0800


Dear all,

I have a query regarding the D measure, the Index of Dispersion. I understand
that it is based on the distribution of a word-type's frequency over n
sub-corpora (categories). D = 0.000 when all occurrences of the word-type are
found in a single category, regardless of frequency. D = 1.000 indicates that
the word-type's occurrences are distributed over the n categories exactly in
proportion to the total number of tokens in those categories.

Do the categories or sub-corpora have to be equal in token size? My
previous sentence would suggest not, but in most studies I have heard about,
the sub-corpora have always been equal.

The formula is given as:

D = [\log(\sum_i p_i) - (\sum_i p_i \log p_i) / \sum_i p_i] / \log n

where

n = the number of categories
i = the category number, 1, 2, 3, ..., n
p_i = the probability of a token in the ith category, with p_i \log p_i = 0 for p_i = 0.
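
To make the question concrete, here is a minimal Python sketch of the measure, reading the formula as D = [log(sum p_i) - (sum p_i log p_i) / sum p_i] / log n. It rests on an assumption of mine that the formula itself does not settle: p_i is taken to be the word-type's relative frequency in category i, that is, its count in the category divided by the category's token total. The function name dispersion_d and the arguments counts and sizes are hypothetical, chosen only for illustration.

```python
import math


def dispersion_d(counts, sizes):
    """Sketch of D = [log(sum p_i) - (sum p_i log p_i) / sum p_i] / log n.

    counts[i]: occurrences of the word-type in category i.
    sizes[i]:  total number of tokens in category i.
    Assumption (not confirmed by the formula): p_i = counts[i] / sizes[i].
    """
    if len(counts) != len(sizes) or len(counts) < 2:
        raise ValueError("need matching counts and sizes for at least 2 categories")

    # Relative frequency of the word-type within each category.
    p = [c / s for c, s in zip(counts, sizes)]
    total = sum(p)
    if total == 0:
        raise ValueError("the word-type does not occur in any category")

    # By convention, p_i * log(p_i) is taken as 0 when p_i = 0.
    plogp = sum(x * math.log(x) for x in p if x > 0)

    # The base of the logarithm cancels in the ratio, so natural log is fine.
    return (math.log(total) - plogp / total) / math.log(len(p))


# All occurrences in one category.
d_single = dispersion_d([10, 0, 0, 0], [1000, 1000, 1000, 1000])

# Unequal category sizes, occurrences proportional to size.
d_proportional = dispersion_d([10, 20, 30], [1000, 2000, 3000])
```

Under this reading, d_single comes out at 0 and d_proportional at 1 (up to floating-point rounding) even though the three categories differ in size, which is what the description of the two extreme values would lead one to expect. Whether this is the intended definition of p_i is exactly what I am unsure about.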

Thanks in advance,

James Purchase
Language Centre
HKUST
