Corpora: T-score

From: Tomaz Erjavec (Tomaz.Erjavec@ijs.si)
Date: Fri Apr 21 2000 - 16:01:27 MET DST


    Hi,

    Ha Le An writes:
    > Sorry to write this letter, but after a lot of attempts, I can not find a
    > clear definition for T-score. Can anybody help me?

    There was a discussion on T-score already on the Corpora list.
    See the archives at http://www.hit.uib.no/corpora/1999-4/

    I enclose the post by Jem Clear; it's a good one.

    Enjoy!

    Tomaz

    -- 
    Tomaz Erjavec                  | Dept. for Intelligent Systems E-8
    email: tomaz.erjavec@ijs.si    | Jozef Stefan Institute
    www:   http://nl.ijs.si/et/    | Jamova 39
    fax:   (+386 61) 219-385       | SI-1000 Ljubljana, Slovenia
    

    From: Jem Clear <jem@cobuild.collins.co.uk>
    Sender: owner-corpora@lists.uib.no
    To: corpora@lists.uib.no
    Subject: Re: Corpora: T-score in collocational analysis
    Date: Sun, 12 Dec 1999 17:19:23 GMT

    Let me put Tony Berber Sardinha and Gordon and Pam Cain out of their misery! The URL that Gordon and Pam Cain referred to won't bring up the bit of text they were hoping to find. But I reproduce it here below in its entirety. It's very short and it was written off the top of my head as an aid to people who use the CobuildDirect corpus facilities and who used to email me asking "Er.... what are those MI and t-score numbers that appear on the screen when I ask for collocations?"

    None of what follows is my own: it is a potted summary of what I understood from Ken Church, from whom I got the t-score formula. Ken Church and Bill Gale (et al.) used this statistic over a decade ago and published papers on its use. Annoyingly (in this new cyber world) I routinely refer people to

    Church, K. W., W. Gale, P. W. Hanks, and D. Hindle, "Using Statistics in Lexical Analysis", in Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Hillsdale, NJ: Lawrence Erlbaum, 1991.

    and I routinely get a reply that goes like "Oh, I can't seem to find this paper -- do you know anywhere else I can read up about this MI and t-score stuff?". Partly for this reason I wrote the lightweight (and possibly inaccurate) "quick guide" below. I am *not* a statistician.

    Cheers

    Jem Clear, Electronic Development Director
    Collins Dictionaries
    Westmere, 50 Edgbaston Park Road
    Birmingham, B15 2RX, UK
    phone: +44 (0)121-414-3926
    fax:   +44 (0)121-414-6203
    email: jem@cobuild.collins.co.uk
    WWW:   www.cobuild.collins.co.uk

    ----------

    The two statistical measures of significance which are used by the collocations feature of the CobuildDirect service are explained below in layman's terms. It is not really possible to explain the complete statistical background to the use of Mutual Information and t-scores here.

    The output you see will be in four columns. The first column lists each collocate. The second column shows the total independent frequency of that collocate in the corpus. The third column shows the frequency with which the node and the collocate appear together (i.e. within the specified span) in the corpus. The fourth column shows the statistical significance score (either Mutual Information or t-score, as selected by the user). The collocates are listed in descending order of significance.

    -----------------------

    Let us work through some example data (taken from a 20m word corpus) for the word "post".

    It co-occurs with many words, among which are "the", "office" and "mortem".

    The observable facts are that "post" has an overall corpus freq of 2579 (let's refer to this as f(post)=2579) and also

    f(office) = 5237
    f(the)    = 1019262
    f(mortem) = 51

    We also observe the number of times these words co-occurred with "post" (for shorthand I'll write j(the) = 1583 to mean that "the" occurred with "post" 1583 times: this is the "joint" frequency). So

    j(the)    = 1583
    j(office) = 297
    j(mortem) = 51

    Now if we were to list the collocates of "post" by raw frequency of co-occurrence we would order them according to j(x), as above. Of course, a full collocation listing of "post" in this form would have many other words with intermediate frequencies -- we are just focussing on these three words for the moment. But the ordering shown above doesn't tell us much about the strength of association between "post" and these other words: it is simply a reflection of the basic overall frequency of the collocating words (i.e. "the" is much more frequent than "office", which is much more frequent than "mortem"). We just showed that in the f(x) list! This is true in general: ordering collocates by j(x) simply places words like "the", "a", "of", "to" at the top of every collocate list. What we would like to know is

    ------------------------------------------------------------------------
    IMPORTANT QUESTION: to what extent does the word "post" condition its
    lexical environment by selecting particular words with which it will
    co-occur?
    ------------------------------------------------------------------------

    We can compare the relative frequencies of what we observed with what we would expect under the null hypothesis:

    ------------------------------------------------------------------------
    NULL HYPOTHESIS: the word "post" has no effect whatsoever on its lexical
    environment, and the frequencies of words surrounding "post" will be
    exactly the same (give or take random fluctuation) whether "post" is
    there or not.
    ------------------------------------------------------------------------

    That is, if "the" has an overall relative frequency of 1 in 20 (about 1m occurrences in a 20m word corpus -- see f(the) above) then we can expect "the" to occur with the same relative frequency in a subset of the corpus which is the 4 words either side of "post": hence under the null hypothesis we would expect j(the) to be

    ( f(post) * span ) * relative_freq(the)

    which is

    (2579 * 8) * (1 / 20) = 20632 / 20 = 1031
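
    (If you want to check the arithmetic, here is a minimal sketch in plain Python -- no corpus software assumed -- that reproduces the calculation above, using the rough 1-in-20 relative frequency for "the".)

        N = 20_000_000          # corpus size in words
        f_post = 2579           # overall corpus frequency of the node word "post"
        span = 8                # 4 word positions either side of the node
        rel_freq_the = 1 / 20   # rough relative frequency of "the" (about 1m in 20m)

        expected_j_the = (f_post * span) * rel_freq_the
        print(round(expected_j_the, 1))   # 1031.6, rounded down to 1031 in the text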

    So under the null hypothesis we would expect j(the) to be 1031. We actually observed j(the) to be 1583, which is rather higher, and we could simply express the difference as ratio (of observed to expected joint frequency) thus:

    1583/1031

    This ratio of observed to expected co-occurrence is the basis of the Mutual Information score (strictly, MI is the base-2 logarithm of the ratio), and it expresses the extent to which the observed frequency of co-occurrence differs from the expected frequency (where "expected" means "expected under the null hypothesis"). Of course, big differences indicate massive divergence from the null hypothesis and suggest that "post" is exerting a strong influence over its lexical environment.
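
    (To make that concrete: the conventional MI formula, as in the Church et al. paper cited above, takes the base-2 logarithm of the observed/expected ratio. A minimal sketch for "the", assuming that log-based formulation -- the exact formula used by CobuildDirect is not spelled out here.)

        import math

        observed_j_the = 1583     # observed joint frequency of "post" + "the"
        expected_j_the = 1031.6   # expected joint frequency under the null hypothesis
        mi_the = math.log2(observed_j_the / expected_j_the)
        print(round(mi_the, 2))   # ~0.62: "the" co-occurs with "post" only a little more than chance predicts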

    BUT BUT BUT! there is a Big Problem with Mutual Information: suppose the word "egregious" appears just once with "post" (not an unreasonable event) in the corpus. And "egregious" may have a very low overall freq:

    f(egregious) = 3

    Now we carry out the sums to calculate the expected j(egregious) figure. I can assure you it will be a small number! It is:

    ( f(post) * span ) * relative_freq(egregious)

    (2579 * 8) * ( 3 / 20000000)

    = 0.0030948

    Now you'll see that even if "egregious" occurs just once in the vicinity of "post", the observed j(egregious) will be about 323 times the expected joint frequency, and the Mutual Information value will be high. Common sense tells us that since words cannot appear 0.0030948 times -- they occur either zero times or once, nothing in between -- claiming that "post"+"egregious" is a significant collocation is rather dubious.
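
    (A sketch of the problem in numbers, using the same figures and the same log-based MI as above: the one-off "egregious" ends up with an MI almost as high as "mortem", which co-occurs 51 times.)

        import math

        N, f_post, span = 20_000_000, 2579, 8

        def mi(f_collocate, j_observed):
            # MI = log2(observed / expected), with expected as computed above
            expected = (f_post * span) * (f_collocate / N)
            return math.log2(j_observed / expected)

        print(round(mi(51, 51), 2))   # "mortem":    MI ~ 9.9, backed by 51 co-occurrences
        print(round(mi(3, 1), 2))     # "egregious": MI ~ 8.3, backed by a single co-occurrence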

    In general, the comparison of observed j(x) and expected j(x) will be very unreliable when values of j(x) are low; this is common sense, too. Just because I've seen these two words together once in 20m words doesn't give me much confidence that they are strongly associated: I'd need to see them together several times at least before I could start to feel at all secure in claiming that they have some sort of significant association.

    Now here comes T-score. We can calculate a second-order statistic which is, crudely, this:

    ------------------------------------------------------------------------
    IMPORTANT QUESTION: how confident can I be that the association that I've
    measured between "post" and "egregious" is true and not due to the
    vagaries of chance?
    ------------------------------------------------------------------------

    T-score answers this question. It takes account of the size of j(x) and weights its value accordingly. A high T-score says: it is safe (very safe/pretty safe/extremely secure etc according to value) to claim that there is some non-random association between these two words.
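
    (The note never writes the formula out. The formulation usually attributed to Church and colleagues -- and the one assumed in this sketch -- is t = (observed - expected) / sqrt(observed), i.e. the difference between observed and expected joint frequency, scaled by an estimate of its standard deviation. The exact variance term used by CobuildDirect may differ slightly.)

        import math

        N, f_post, span = 20_000_000, 2579, 8

        def t_score(f_collocate, j_observed):
            # t = (observed - expected) / sqrt(observed), the usual collocation t-score
            expected = (f_post * span) * (f_collocate / N)
            return (j_observed - expected) / math.sqrt(j_observed)

        print(round(t_score(1_019_262, 1583), 1))  # "the":       t ~ 13.4 (frequent, but only mildly associated)
        print(round(t_score(5237, 297), 1))        # "office":    t ~ 16.9 (frequent and strongly associated)
        print(round(t_score(3, 1), 1))             # "egregious": t ~ 1.0  (one co-occurrence: unreliable)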

    So t-scores are higher when the figure j(x) is higher. In the case of "egregious" we would get a very low t-score. In the case of "the" the t-score might be quite high, but not huge, because "the" doesn't have that strong an association with "post". "office" gets a really high t-score because not only is the observed j(office) way higher than expected, but we have seen a goodly number of such co-occurrences, enough to be pretty damn sure that this can't be due to some freak of chance.

    In practical terms, raw frequency or j(x) won't tell you much at all about collocation: you'll simply discover what you already knew, that "the" is a *very* frequent word and seems to co-occur with just about everything. MI is the proper measure of strength of association: if the MI score is high, then the observed j(x) is massively greater than expected, BUT you've got to watch out for the low j(x) frequencies because these are very likely to be freaks of chance, not consistent trends. t-score is the best of the lot, because it highlights those collocations where j(x) is high enough not to be unreliable and where the strength of association is distinctly measurable.
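
    (Putting the three measures side by side on the four example collocates makes the point. Again this sketch assumes MI = log2(observed/expected) and t = (observed - expected)/sqrt(observed); the figures are the ones given earlier.)

        import math

        N, f_post, span = 20_000_000, 2579, 8
        collocates = {              # word: (corpus frequency f(x), joint frequency j(x) with "post")
            "the":       (1_019_262, 1583),
            "office":    (5237, 297),
            "mortem":    (51, 51),
            "egregious": (3, 1),
        }

        def mi(f_x, j_x):
            expected = (f_post * span) * (f_x / N)
            return math.log2(j_x / expected)

        def t_score(f_x, j_x):
            expected = (f_post * span) * (f_x / N)
            return (j_x - expected) / math.sqrt(j_x)

        rankings = {
            "raw j(x)": lambda w: collocates[w][1],
            "MI":       lambda w: mi(*collocates[w]),
            "t-score":  lambda w: t_score(*collocates[w]),
        }
        for name, key in rankings.items():
            print(f"{name:9s}", " > ".join(sorted(collocates, key=key, reverse=True)))
        # raw j(x)  the > office > mortem > egregious   (just overall frequency)
        # MI        mortem > egregious > office > the   (rare oddities float to the top)
        # t-score   office > the > mortem > egregious   (frequent, reliable collocates first)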

    Try the different measures: you'll soon see the difference. Raw freq often picks out the obvious collocates ("post office", "side effect") but you have no way of distinguishing these objectively from frequent non-collocations (like "the effect", "an effect", "effect is", "effect it", etc). MI will highlight the technical terms, oddities, weirdos, totally fixed phrases, etc ("post mortem", "Laurens van der Post", "post-menopausal", "prepaid post"/"post prepaid", "post-grad"). T-score will get you significant collocates which have occurred frequently ("post office", "Washington Post", "post-war", "by post", "the post").

    If a collocate appears in the top of both MI and t-score lists it is clearly a humdinger of a collocate, rock-solid, typical, frequent, strongly associated with its node word, recurrent, reliable, etc etc etc.

    Jem Clear, June 1995


