Corpora: Subjective familiarity and objective frequency counts

From: Adam Kilgarriff (Adam.Kilgarriff@itri.brighton.ac.uk)
Date: Wed Sep 06 2000 - 17:36:35 MET DST

    Bruce,

    > Does anyone know of research examining the correlation between subjective
    > assessments of familiarity/frequency (i.e., how often do you see, hear,
    > read, write, speak this word?) and objective frequency counts (based on
    > large corpora)?

    You tackle a big topic here.

    You might say frequency is only of interest because it serves as a
    proxy for salience (aka subjective assessments of familiarity) - which
    cannot be straightforwardly measured. So the critical question is:
    where does corpus frequency fall down as a proxy for salience?
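
    A minimal sketch of how one could check where the proxy breaks down:
    rank a set of words by corpus count and by familiarity rating, then
    compute the Spearman rank correlation between the two. The BNC counts
    are the ones quoted in this message; the familiarity ratings are
    invented placeholders (a hypothetical 1-7 scale chosen to mirror the
    intuitions argued here), not measurements from any study.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# BNC counts quoted in this message.
bnc_count = {"thumb": 1363, "government": 66894, "quick": 5920, "quickly": 12381}

# INVENTED placeholder ratings (hypothetical 1-7 scale), for illustration only.
familiarity = {"thumb": 6.8, "government": 6.1, "quick": 6.6, "quickly": 6.3}

words = sorted(bnc_count)
rho = spearman([bnc_count[w] for w in words], [familiarity[w] for w in words])
print(f"Spearman rho over {len(words)} words: {rho:.2f}")
```

    With real ratings, the interesting output is less the single
    coefficient than the words whose two ranks disagree most.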

    Just as there are severe limits to how far you can go with the notion
    of a corpus being representative, so there are limits to 'salience' -
    different words are salient to different degrees for different people,
    and you really couldn't get professional cricketers and computer
    programmers to agree on the relative salience of "stump" and
    "interface". "Representative of what" translates to "salient for whom?"
    The moral: don't take corpus frequencies too seriously. Beyond the
    first few thousand items, a small change in sampling policy will
    produce quite different frequency lists.

    One interesting proposal is that a corpus of children's language (or
    language written for children) is a source of frequencies that will
    better correspond to salience than a corpus of adult language.
    Compare "thumb" (BNC count: 1,363) and "government" (BNC count: 66,894).
    In any corpus aiming at anything like representativeness, "government"
    will be far more frequent. Arguably, "thumb" is more salient -
    presenting as it does a clear, simple image, familiar to every member
    of the language community from a very early age. This relates to there
    being more children's stories featuring thumbs than governments, and
    to it being closer to a cognitive-psychology "basic level object", and
    also to the order in which we learn words, and thereby how deep they
    lie in our conceptualisation of the world.

    Then there are snags like derivational morphology: "quick" (adj) has
    BNC freq 5,920 whereas "quickly" has 12,381, but it's perverse to
    argue that "quickly" is more salient. Indeed, what does salience
    attach to: words, stems, or (in the other direction) word senses?
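
    To make the choice of counting unit concrete, here is a small sketch
    that re-aggregates word-form counts by stem. The two counts are the
    BNC figures quoted above; the form-to-stem mapping is a hypothetical
    hand-written table standing in for a real morphological analyser.

```python
from collections import defaultdict

# BNC counts quoted above, keyed by (word form, part of speech).
form_counts = {("quick", "adj"): 5920, ("quickly", "adv"): 12381}

# Hypothetical hand-written mapping from word form to stem; a real study
# would need a morphological analyser or stemmer instead.
stem_of = {"quick": "quick", "quickly": "quick"}

def counts_by_stem(form_counts, stem_of):
    """Sum word-form counts over the stem each form maps to."""
    totals = defaultdict(int)
    for (form, _pos), count in form_counts.items():
        # Forms missing from the mapping are treated as their own stem.
        totals[stem_of.get(form, form)] += count
    return dict(totals)

stem_counts = counts_by_stem(form_counts, stem_of)
print(stem_counts)
```

    Counting by stem removes the "quickly" > "quick" oddity, but only by
    deciding in advance that salience attaches to stems.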

    At Longman, we certainly thought long and hard about these issues
    before deciding to publish frequency band info (in LDOCE 3, 1995) and
    when deciding how to implement the ordering of senses: "most
    frequent first" vs. "most salient first".

    Psycholinguists argue they can measure salience with, e.g., time taken
    in lexical decision tasks, and that this is closer to the
    psychological truth than corpus frequencies. But the data is
    expensive to come by and still leaves lots of questions unanswered.
    They do have lots of experience of experimental paradigms in this
    territory (see many issues of Jnl of Psycholinguistics, work by
    Tanenhaus and Seidenberg among many others). I don't know of
    published work outside that paradigm on the topic.

    Depending, as ever, on corpus composition, raw frequency is often a
    less good proxy for salience than document frequency -- number of docs
    a word occurs in -- since it curbs the worst excesses of
    low-salience words occurring with high frequency because they are used
    a lot in a single specialised document. However, there are also
    general patterns whereby verbs and prepositions are more evenly spread
    through the language than nouns and pronouns, so counting doc
    frequency will tend to push verbs and prepositions higher up the freq
    list relative to nouns and pronouns - who's to say whether that's a
    good thing or not!
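
    The difference between the two counts can be sketched as follows. The
    three-document toy corpus is made up for illustration, and tokens are
    obtained by naive whitespace splitting, which a real corpus study
    would replace with proper tokenisation.

```python
from collections import Counter

def raw_and_document_frequency(documents):
    """Return (raw, doc) Counters for a list of tokenised documents.

    raw[w] is the total number of occurrences of w;
    doc[w] is the number of documents in which w occurs at least once.
    """
    raw = Counter()
    doc = Counter()
    for tokens in documents:
        raw.update(tokens)       # every occurrence counts
        doc.update(set(tokens))  # each document counts at most once per word
    return raw, doc

# Made-up toy corpus: one specialised document and two general ones.
documents = [
    "the enzyme binds the enzyme inhibitor and the enzyme degrades".split(),
    "she walked to the shop".split(),
    "he walked home in the rain".split(),
]

raw, doc = raw_and_document_frequency(documents)
for word in ("enzyme", "walked"):
    print(word, "raw:", raw[word], "docs:", doc[word])
```

    Here "enzyme" beats "walked" on raw frequency only because one
    specialised document repeats it; on document frequency the order
    reverses.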

    >
    > I know of one such paper by the psycholinguist Paul Luce (I will provide
    > the reference when I can find it again). Any other pointers or comments on
    > this issue are welcome.
    >
    > My specific interest is in the relationship between the prescribing
    > frequency of specific drugs (drug names) and health professionals'
    > subjective familiarity with those same names. This kind of information is
    > very important in psycholinguistic research, where the effects of word
    > frequency can be quite overpowering.
    >
    > On a related note, I'd appreciate pointers to any corpus of
    > medical/pharmaceutical/nursing literature that might serve as the basis for
    > an empirical count of drug names in the professional or scientific press.

    Not sure this would be worth much. What would be the population from
    which you would ideally draw your sample? And can you get anything
    like that in reality? Minor differences in sampling will produce
    drastically different frequencies of drug names, simply depending on
    which specialisations get represented in your corpus.
     
    Adam

    -- 
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    Adam Kilgarriff                                
    Senior Research Fellow                         tel: (44) 1273 642919     
    Information Technology Research Institute           (44) 1273 642900 
    University of Brighton                         fax: (44) 1273 642908
    Lewes Road                        
    Brighton BN2 4GJ         email:      Adam.Kilgarriff@itri.bton.ac.uk
    UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    



    This archive was generated by hypermail 2b29 : Wed Sep 06 2000 - 17:34:22 MET DST