Corpora: cognitive salience in lexicography

Patrick Hanks (hanksp@oup.co.uk)
Tue, 7 Dec 1999 12:08:26 GMT

Dave Carlson said:

>>I recall reading about a study of the work that went into the
>>compilation of the OED, and about how the attention of the
>>compilers was attracted more toward idiosyncratic usage rather than
>>toward examples of typical use.

See Hanks 1990:

How is a lexicographer to know what _any_ word means? How are the
public features of word meaning and use to be identified? I suggest
that at least three components are necessary for the construction of
an accurate (or at least usable) account of the meanings and
conventional functions of a word. The first is a body of evidence -
citations, indices, concordances to a corpus, and so on. The second
is the personal knowledge or intuitions about word meanings which
native speakers have, although these are notoriously difficult to
access directly. The third is the body of statements (true or false,
accurate or inaccurate, as the case may be) to be found in existing
dictionaries, grammars, and other language studies.

To take textual evidence first: If we rely on reading-and-marking
methods, dependent on the diligence of human citation readers, a
distorting factor rapidly becomes apparent, as Murray himself noticed
in 1879, when starting work in earnest on OED:

"The editor or his assistants have to search for precious hours for
examples of common words, which readers passed by.... Thus, of
_Abusion_, we found in the slips about 50 instances: of
_Abuse_ not five". Again, "There was not a single quotation for
_imaginable_ - a word used by Chaucer, Sir Thomas More, and
Milton"

(quoted by K.M.E. Murray (1977): _Caught in the Web of Words_,
pp. 178, 168).

[... If only for] reasons of practical common sense the
lexicographer should focus attention on those conventions of the
language that are socially salient - i.e., those that represent
central and typical patterns of usage. This is of particular
importance in dictionaries intended for foreign learners of a
language, and even more important if the dictionary is intended, like
the one [compiled] at Cobuild in the University of Birmingham, to give
special help with encoding as well as decoding. It is more sensible
to show a learner the patterns of language that are common and typical
than to exemplify odd, eccentric, mannered, or metaphorical uses of
language.

Traditionally, however, dictionaries have tended to do precisely the
opposite: more by accident than design, I suspect. In reading and
marking, psychological salience tends to interfere with social
salience. Psychologically, human beings tend to register the
unfamiliar rather than the familiar, the unusual rather than the
usual. Thus, the citation files of modern dictionary publishers are
full of citations from correspondents for _tachograph_ and
_ayatollah_ - words which have come into prominence within the
past decade. But no dictionary (yet) can tell us what are the most
common uses of _take_. _Take_ is too familiar for its patterns
to be noticed by humans, and computational evidence in large enough
quantities to be significant is only just becoming available.
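
As a rough illustration of the kind of computational evidence meant
here, the following is a minimal sketch of counting what follows a node
word such as _take_. The function name right_collocates, the window
size, and the sample sentences are all assumptions made up for this
illustration; they are not taken from the paper or from any real
corpus.

```python
import re
from collections import Counter


def right_collocates(text, node, span=2):
    """Count words found within `span` tokens to the right of `node`.

    Hypothetical helper for illustration only: it lowercases the text,
    keeps alphabetic tokens, and matches the exact word form, so
    inflected forms (takes, took, taken) are NOT grouped with the node.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1 : i + 1 + span])
    return counts


# Invented toy sentences, not real corpus data.
sample = (
    "Take a seat. They take a break at noon. "
    "Please take care. We take a look every day. "
    "Take the tachograph reading."
)

for word, freq in right_collocates(sample, "take").most_common(3):
    print(word, freq)
```

On a toy sample this small the counts mean nothing; the point is only
that a mechanical count registers every occurrence of the familiar
word, whereas a human reader tends to pass over it. A realistic study
would need a large corpus and some grouping of inflected forms.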

REFERENCE:

Hanks, P. (1990): `Evidence and Intuition in Lexicography' in
J. Tomaszczyk and Barbara Lewandowska-Tomaszczyk (eds.)
_Meaning and Lexicography_, pp. 31-41. Amsterdam, Philadelphia: Benjamins.

(The Murray quote has been picked up and used often by other corpus
linguists and computational lexicographers, too.)

*************************************
Patrick Hanks
Oxford University Press
Great Clarendon Street
Oxford OX2 6DP.
email: hanksp@oup.co.uk