Corpora: prefabs reprobed: ignore previous McKenny posting

John Anthony Mckenny (Mckenny@CEU.HU)
Thu, 5 Aug 1999 16:27:50 MET-1MEST

Dear Colleagues
Here is a less problematic version of yesterday's posting which
several subscribers found either illegible or with one or two
infinitely long lines. Thanks to Knut Hofland and Jean Hudson for
patience shown.

Good luck

John

Dear corpus crunchers
On May 5 I posted a question on prefabricated phrases in various
genres of academic English and received a set of very helpful
responses from the following scholars: Chris Tribble, Sylvie De Cock,
Maria Wiktorsson, Alejandro Curado, Eleanor Olds Batchelder, Gunter
Lorenz, Antoinette Renouf and Gordon Cain. As I received bibliographic
help from Jane Willis, Frank Smadja and Tony Berber Sardinha around
the same time, I include them as well. Here I would also like to
mention the help of my supervisor, Katie Wales, who got me this far,
and Peter Howarth, both of Leeds University. I alone am responsible
for any dross in this summary.

I at first thought of suggesting a Special Interest Group for
prefabrication but I think the mix of specialists on the CORPORA
mailing list and their different skills and interests provides a much
more challenging testing ground for any hypothesis or new idea. A good
example of this was the brief exchange which took place recently in
response to a question put by David Sorokin (June 28) about n-grams.
This debate unfortunately petered out within 24 hours. Tony Berber
Sardinha, on the same day, directed David Sorokin to his own (Tony's)
home page for a list of bigrams from the Brown Corpus. Tony's home
page in Sao Paulo is a happy hunting ground for corpus watchers. His
doctoral thesis is available there. Eric Atwell (June 29) suggested
that longer n-grams would be genre-specific and recommended
dictionaries such as Collins English Dictionary and COBUILD as rich
sources of more 'core' multi-word lexical entries which could be used
to build n-grams. Chris Brew on the same day effectively closed the
debate with a short summary of an article by Slava Katz (1996),
published in the Journal of Natural Language Engineering, which, by
the sound of it, we should all be reading. (Check out the posting by
Chris Brew, June 29.) Tantalizingly brief though it was, I found this
flurry of four emails illuminating.
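
The kind of n-gram list mentioned in that exchange can be sketched in a few lines of Python. This is only a minimal illustration, not the method behind the Brown Corpus bigram list or the Katz article: the tokenizer, the function names and the sample sentences are all assumptions made up for the example, and a real study would need proper tokenization and sentence boundaries.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in the token list."""
    # zip over n staggered copies of the list yields each window of length n
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Invented sample text, for illustration only
sample = ("On the other hand the results show that the model fits. "
          "On the other hand the data are sparse.")
tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)
fourgrams = ngram_counts(tokens, 4)

for gram, count in fourgrams.most_common(3):
    print(" ".join(gram), count)
```

Because this sketch ignores sentence boundaries, some counted n-grams straddle two sentences (here 'fits on'); longer n-grams are also much sparser than bigrams, which is one reason they tend to look genre-specific.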

Work on prefabs or chunking is going ahead at several seats of
learning, namely Lund, Aston in Birmingham and Louvain. I mention
these three universities because I received replies from them; I know
there is much work on prefabs going on elsewhere. The work of Britt and Warren and
Wiktorsson at Lund is a meticulous sifting 'by hand' of small samples
(mostly taken from computer corpora). Warren's (1999) results show the
proportion of prefabs in texts to be 58.6% for her spoken corpus and
52.3% for her written corpus. It should be pointed out that in the
spoken corpus she counted verb contractions (e.g. I'm, don't, isn't,
let's) or reducibles, as she called them, as prefabs. For
consistency, she also considered the full written forms (I am, do
not, is not, let us) as prefabs. Wiktorsson (1998) reports levels of
prefabrication in a written corpus of 39.4% (using samples from a
novel and from newspapers and magazines). Surprisingly, a corpus of
poetry, in a sister study by Skold, was found to contain 21%
prefabricated language. I should add that Wiktorsson as well as
counting grammatical contractions (he's, shouldn't and so on) also
counted proper names (famous ones such as Bill Clinton and not so
famous ones) as prefabrications. This meticulous work at Lund is
laying the ground for further study. The work of Peter Howarth at
Leeds is similarly painstaking and 'manual'. Like Warren and
Wiktorsson he uses the computer to deliver his samples (from the
tagged LOB and other corpora) but relies on his intuitions for the
analysis of the data. In his study of verb-noun collocations in a
corpus of social science texts, Howarth found that, in 41% of their
occurrences, the 63 main verbs he studied in a subcorpus of LOB showed
some degree of restrictedness in the nouns they collocate with.
Howarth's work also provides a
carefully worked out taxonomy for the categorization of word
combinations. The examples given below in the bibliography of work at
Aston University show the variety of approaches to prefabs taken by
M.Sc. and doctoral students from different years. This is hardly
surprising with Jane Willis, Peter Roe and Frank Knowles all
associated with the Language Studies Unit at Aston. The work at Louvain
under Sylviane Granger is changing the face of SLA research and
learner corpora are now all the rage (justifiably, I think: I've built
one myself of about a million words of social science MA dissertations
written by ex-USSR students). Sylvie De Cock very kindly filled me in
on developments at Louvain: she herself is studying prefabrication or
recurrent word combinations in the speech of advanced learners. She
also read a paper focused on prefabs in argumentative essays at the
ICAME conference in Freiburg in May this year. Prefabs or chunking,
then, have relevance for many fields within linguistics: L1
acquisition, SLA and psycholinguistics (spoken language performance
within the usual time constraints of conversation, interview,
telephoning etc., along the spoken-written, formal-informal
continuum). See Chris Tribble's Genre article at his homepage, where
he discusses the work of Douglas Biber on genre and Michael Hoey's
recent work on semantic prosody:
http://ourworld.compuserve.com/homepages/Christopher_Tribble

Tribble shows how certain words can assume a local semantic prosody
very different from their usual dictionary definition. He mentions
words such as 'international' and 'professional', which have a
pronounced positive semantic prosody locally in his corpus of
international consultancy proposals. Could this taking on of a special
meaning in a genre be the first step in the process whereby one of the
lexical items in an expression, which was originally composed
according to the free choice principle, gets a specialized or more
figurative meaning (as in Howarth's restricted collocations) or even
becomes delexicalized (as in baked beans)? I have often enjoyed
thinking of words and the company they keep in terms of an analogy
from chemistry: morphemes as subatomic particles, words as atoms and
phrasemes as molecules, with compounds and reaction equations becoming
phrases and sentences. The study of phraseology would look at the
attractions, repulsions and indifference between words.

But it appears that even in the case of writing, when there is
plenty of time available for composing, there's still a surprising
amount of prefabrication (Wiktorsson, Warren, Howarth). Wanting to
sound natural and to satisfy readers' expectations might play a part
in this, as might the apposite use of set phrases which follow the
'house rules' of a genre and so function as a membership badge. A
lot of the freedom in writing academic English perhaps lies in the
choice of fixed expressions for assembly rather than in the choice of
individual words.
Prefabrication might explain why languages are learnable by young
children in the time available (Deacon 1998), without recourse to
nativist theories, and why they are transmittable to new generations.
Those nonce
forms that survive, that get transferred into the common stock, might
do so because they are, literally, more memorable than the other
competing nonce forms which pass into oblivion. In the terminology of
modern popular science the phrase which catches on is a successful
meme (Blackmore 1999:40-4). The rhyme, alliteration and assonance
found in many proverbs come to mind (Glaeser 1988:275). Generally
speaking, the idiom principle is the default one.
I would like to go forward by seeing whether Peter Howarth's findings
for a general social science corpus hold up for various subject areas
within the social sciences. I would also like to see to what extent
the Lund study of prefabs could be computerized using 'probes', as Tim
Johns describes them on his EAP homepage,
http://sun1.bham.ac.uk/johnstf/timeap3.htm (e.g. Johns shows how
'such as' captures many examples of the use of superordinates). I have
found some small words like 'as', 'so' and 'of' (cf. Sinclair 1991,
pp. 80ff) useful as probes for building up prefab lists: bigrams, then
trigrams and so on. This would be probabilistic
but increasingly less so. Also findings from the micro-studies could
be fed into the macro studies. I thank all who helped me so readily
and others, who in their discussions on CORPORA provide so many
different ways of seeing things. I apologize if I have misrepresented
anyone's views: this was my reading of the people mentioned.
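
The probe idea in the paragraph above (start from a small word such as 'as', collect the bigrams around it, then grow them into trigrams and beyond) might be sketched roughly as follows. This is a guess at one possible implementation, not Tim Johns's procedure or the Lund method: the function name probe_ngrams, the min_count threshold and the sample text are all hypothetical and invented for illustration.

```python
from collections import Counter

def probe_ngrams(tokens, probe, n, min_count=2):
    """Return the n-grams containing `probe` that occur at least `min_count` times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if probe in tokens[i:i + n]
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Invented sample text, for illustration only
text = ("as a result of the survey we argue that as a result of "
        "the reforms such as privatization prices rose as well as wages "
        "and as a result growth slowed")
tokens = text.split()

# Grow the candidate list from bigrams to four-word sequences
for n in (2, 3, 4):
    print(n, probe_ngrams(tokens, "as", n))
```

On this toy text the recurring candidates grow from 'as a' to 'as a result' to 'as a result of'. The frequency threshold is the probabilistic part: a recurrent sequence is only a candidate prefab, and the manual judgements of the micro-studies would still be needed to confirm it.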

Altenberg, B 1998 On the Phraseology of Spoken English: the evidence
of recurrent word combinations. In Cowie, A. (ed.) Phraseology:
theory, analysis and applications. Oxford University Press.
Altenberg, B and M. Eeg-Olofsson 1990 Phraseology in Spoken English:
Presentation of a project. In Aarts, J and W Meijs (eds) Theory and
Practice in Corpus Linguistics. Rodopi
Baigent, Maggie (1996) Speaking in chunks: an investigation into the
use of multi-word phrases in spoken English by advanced learners of
English. Unpublished MSc dissertation, Aston University, Language
Studies Unit.
Becker, Joseph D. (1975) The Phrasal Lexicon in Artificial
Intelligence Report No. 28; Bolt Beranek and Newman Inc. Report
No. 3081
Benson, M. 1985. Collocations and Idioms in Ilson, Robert
(ed) Dictionaries, Lexicography and Language Learning. Oxford:
Pergamon Press
Berber Sardinha, A. 1996 Writing Assessment and
Corpus Linguistics. Paper presented at the Applications of Corpus
Linguistics Seminar, Aston University, 19-4-96.
Blackmore, S. 1999 Meme, Myself, I. New Scientist 13 March 1999
no2177.
Bloor, T. and Bloor, M. 1991 Cultural expectations and
sociopragmatic failure in academic writing. In B Heaton, P Howarth
and P Adams (eds) Socio-cultural Issues in English for Academic
Purposes, 1-13. London: Macmillan
Bolinger, D. 1976. Meaning and Memory, Forum Linguisticum I: 1-14.
Britt, E. and B. Warren. (manuscript) The Idiom Principle and the
Open Choice Principle.
Butler C.S (forthcoming) Repeated word combinations in spoken and
written text in Butler C.S., Gatward, R.A., Conelly, J. H., Vismans
R.M. (eds) A Fund of Ideas: recent developments in functional
grammar. Amsterdam IFOTT
Bygate, M. (1988) Units of oral expression and language learning in
small group interaction. Applied Linguistics 9 (1) , p.59-82
Carter, R. A. & McCarthy M.J. (1988) Vocabulary and Language
Teaching Longman
Carter, R. A. (1987) Vocabulary: Applied Linguistic Perspectives
Allen and Unwin
Charles, S. (1996) The Vocabulary Organiser: a dynamic resource for
learners. Unpublished MSc dissertation, Aston University, Language
Studies Unit.
Choueka, Y., S. T. Klein, and E. Neuwitz 1983 Automatic retrieval of
frequent idiomatic and collocational expressions in a large corpus
Association of Literary and Linguistic Computing (ALLC) Journal,
4:34-38.
Fontenelle, T., Bruls, W., Thomas, L., Vanallemeersch, R. and
J. Jansen 1994 DECIDE Deliverable D-1a: Survey of collocation
extraction tools ms., DECIDE Project, University of Liege
Clear, J. 1993. From Firth Principles - Computational Tools
for the Study of Collocation in Baker, M. G. Francis & E.
Tognini-Bonelli (eds) Text and Technology. In honour of John
Sinclair. Philadelphia/Amsterdam: John Benjamins.
Conrad, S. 1996 Investigating academic texts with corpus-based
techniques: an example from biology. Linguistics and education, Vol
8, 229-326.
Coulthard, M. (ed.) 1986 Talking about text. Birmingham,
UK: English Language Research, Birmingham University.
Cowie, A.P. and Howarth, P. 1996 Phraseology: a select bibliography.
International Journal of Lexicography 9: 1: 38-51.
Deacon, T. 1998 The Symbolic Species: the co-evolution of language
and the human brain. Harmondsworth: Penguin.
De Cock, S (1998) A recurrent word combination approach to the study
of formulae in the speech of native and non-native speakers of
English. International Journal of Corpus Linguistics 3 (1) 59-80.
De Cock, S , S Granger, G Leech and T McEnery (1998) An automated
approach to the phrasicon of EFL learners. In Granger, S (ed) Learner
English on Computer. London/New York: Addison Wesley Longman, 67-79.
Fielding, R. (1996) Students' use of lexical phrases in their
written work as indicators of degrees of exposure to the target
language. Unpublished MSc dissertation, Aston University, Language
Studies Unit.
Gitsaki, C. 1996 PhD Thesis on the Development of
ESL Collocational Knowledge is now available on the WWW. The URL is:
http://www.cltr.uq.oz.au:8000/users/christina.gitsaki/
Glaeser, R. 1988 The grading of idiomaticity as a presupposition for a
taxonomy of idioms. In W Huellen and R Schulze (eds) 1988
Understanding the lexicon, Max Niemeyer Verlag: Tuebingen.
Gramley, S and Paetzold, K. 1992. Words in Combination in A survey
of Modern English. London: Routledge.
Granger, S. 1998 Learner English on Computer. London: Longman.
Granger, Sylviane (1998) Prefabricated patterns in advanced EFL
writing: collocations and formulae. to appear in A. Cowie (ed.)
Phraseology: theory analysis and applications OUP.
Heylen, D., Maxwell, K. and S. Warwick 1993 Collocations,
Dictionaries and MT pp. 69-80 in AAAI BLMT
Howarth, P. 1995 A computer-assisted study of collocations in
academic prose, with special reference to grammatical structure and
stylistic value. Unpublished Ph. D. thesis. University of Leeds.
Howarth, P. 1998 Phraseology and Second Language Proficiency. Applied
Linguistics 19, no. 1: 24-44.
Jackendoff, R. 1995. The Boundaries of the Lexicon in Everaert,
Martin et al (eds.) Idioms: Structural and Psychological
Perspectives. Hillsdale, New Jersey: Lawrence Erlbaum.
Katamba, F. 1993. "12: Idioms and Compounds: The Interaction of
the Lexicon, Morphology and Syntax" from Morphology. London:
Macmillan Press.
Kita, K.,Omoto, T., Yano, Y. and Y. Kato 1994
Application of corpora in second language learning: The problem of
collocational knowledge acquisition Second Annual Workshop on Very
Large Corpora, Kyoto.
Kjellmer, G. 1984 Some Thoughts on Collocational Distinctiveness pp.
163-171 in Corpus Linguistics I: Recent Developments in the Use of
Computer Corpora in English Language Research, ed. J Aarts and W
Meijs, Amsterdam: Rodopi
Kjellmer, G. 1987 Aspects of English Collocations pp. 133-140 in
Corpus Linguistics and Beyond, ed. Willem Meijs, Amsterdam: Rodopi
Kjellmer, G. 1991 A Mint of Phrases pp. 111-127 in English Corpus
Linguistics: Studies in Honour of Jan Svartvik ed. K Aijmer and B
Altenberg, London: Longman.
Kjellmer, G. 1994 Dictionary of English Collocations (based on the
Brown corpus) - Clarendon Press - Oxford - 3 volumes.
Lewis, M 1997 Implementing the Lexical Approach LTP
Merkel, M. 1992. Recurrent Patterns in Technical Documentation.
Institutionen foer datavetenskap. Universitetet och tekniska
hoegskolan. Linkoeping.
Merkel, M. 1993? Consistency and variation in
technical translations - a study of translators' attitudes in
Proceedings from the 1st International Translations Studies
Conference.
Merkel, M. Nilsson, B. and Ahrenberg, L. 1994. A
Phrase-Retrieval System Based on Recurrence in The Proceedings from
The Second Annual Workshop on Very Large Corpora. Kyoto.
Merkel, M. 1996. Checking Translations for Inconsistency - A Tool for
the Editor in Proceedings from AMTA-96. Montreal.
Mitchell, T. F. 1971. Linguistic 'Goings On': Collocations and Other
Lexical Matters Arising on the Syntagmatic Record. University of
Leeds.
Mochet, Marie-Anne et Charmiane O'Neil (1997) Expressions et
groupements discursifs de l'oral: questions d'inventaire et
propositions didactiques. Unpublished paper given at Colloque
Triangle, March 1997, British Council, Paris
Moon, R. 1998 Fixed Expressions and Idioms in English. A
corpus-based Approach Oxford: Clarendon Press.
Moon, Rosamund (1998) Frequencies and forms of phrasal lexemes in
English in A. Cowie (ed.) Phraseology: theory, analysis and
applications OUP
Mori, Shinsuke, and Makoto Nagao 1996 Word
extraction. A Phrase-Retrieval System Based on Recurrence in The
Proceedings from The Second Annual Workshop on Very Large Corpora.
Kyoto.
Nagao, Makoto, and Shinsuke Mori 1994 A new method of n-gram
statistics for large number of n and automatic extraction of words
and phrases from large text data of Japanese Proceedings of
COLING-94, pp. 611-615
Nattinger, J. R. & DeCarrico, J. E. (1992)
Lexical Phrases and Language Teaching OUP
Ooi, V. 1998 Computer Corpus Lexicography. Edinburgh: Edinburgh
University Press.
Ozkaya, Sema (1996) The role of lexical phrases in
the macro-structuring of written discourse: a comparative study of
medical research articles by native and non-native speakers of
English. Unpublished MSc dissertation, Aston University Language
Studies Unit.
Pawley, A. and Syder, F. 1983. Two puzzles for
linguistic theory: nativelike selection and nativelike fluency, in
Richards, J. C. and Schmidt, R. W. (eds.) Language and Communication
7. 1: 191-226. London: Longman.
Pawley, Andrew. 1985a. Lexicalization
in Tannen, D (ed.) Georgetown University round table on language and
Phonetics. Washington, D.C., Georgetown University Press.
Pawley, A. 1985b. On Speech Formulas and Linguistic Competence. In
Linguas Modernas 12, 84-104
Pazienza, M. T., and P. Velardi 1994 A
not-so-shallow parser for collocational analysis pp. 447-453 in
Proceedings of COLING-94.
Peters, A.M. (1983) The units of language
acquisition CUP
Renouf, A. 1993 A word in time: First findings from
the investigation of dynamic text pp. 279-288 in English Language
Corpora: Design, Analysis and Exploitation, ed. J. Aarts, P. de Haan
and N. Oostdijk, Amsterdam: Rodopi
Renouf, A. and J. Sinclair (1991), "Collocational Frameworks in
English," pp. 128-143 in English Corpus Linguistics: Studies in
Honour of Jan Svartvik, ed. Karin Aijmer and Bengt Altenberg, London:
Longman.
Renouf, A. 1992 What do you think of that: A pilot study of
the phraseology of the core words of English pp. 301-317 in New
Directions in English Language Corpora, ed. Gerhard Leitner, Berlin:
Mouton.
Sekine, S. et al. 1992 Automatic learning for semantic
collocations Proc. of Third ANLP.
Sekine, S. 1994 A new direction for sublanguage N. L. P
International Conference on New Methods in Language Processing
(NeMLaP), UMIST, U.K. September 1994.
Sinclair, J 1992 The automatic analysis of corpora pp.379-397 in
Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82,
ed. Jan Svartvik, Berlin: Mouton.
Sinclair, J. (1991) Corpus, Concordance and Collocation OUP
Sinclair, J. 1987. Collocation: a progress report In Steele-Ross (ed.
& introd.); Threadgold-Terry (ed.). Language Topics: Essays in
Honour of Michael Halliday, I & II. Amsterdam : Benjamins. xxxii, 490
Sinclair, J. and Renouf, A. 1991. Collocational Frameworks in
English, In Aijmer, Karin and Altenberg, Bengt (eds.) English corpus
linguistics. London: Longman.
Smadja, F. (1993), Retrieving collocations from text: Xtract,
Computational Linguistics 19:143-177.
Stubbs, M. 1995 Corpus evidence for norms of lexical collocation. In
G. Cook and B. Seidlhofer (eds) Principles and Practice in Applied
Linguistics. London: Oxford University Press.
Stubbs, M. 1996 Text and Corpus Analysis. Oxford: Blackwell.
Sutarsyah, C., P Nation, and G Kennedy 1994 How useful is EAP
vocabulary for ESP: a corpus based case study. RELC Journal, Vol 25
no 2 (Dec) 34-50
Thouvenin, S. (1996) The identification and
exemplification of multi-word units within a technical corpus of
English, including an investigation of nominal groups. Unpublished
MSc dissertation, Aston University, Language Studies Unit.
Thurston, J. and Candlin, C.N. 1998 Concordancing and the teaching of
the vocabulary of academic English. English for Academic Purposes 17
(3) 267-279.
Warren, B 1999 (forthcoming) An alternative view of
stored linguistic knowledge and its relevance to text
composition. Text (the journal).
Widdowson, H.G. (1990) Aspects of Language Teaching OUP
Wiebe, J., R. Bruce and Lei Duan 1997 Probabilistic Event
Categorization. In Recent Advances in Natural Language Processing
(RANLP-97), European Commission, DG XIII, Tzigov Chark, Bulgaria,
September 1997, pp. 163-170.
http://xxx.lanl.gov/abs/cmp-lg/9710008
Wiktorsson, M. 1998 Compositional and Non-Compositional Aspects of
Written and Spoken Texts. Paper presented at the Conceptual
Structure, Discourse and Language Conference (CSDL-4) in Atlanta.
Willis, D (1990) The Lexical Syllabus London: Collins Cobuild
Willis, J (1997) 'Exploring Spoken Language: Analysis Activities for
Trainers and Teachers' in McGrath (ed) Learning to Train Prentice
Hall International
Willis, J and Willis, D (eds) (1996) Challenge
and Change in Language Teaching Heinemann ELT, papers 6 and 7
Winter, E. O. 1986 Clause relations as information structure: two
basic text structures in English. In Coulthard 1986 (ed) :88-108.
Yorio C A (1989) Idiomaticity as an indicator of second language
proficiency in Hyltenstam K and Obler L (eds) Bilingualism across
the Lifespan CUP pp55-72