Corpora: lexical stylistics in Polish EFL corpus

Przemyslaw Kaszubski (przemka@amu.edu.pl)
Thu, 12 Mar 1998 12:25:51 +0100 (MET)

Hello Everyone,

My name is Przemek Kaszubski and I am an EFL writing teacher and
learner-corpora researcher at the School of English at Poznan
University. The head of my department (English Computational
Linguistics) is Prof. Wlodzimierz Sobkowiak, but some of you
may associate my School better with its long-time Director,
Prof. Jacek Fisiak.

I apologise for the length of this message, but the matter is
quite important to me.

I have decided to place a few queries on the list after
noticing a growing amount of bandwidth devoted to learner
corpora and also after exploring the list’s log files, where I
found some, but only fairly general, answers to my problems.

I am currently working on a PhD in which I try to demonstrate that
Polish student writers tend to overgeneralise and oversimplify
in their written English work. I have access to a self-compiled
mini-corpus of their argumentative and expository essay writing
(over 350 essays; 220,000 words and growing); the corpus is
called PICLE and has been developed for the International
Corpus of Learner English (you may refer to my homepage to find
out more on ICLE). As a national rep for the Polish part, I can
also use other resources gathered under ICLE (native and non-
native material of analogous type and size, INCLUDING a corpus
of native English students’ writing). I also have access to
other corpora of comparable size and content, which I have been
painstakingly collecting and acquiring:

1. a corpus of ‘professional’ English writing: so far consisting
only of LOB and BROWN samples (as well as non-copyright
extracts of newspaper articles from the Web); I’m very much
counting on the BNC, which my school has ordered, but this may
take time

QUESTION:
Does anyone know of any British/American public-domain or
inexpensive corpora which would correspond in terms of genre,
topics and style (preferably: exposition/argumentation;
general-interest topics; neutral to formal rather than
specialist academic) to my PICLE corpus?

2. a corpus of Polish university and secondary-school students’
compositions in Polish;

3. a corpus of ‘professional’ Polish writing: non-technical
parts of academic papers and other publications; quality press
articles and editorials.

All the corpora can serve for raw-text and POS-tagged analyses.

I’d like to prove my case on the lexical-stylistic level.

QUESTION:

What possible empirical measures of generality and simplicity
of style could I use, given the type of resources I have? I
would welcome any suggestions.

So far I have been pursuing more or less the following
solution:

Concentrating on the English corpora (PICLE and native) I would
like to use a semantic hierarchy such as WordNet (NB. version
1.6 has only just appeared and can be ftp’ed from Princeton or
Stuttgart) and use the depths of its hierarchies to study
incremental frequency bands of nouns derived from the corpora.
I expect (as Rundell and Ham discovered in their 1994 article
“A New Conceptual Map of English”, published in the EURALEX’94
Proceedings) to find that the Polish corpus will show a
significantly higher average hierarchical position for the
nouns found in it than any native corpus will. (I may
also test these findings against other non-native corpora to
see if the tendency is universal for the level of proficiency
investigated, i.e. years II-IV of English studies, roughly
corresponding to advanced EFL). To put it more simply: I expect a
higher relative degree of hyperonymy for Poles than for British
or American native writers (whether student or professional).
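
The depth measure described above could be sketched roughly as
follows. This is only an illustration under stated assumptions:
the toy hypernym table HYPERNYM, the helper names and the two
noun lists are all invented here and are neither WordNet data
nor PICLE results. It also ignores polysemy and multiple
hypernyms, which a real WordNet-based study would have to
handle. A smaller depth value means a higher (more general)
position in the hierarchy.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy hierarchy: each noun points to its hypernym (None = root).
# A real study would read these links from WordNet instead.
HYPERNYM = {
    "entity": None,
    "object": "entity",
    "animal": "object",
    "dog": "animal",
    "spaniel": "dog",
    "artifact": "object",
    "vehicle": "artifact",
    "car": "vehicle",
    "hatchback": "car",
}


def depth(noun):
    """Number of hypernym links between the noun and the root."""
    steps = 0
    while HYPERNYM[noun] is not None:
        noun = HYPERNYM[noun]
        steps += 1
    return steps


def band_depths(tokens, band_size):
    """Mean depth of noun types in successive frequency bands.

    Noun types are ranked by corpus frequency and cut into bands of
    `band_size` types, most frequent band first.
    """
    known = [t for t in tokens if t in HYPERNYM]
    ranked = [noun for noun, _ in Counter(known).most_common()]
    return [
        mean(depth(n) for n in ranked[i:i + band_size])
        for i in range(0, len(ranked), band_size)
    ]


# Invented noun tokens standing in for a learner and a native corpus.
learner = ["object"] * 5 + ["animal"] * 4 + ["vehicle"] * 3 + ["dog"] * 2 + ["car"]
native = ["dog"] * 5 + ["car"] * 4 + ["spaniel"] * 3 + ["hatchback"] * 2 + ["animal"]

learner_bands = band_depths(learner, band_size=2)
native_bands = band_depths(native, band_size=2)
print("learner:", learner_bands)
print("native: ", native_bands)
```

Comparing the two lists band by band gives the kind of contrast
the hypothesis predicts; whether any difference is significant
would still need a proper statistical test on real data.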

QUESTIONS:
1. So far effective hierarchical lexicons have only been
devised for English and (partly) for Dutch, Spanish and Italian
(EuroWordNet); I will not, or so it appears, have a Polish
wordnet with which to examine transfer mechanisms unless I resort
to the English WordNet (or rather American, since the British
English supplement is not yet ready) or to a Polish MRD which
implements a thesaurus structure. Do you find this procedure
appropriate? (I tend to discount Polish-English typological
differences here since my focus is on single lexical items
rather than phraseology, but perhaps I’m missing a point?)

2. Word Sense Disambiguation: semi-automatic systems are viable
these days but I don’t know of any that I could download and
use reliably and with ease on my data. I will gratefully accept
any reports to the contrary. If I need to disambiguate manually
(which I am not yet sure I want to do), I’d rather do so in the
knowledge that no existing software can help me. Naturally, my
option so far has been to use the WordNet system of senses:
WordNet 1.5 may be more appropriate since it has been around
and applied for some time, but I’ve yet to test it more
thoroughly against the new WordNet 1.6 (which comes with a
semantically tagged corpus AND a viewer).

As I mentioned above, I may run into problems over the
distinction between British and American English. PICLE
consists of both, since our School runs largely independent
programs for these varieties. In addition, it would be really
interesting to explore these varieties in learner
language. Perhaps the American way of perceiving the world will
appear better in tune with the Polish mentality than the
geographically closer British one? And indeed is there an
(allegedly conditioned) observable discrepancy between the
texts written by students in the British and the American
programs? Such questions are very intriguing and tempting to
examine; if I have to abandon them, I will do that very
reluctantly.

OTHER (yet lexicologically related) WordNet-based measures of
generalisation/simplicity I was thinking of testing:

1. the most frequent x nouns (taken in bands again if need be)
used by Poles writing in English show statistically greater
polysemy, i.e. are significantly more general, simple & basic;

2. the most frequent x nouns used by Poles are expected to be
statistically more often the basic synonyms within their
respective synonym sets. Since basic members of synsets (as
lexicological studies demonstrate) are usually characterised by
the highest degree of semantic coverage, their extensive use
might be associated with generalising.

The value(s) of x above will be established empirically and
form part of my final results.
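
The two measures listed above could be sketched as follows. This
is a minimal illustration only: the sense inventory SYNSETS, the
function names and the noun lists are invented, and treating the
first-listed member of a synset as its "basic" synonym is an
assumption of the sketch rather than a documented property of
WordNet. No word sense disambiguation is attempted, so each noun
is scored over all the synsets it belongs to.

```python
from collections import Counter
from statistics import mean

# Hypothetical sense inventory: synset id -> members. The first member
# is ASSUMED to be the "basic" synonym of the set.
SYNSETS = {
    "s1": ["thing", "object", "article"],
    "s2": ["thing", "matter", "affair"],
    "s3": ["thing", "situation"],
    "s4": ["problem", "difficulty", "predicament"],
    "s5": ["problem", "question"],
    "s6": ["car", "automobile"],
}


def senses(noun):
    """Ids of all synsets that contain the noun."""
    return [sid for sid, members in SYNSETS.items() if noun in members]


def polysemy(noun):
    """Measure 1: number of senses (synsets) the noun has."""
    return len(senses(noun))


def basic_share(noun):
    """Measure 2: share of the noun's synsets in which it is listed first."""
    own = senses(noun)
    return sum(SYNSETS[sid][0] == noun for sid in own) / len(own)


def top_x_profile(tokens, x):
    """Mean polysemy and mean basic share of the x most frequent nouns."""
    known = [t for t in tokens if senses(t)]
    top = [noun for noun, _ in Counter(known).most_common(x)]
    return mean(polysemy(n) for n in top), mean(basic_share(n) for n in top)


# Invented noun tokens standing in for a learner and a native corpus.
learner = ["thing"] * 6 + ["problem"] * 4 + ["car"] * 2 + ["matter"]
native = ["predicament"] * 5 + ["automobile"] * 4 + ["thing"] * 2 + ["question"]

learner_profile = top_x_profile(learner, x=2)
native_profile = top_x_profile(native, x=2)
print("learner (polysemy, basic share):", learner_profile)
print("native  (polysemy, basic share):", native_profile)
```

Running top_x_profile over a range of x values is one way the
cut-off could be explored empirically.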

Once again, apologies for making this so long. I thought, however,
that my questions would make more sense in context.

I welcome any comments, advice, pointers etc. Bibliographical
and methodological suggestions (e.g. on possible
interpretations of my findings) would also be very welcome.

Przemek Kaszubski

e-mail: przemka@main.amu.edu.pl
http://main.amu.edu.pl/~ifauam/skaszub.htm
School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan
POLAND

phone (office) +48 61 8528820
      (home)   +48 61 8200272
fax:           +48 61 8523103

Visit my corpus linguistics page: http://main.amu.edu.pl/~przemka