Corpora: New Technical Paper from UCREL

Dr Andrew Wilson (eia018@comp.lancs.ac.uk)
Wed, 30 Jul 1997 15:00:36 +0100

The following new Technical Paper has just been published by UCREL:-

Volume 10

Kaoru Takahashi:
A STUDY OF TEXT TYPOLOGY: MULTI-FEATURE AND MULTI-DIMENSIONAL ANALYSES

60 pages, comb binding, 13 tables, 6 figures, published July 1997
ISBN: 1 86220 035 1

An abstract is appended to this message.

The volume may be purchased at the price of 4.50 UK pounds from:

Chris Needham
UCREL Technical Papers
Department of Computing
Lancaster University
Lancaster LA1 4YR
United Kingdom

e-mail: cn@comp.lancs.ac.uk

The other volumes in the series are still available. For details and
prices, please see the series Web page:

http://www.comp.lancs.ac.uk/ucrel/tech_papers.html

-------------------------------------------------------------------------
AUTHOR'S ABSTRACT:

This paper is concerned with text typology. The LOB Corpus, a
million-word collection of British English texts, is used to
characterize text types and to identify the linguistic characteristics
of each type. Multivariate analysis of how the occurrence of the
assigned linguistic features varies among genre categories yields a
classification and systematization of those categories, and makes
explicit the characteristics of the linguistic features in each
classified group. The criteria of the classification are based
exclusively on the dimensions revealed statistically by the
multivariate analysis; the groupings are interpreted linguistically
afterwards. As a result of the analysis, two main dimensions, i.e.,
"narrative versus non-narrative concern" and "specification of content
versus generalization of content", enable the classification of the
genre categories in the LOB Corpus into three groups.

In the second stage of the paper, the research on text types shifts to
the syntactic level, focussing on the tag sequences in the LOB Corpus.
A similar statistical methodology is used to make explicit the
syntactic distinction between two contrasting linguistic groups, i.e.,
fiction and exposition.

Lastly, I touch upon discourse analysis. Linguistic features
concerning semantics, e.g., proper nouns and common nouns, enable a
more sophisticated classification of text types at the macroscopic level.

This paper concludes with a plan for future research on a
multi-feature and multi-dimensional approach.