Corpora: PhD thesis announcement

Reinhard Rapp (rapp@psycho.uni-paderborn.de)
Tue, 29 Jul 1997 17:54:23 +0200


Dear colleagues,

this is to announce the availability of my PhD thesis
"The computation of associations: a corpus-linguistic approach /
Die Berechnung von Assoziationen: ein korpuslinguistischer Ansatz"
on the World Wide Web. The URL is:

http://www.fask.uni-mainz.de/user/rapp/papers/disshtml/main/main.html

Please note that the thesis is written in German (links to
related papers published in English can be found on my home page).
For those who prefer a paper version, I have several free copies
of the manuscript available (please send me an e-mail if you are
interested). The thesis is also available as a book:
Rapp, R. (1996). Die Berechnung von Assoziationen: ein
korpuslinguistischer Ansatz. Hildesheim; Zuerich; New York:
Olms. ISBN: 3-487-10252-8; 272 pages; price: DM 54 / ~30 US$.

Please find below an abstract and a commented table of contents.

Best regards,

Reinhard

Reinhard Rapp, rapp@psycho.uni-paderborn.de
Universitaet-GH Paderborn, Fachbereich 2
Warburger Strasse 100, D-33095 Paderborn
Tel.: 05251/60-2908, Fax: 05251/60-3528
http://www.fask.uni-mainz.de/user/rapp

-------------------------------- ABSTRACT -----------------------------

It is shown that basic language processes such as the production of
free word associations, the cloze task (completion of missing words
in sentences), and the forming of syntactical word classes can be
simulated using statistical models which analyze the distribution
of words in large text corpora.

The free word associations as produced by subjects on presentation
of single stimulus words can be predicted on the basis of the
co-occurrences of words in texts by applying the law of association
by contiguity, which is well known from psychological learning theory.
Using appropriate text corpora, this approach was applied successfully
to both English and German. Language-specific differences
in the associative behavior of American and German subjects were
reproduced.
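
As a rough illustration of the co-occurrence idea described above, here is
a minimal Python sketch. It is not the algorithm from the thesis: the window
size, the ranking by raw counts, the toy word list, and the function names
cooccurrence_counts and associations are all assumptions made for this
example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=1):
    """Count how often two different words occur within `window` positions.

    Assumption: a fixed symmetric window; the thesis may define
    co-occurrence differently.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != word:
                counts[word][tokens[j]] += 1
    return counts


def associations(stimulus, counts, top_n=3):
    """Return the words that co-occur most often with `stimulus`.

    Assumption: raw counts are used for ranking; ties are broken
    alphabetically so the result is deterministic.
    """
    neighbours = counts.get(stimulus, Counter())
    ranked = sorted(neighbours, key=lambda w: (-neighbours[w], w))
    return ranked[:top_n]


# Invented toy data, for illustration only.
tokens = ["black", "white", "cat", "black", "white", "dog", "black", "night"]
counts = cooccurrence_counts(tokens, window=1)
print(associations("black", counts, top_n=2))  # prints ['white', 'cat']
```

On real corpora, raw counts tend to favour very frequent words, so some
form of frequency weighting would normally be needed; the sketch leaves
this out.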

Three applications are described which show the practical relevance of
the algorithm: the generation of suitable search terms for information
retrieval in bibliographic databases, the prediction of the associations
triggered by the words used in advertisements, and the automatic
generation of dictionaries from bilingual texts.

In other parts of the thesis, modified versions of the algorithm are
used for part-of-speech tagging, statistical translation, and the
prediction of missing words in texts.

------------------- TABLE OF CONTENTS (WITH COMMENTS) -------------------

1. STATISTICAL METHODS IN NATURAL LANGUAGE PROCESSING
(Argues that language acquisition can be considered as the detection
of regularities in the distribution of words in perceived language.
These regularities are reproduced during language production.)

2. THE COMPUTATION OF GERMAN WORD ASSOCIATIONS
(Word associations as produced by German subjects are predicted
on the basis of the co-occurrences of words in large corpora.)

3. THE COMPUTATION OF ENGLISH WORD ASSOCIATIONS
(It is shown that the same algorithm works for both English and German.)

4. IMPROVEMENTS FOR THE PREDICTION OF WORD ASSOCIATIONS
(Gives a detailed analysis of the distribution of stimulus/response pairs
in corpora.)

5. PREDICTION OF ASSOCIATIONS TO MORE THAN ONE STIMULUS WORD
(The responses to several stimulus words can be predicted by
superimposing the responses to single stimulus words.
The algorithm is used for solving crossword puzzles.)

6. GENERATION OF SEARCH TERMS FOR INFORMATION RETRIEVAL
(It is shown that the terms used by professional searchers
for information retrieval in bibliographic databases are
associations to the words in the problem description.)

7. ASSOCIATIVE TEXT ANALYSIS OF ADVERTISEMENTS
(Tries to predict the associative behavior of a subject upon
presentation of an advertisement.)

8. GENERATION OF DICTIONARIES FROM BILINGUAL TEXTS
(Argues that word translations can be considered as associations
between words of translated texts. The method of dynamic
programming is described.)

9. ASSOCIATIVE COMPLETION OF MISSING WORDS
(Statistics about the co-occurrences in small text windows are
used to predict missing words in texts.)

10. SPELLING CORRECTION
(The ability to predict missing words is used for spelling
correction.)

11. PART-OF-SPEECH TAGGING
(Includes a section on the automatic formation of word classes.)

12. IMPLEMENTATION OF THE SIMULATION PROGRAMS
(Describes various techniques to improve the efficiency of the
simulation programs.)

13. SUMMARY

Appendix A: DESCRIPTION OF GERMAN AND ENGLISH CORPORA
(Includes addresses of sources and price information.)

Appendix B: TABLES OF ASSOCIATION NORMS FOR GERMAN STIMULUS WORDS
(Includes the associations of subjects to single
stimulus words and to pairs of stimulus words.)

Appendix C: GERMAN SUFFIXES AS PREDICTORS FOR PART-OF-SPEECH
(Shows the distribution of 113 suffixes across
6 different parts of speech.)

Appendix D: WORD CLASSES FOR GERMAN PART-OF-SPEECH TAGGING
(With complete lists of all closed-class words.)
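
To illustrate the comment on chapter 5 (responses to several stimulus words
predicted by superimposing the responses to single stimulus words), here is
a minimal Python sketch. Treating "superimposing" as a plain sum of
response strengths is an assumption of this example, as are the function
name superimpose and all the numbers, which are invented.

```python
from collections import Counter


def superimpose(responses_per_stimulus):
    """Add up the response strengths obtained for each single stimulus.

    Assumption: superposition is modelled as simple addition; the
    thesis may combine the responses in a different way.
    """
    combined = Counter()
    for responses in responses_per_stimulus:
        combined.update(responses)  # adds the values key by key
    return combined


# Invented response strengths for three stimulus words.
single = {
    "cold": {"ice": 3, "winter": 2, "hot": 4},
    "white": {"snow": 5, "black": 6, "winter": 2, "ice": 1},
    "flakes": {"snow": 4, "corn": 3},
}

combined = superimpose(single.values())
best = max(combined, key=combined.get)
print(best)  # prints snow
```

No single stimulus has "snow" as its strongest response by a wide margin,
but it is the only word supported by more than one stimulus with high
strength, which is the kind of effect a crossword-style clue relies on.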
