NEW RELEASE from the LINGUISTIC DATA CONSORTIUM and CELEX

Centre for Lexical Information (celex@mpi.nl)
Thu, 7 Dec 1995 20:18:01 +0100 (MET)

Announcing a
NEW RELEASE from the
LINGUISTIC DATA CONSORTIUM
and the
CENTRE FOR LEXICAL INFORMATION

This message announces the Second Release of the CELEX CD-ROM with
lexical data from the Dutch Centre for Lexical Information and the
Linguistic Data Consortium.

This CD-ROM contains an enhanced, expanded version of the German
lexical database (2.5), featuring approximately 1000 new lemma
entries, revised morphological parses, verb argument structures,
inflectional paradigm codes, and a corpus type lexicon. A complete
PostScript version of the German Linguistic Guide is also included, in
both European A4 and American Letter formats. The German database now
contains 51,728 lemmas with a total of 365,530 inflected forms.

Moreover, phonetic syllable frequencies have been added for (British)
English and Dutch. Apart from this, and the provision of frequency
information alongside every lexical feature, no changes have been made
to the Dutch and English lexicons.

Complete AWK scripts are now provided to compute those representations
described in the CELEX User Guide that are not stored directly in the
(plain ASCII) lexical data files. The User Guide is included on the CD
as well.

For each language, i.e. English, German and Dutch, the CD-ROM contains
detailed information on the orthography (variations in spelling,
hyphenation), the phonology (phonetic transcriptions, variations in
pronunciation, syllable structure, primary stress), the morphology
(derivational and compositional structure, inflectional paradigms),
the syntax (word class, word-class specific subcategorisations,
argument structures), and word frequency (summed word and lemma
counts, based on recent and representative text corpora) of both
wordforms and lemmas. Unique identity numbers allow the linking of
information from different files with the aid of an efficient,
index-based C-program.
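
As an illustration of linking by identity number, the following Python
sketch joins records from two files on a shared ID. It is not the C
program supplied on the CD. The backslash field separator, the position
of the ID in the first field, and the ID values shown are assumptions
made for this example only.

```python
def index_by_id(lines, sep="\\"):
    """Map the first field (assumed to be the identity number) to the other fields."""
    index = {}
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        index[fields[0]] = fields[1:]
    return index


def link(left_lines, right_lines, sep="\\"):
    """Combine the fields of records that share an identity number."""
    left = index_by_id(left_lines, sep)
    right = index_by_id(right_lines, sep)
    return {key: left[key] + right[key] for key in left if key in right}


# Hypothetical records: the IDs and the file layout are invented for illustration.
orthography = ["1\\helfen", "2\\Hund"]
frequency = ["1\\1225", "2\\364"]

linked = link(orthography, frequency)
print(linked)
```

Building a dictionary index first means each lookup by ID takes constant
time, rather than requiring a scan of the second file for every record.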

Like its predecessor, the CD-ROM is mastered using the ISO 9660 data
format, with the Rock Ridge extensions, allowing it to be used in VMS,
MS-DOS, Macintosh and UNIX environments. Because the new release omits
no data from the first edition, it replaces the old one entirely.

Institutions that hold LDC membership for the 1995 or 1996 Membership
Years can receive CELEX, for research purposes only, at no additional
charge, in the same manner as all other text and speech corpora
published by the LDC.

Non-members can receive a copy of CELEX, for research purposes only,
for a fee of $150. If you would like to order a copy of this corpus,
please email your request to ldc@unagi.cis.upenn.edu, or fax it to
(215) 573-2175. If you need additional information before placing your
order, or would like to inquire about membership in the LDC, please
send email or call (215) 898-0464.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.cis.upenn.edu/~ldc. More information specific to CELEX can
be accessed via hyperlinks from this Home Page. Information is also
available via ftp at ftp.cis.upenn.edu under pub/ldc; for ftp access,
please use "anonymous" as your login name, and give your email address
when asked for a password.

A brief overview of the revised German data on the CD is given below:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

THE GERMAN DATABASE

When starting to use the German database, the user first has to choose
between three so-called `lexicon types':

- a lemma lexicon
- a wordform lexicon
- a corpus type lexicon

Each lexicon type uses a specific kind of entry. The CELEX lemma
lexicon is the one most similar to an ordinary dictionary since every
entry in this lexicon represents a set of related inflected words. In
a lexicon, a lemma can be represented by a headword (cf. traditional
dictionary entries) such as `helfen' (help) or `Hund' (dog), or by a
stem such as `helf' or `Hund'.
The wordform lexicon yields all possible inflected words: every entry
in the lexicon is an inflectional variant of the related headword or
stem. So, a wordform lexicon contains words like `helfe', `hilft',
`geholfen', `huelfe', `Hundes', `Hunde' and so on. A corpus type
lexicon, on the other hand, simply gives you an ordered list of all
alphanumeric strings found in the corpus with raw string counts,
undisambiguated for relations to either lemmas or wordforms.
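
To make the idea of a corpus type lexicon concrete, here is a minimal
Python sketch that produces an ordered list of alphanumeric strings with
raw counts from a piece of text. The tokenisation rule, the case
sensitivity and the sort order are assumptions; the announcement does not
specify how the CELEX list was actually built.

```python
import re
from collections import Counter


def corpus_type_counts(text):
    """Return (string, raw count) pairs for every alphanumeric string, sorted."""
    # [^\W_]+ matches runs of letters and digits (no underscore, no punctuation).
    tokens = re.findall(r"[^\W_]+", text)
    counts = Counter(tokens)
    # Plain string sort: uppercase letters sort before lowercase ones.
    return sorted(counts.items())


# Hypothetical sample text, not taken from the Mannheim corpus.
sample = "Hunde helfe Hunde hilft Hunde"
type_list = corpus_type_counts(sample)
print(type_list)
```

Note that the counts attach to the strings themselves: nothing here
decides which lemma or wordform a given string belongs to.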

For all types of lexicons, the user may subsequently select any number
of columns -- from approximately 200 database columns -- combining
information on the orthography, phonology, morphology, syntax and
frequency of the entries.

LEXICAL DATA, GERMAN

The lexical data that can be selected for each entry in the different
German lexicon types can be divided into five categories: orthography,
phonology, morphology, syntax and frequency.

----------------------------------------------------------------------
Orthography       - with or without diacritics
(spelling)        - with or without word division positions
                  - number of letters/syllables

Phonology         - phonetic transcriptions which use different
(pronunciation)     notations like SAMPA or CPA and include:
                      - syllable boundaries
                      - primary stress markers
                      - consonant-vowel patterns
                      - number of phonemes/syllables

Morphology        - Derivational/compositional:
(word structure)      - division into stems and affixes
                      - flat or hierarchical representations
                  - Inflectional:
                      - stems and their inflections

Syntax            - word class
(grammar)         - subcategorisations per word class

Frequency         - Mannheim frequency(*)
----------------------------------------------------------------------
(*) These frequency data are based on the 6 million word corpus
compiled by the Institut fuer Deutsche Sprache in Mannheim, Germany.

EXAMPLE DATA, GERMAN

An arbitrary query using a small German lemma lexicon (that is, one
with very few columns) might yield the following result:

-------------------------------------------------------------------------
Headword       Pronunciation     Morphology:                M:   Cl  Freq
                                 Structured Segmentation    Cl
-------------  ----------------  -------------------------  ---  --  ----
helfen         "hEl-f@n          (helf)                     V    V   1225
Helfer         "hEl-f@r          ((helf),(er))              Vx   N    134
hellaeugig     "hEl-Oy-gIx       ((hell),(Auge),(ig))       ANx  A      0
hellblau       "hEl-blau         ((hell),(blau))            AA   A     28
Hellseher      "hEl-ze:-@r       (((hell),(seh)),(er))      AVx  N     20
hellseherisch  "hEl-ze:-@-rIS    (((hell),(seh)),(erisch))  AVx  A      0
hellwach       "hEl-vax          ((hell),(((wach),(e))))    AVx  A     13
Helm           "hElm             (Helm)                     N    N     22
Hund           "hUnt             (Hund)                     N    N    364
Huendchen      "hYnt-x@n         ((Hund),(chen))            Nx   N      7
hundekalt      "hUn-d@-kalt      ((Hund),(e),(kalt))        NxA  A      0
hundemuede     "hUn-d@-my:-d@    ((Hund),(e),(muede))       NxA  A      3
Hundeschnauze  "hUn-d@-Snau-ts@  ((Hund),(e),(Schnauze))    NxN  N      1
Hundesteuer    "hUn-d@-StOy-@r   ((Hund),(e),(Steuer))      NxN  N      6
Hundewetter    "hUn-d@-vE-t@r    ((Hund),(e),(Wetter))      NxN  N      0
Huendin        "hYn-dIn          ((Hund),(in))              Nx   N      7
huendisch      "hYn-dIS          ((Hund),(isch))            Nx   A      2
Huene          "hy:-n@           (Huene)                    N    N     13
huenenhaft     "hy:-n@n-haft     ((Huene),(n),(haft))       Nxx  A      4
Hunger         "hU-N@r           (Hunger)                   N    N    102
Hungerkur      "hU-N@r-ku:r      ((Hunger),(Kur))           NN   N      5
Hungerlohn     "hU-N@r-lo:n      ((Hunger),(Lohn))          NN   N      6
hungern        "hU-N@rn          ((Hunger))                 N    V     33
Hungersnot     "hU-N@rs-no:t     ((Hunger),(s),(Not))       NxN  N     23
Hungerstreik   "hU-N@r-Straik    ((Hunger),((streik)))      NV   N     14
-------------------------------------------------------------------------
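
The bracketed strings in the Structured Segmentation column can be read
mechanically. The Python sketch below parses one such string into a
nested (hierarchical) structure and then derives a flat list of
morphemes from it. It is an illustration only, not one of the AWK
scripts on the CD; the function names are hypothetical, and the only
syntax assumed is the parentheses and commas visible in the example
rows.

```python
def parse_segmentation(text):
    """Parse a bracketed segmentation into nested lists of morpheme strings."""
    pos = 0

    def parse_group():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError("expected '(' at position %d" % pos)
        pos += 1
        items = []
        morpheme = ""
        while True:
            if pos >= len(text):
                raise ValueError("unbalanced brackets")
            ch = text[pos]
            if ch == ")":
                pos += 1
                break
            if ch == "(":
                items.append(parse_group())  # nested constituent
            elif ch == ",":
                pos += 1                     # separator between constituents
            else:
                morpheme += ch               # literal morpheme text
                pos += 1
        # A group holding only text is a single morpheme; otherwise a list.
        return morpheme if morpheme and not items else items

    tree = parse_group()
    if pos != len(text):
        raise ValueError("trailing characters after segmentation")
    return tree


def flatten(tree):
    """Reduce a hierarchical segmentation to a flat list of morphemes."""
    if isinstance(tree, str):
        return [tree]
    flat = []
    for child in tree:
        flat.extend(flatten(child))
    return flat


tree = parse_segmentation("(((hell),(seh)),(er))")  # the row for Hellseher
print(tree)
print(flatten(tree))
```

For the Hellseher row, the nested result groups `hell' and `seh'
together before `er' is attached, while the flat result simply lists the
three morphemes in order.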

Richard Piepenbrock
CELEX Project Manager

-- C E L E X --
-- The Centre for Lexical Information --

Max Planck Institute for Psycholinguistics
Wundtlaan 1
6525 XD NIJMEGEN
The Netherlands

Tel: (+31) (0)24 - 3615797
Fax: (+31) (0)24 - 3521213

E-mail: celex@mpi.nl

WWW-page: http://www.kun.nl/celex/