Announcement INL 38 Million Words Corpus 1996

Rob van Strien (ROB@rulxho.LeidenUniv.nl)
Fri, 23 Aug 1996 16:00:11 +0000

INSTITUUT VOOR NEDERLANDSE LEXICOLOGIE

On-line access to the INL 38 Million Words Text Corpus of Dutch, for
non-commercial purposes.

The Institute for Dutch Lexicology (INL) offers you the possibility to
consult a Dutch text corpus of ca. 38 million words over the
Internet. In 1994 and 1995, a 5 Million Words Corpus with diversified
composition and a 27 Million Words Newspaper Corpus were made
accessible in a similar way. Access is free of charge for
non-commercial purposes.

The 38 Million Words Corpus 1996 consists of three main components:
a component with varied composition (1970-1989), a newspaper
component (Meppeler Courant, 1992-1995) and a legal component
(1814-1989).

The user can define subcorpora, either on the basis of the
parameters (1) corpus component, (2) topic, (3) publication
medium/text type, and (4) period, or on the basis of selections from
text surveys presented on the screen. The user can ask for the size
of each defined subcorpus.
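
As an illustration only, selection by the four parameters and the size
query can be sketched as follows. This is a minimal sketch and not the
INL retrieval system: the record fields, the function names and the
sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical metadata record; the field names are illustrative only.
    component: str  # corpus component, e.g. "varied", "newspaper", "legal"
    topic: str
    medium: str     # publication medium / text type
    year: int
    n_words: int


def select_subcorpus(texts, component=None, topic=None, medium=None, period=None):
    """Return the texts that match every parameter that is not None.

    `period` is an inclusive (first_year, last_year) tuple.
    """
    selected = []
    for t in texts:
        if component is not None and t.component != component:
            continue
        if topic is not None and t.topic != topic:
            continue
        if medium is not None and t.medium != medium:
            continue
        if period is not None and not (period[0] <= t.year <= period[1]):
            continue
        selected.append(t)
    return selected


def subcorpus_size(texts):
    """Size of a subcorpus in words."""
    return sum(t.n_words for t in texts)


# Invented sample data, not taken from the real corpus.
SAMPLE = [
    TextRecord("newspaper", "regional news", "newspaper", 1993, 1200),
    TextRecord("newspaper", "regional news", "newspaper", 1995, 800),
    TextRecord("legal", "law", "statute", 1950, 3000),
    TextRecord("varied", "science", "book", 1985, 5000),
]

newspaper_early = select_subcorpus(SAMPLE, component="newspaper", period=(1992, 1994))
print(subcorpus_size(newspaper_early))
```

Parameters left at None are ignored, so calling select_subcorpus(SAMPLE)
without constraints returns the whole sample.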

The texts have been automatically annotated with lemma (head word)
and two types of part of speech (POS): a global one (13 POS
categories) and a fine-grained one (with subcategorization)
conforming to the MECOLB standard (EC-project MLAP93-21 MECOLB;
coordinator R. Neumann, Institut fuer Deutsche Sprache, Mannheim). The
MECOLB-tagset for Dutch was developed in cooperation with the TOSCA
Research Group (University of Nijmegen), under the responsibility of
Prof. dr. J. Aarts.

Most of the data has not been corrected, either at the level of the
text or at the level of POS and head word.

The retrieval system allows you to search for single words or for
word patterns, including some predefined syntactic patterns that can
be changed by the user. There are two query languages, which differ
in formalism. Searches may address the levels of word form, two types
of part of speech, and head word, both separately and in combination
by use of Boolean operators and proximity searches. During the
search, data concerning frequency and distribution over the texts are
provided at several levels. The output is most often a list of items
or a series of concordances (words in context) with a variable,
user-defined textual context. Sorting facilities may support your
analysis of the output data. With some limitations due to copyright,
the output of your searches can be transferred to your own computer by
e-mail. Transferring complete texts or substantial parts of texts is
not permitted.
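
To make the search levels concrete, here is a minimal sketch of a
combined query (head word AND part of speech), a proximity search and a
concordance line with user-defined context. It does not reproduce either
of the two INL query languages. The Token fields, the function names,
the POS labels and the sample sentence are all assumptions made for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # word form as it appears in the text
    lemma: str  # head word
    pos: str    # global part of speech (hypothetical labels, not the INL tagset)


def match(token, word=None, lemma=None, pos=None):
    """True if the token satisfies all given constraints (Boolean AND).

    The `word` constraint is expected in lower case.
    """
    return ((word is None or token.word.lower() == word)
            and (lemma is None or token.lemma == lemma)
            and (pos is None or token.pos == pos))


def proximity_hits(tokens, first, second, max_distance):
    """Positions i where `first` matches at i and `second` matches
    within max_distance tokens to the right of i."""
    hits = []
    for i, tok in enumerate(tokens):
        if not match(tok, **first):
            continue
        window = tokens[i + 1:i + 1 + max_distance]
        if any(match(other, **second) for other in window):
            hits.append(i)
    return hits


def concordance(tokens, position, context):
    """One keyword-in-context line with `context` tokens on each side."""
    left = " ".join(t.word for t in tokens[max(0, position - context):position])
    right = " ".join(t.word for t in tokens[position + 1:position + 1 + context])
    return f"{left} [{tokens[position].word}] {right}".strip()


# Invented example sentence with hand-assigned annotations.
SENTENCE = [
    Token("De", "de", "ART"),
    Token("rechter", "rechter", "N"),
    Token("heeft", "hebben", "V"),
    Token("de", "de", "ART"),
    Token("zaak", "zaak", "N"),
    Token("gisteren", "gisteren", "ADV"),
    Token("behandeld", "behandelen", "V"),
]

# Head word "hebben" tagged as verb, followed within 4 tokens by "behandelen".
for i in proximity_hits(SENTENCE, {"lemma": "hebben", "pos": "V"},
                        {"lemma": "behandelen"}, 4):
    print(concordance(SENTENCE, i, 2))
```

With a maximum distance of 3 instead of 4, the same query finds nothing
in this sentence, because "behandeld" is four tokens to the right of
"heeft".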

The providers of the texts have given permission for use of the
texts for non-commercial research purposes only.

Please note that for optimal use of the retrieval system, a VT 220
(or higher) terminal or an appropriate terminal emulator (e.g.
Kermit) is recommended.

For access to the corpora, an individual user agreement must be
signed. There is a separate user agreement for each corpus. An
electronic user agreement form can be obtained from our mailserver,
Mailserv@Rulxho.Leidenuniv.NL. In the body of your e-mail message,
type:

SEND [38MLN96]AGREEMNT.USE    for the 38 Million Words Corpus 1996
SEND [27MLN95]AGREEMNT.USE    for the 27 Million Words Newspaper Corpus 1995
SEND [5MLN94]AGREEMNT.USE     for the 5 Million Words Corpus 1994
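
For those who prefer to prepare the request with a script, the message
can be composed roughly as below. This is a sketch only: the sender
address and the short corpus labels are hypothetical, and it is an
assumption that the mailserver accepts more than one SEND line per
message. The code builds the message but does not send it.

```python
from email.message import EmailMessage

MAILSERVER = "Mailserv@Rulxho.Leidenuniv.NL"  # address given in the announcement

# File names as listed in the announcement, keyed by hypothetical short labels.
AGREEMENT_FILES = {
    "38mln96": "[38MLN96]AGREEMNT.USE",
    "27mln95": "[27MLN95]AGREEMNT.USE",
    "5mln94": "[5MLN94]AGREEMNT.USE",
}


def build_request(sender, corpora):
    """Compose a request with one SEND line per requested agreement form."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = MAILSERVER
    lines = [f"SEND {AGREEMENT_FILES[label]}" for label in corpora]
    msg.set_content("\n".join(lines) + "\n")
    return msg


# "user@example.org" is a placeholder sender address.
request = build_request("user@example.org", ["38mln96"])
print(request.get_content())
```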

Please make a hard copy of the agreement form, sign it, keep a copy
yourself, and return a signed copy to: Institute for Dutch
Lexicology INL, P.O. Box 9515, 2300 RA Leiden, The Netherlands. Fax:
31 71 527 2115.

After receipt of the signed user agreement, you will be sent your
username and password.

If you need additional information, please send an e-mail message to
Helpdesk@Rulxho.Leidenuniv.NL, or send a fax to Mrs. dr. J.G. Kruyt.
