Corpora: Tokenizer for English/French - lines of credit.

Noemi Preissner (noemi@CoLi.Uni-SB.DE)
Wed, 29 Jul 1998 12:34:48 +0200 (MET DST)

Hi,

some time ago I posted the following:

> can anybody give me a hint where I can find tokenizers for French and/or
> English text? Even rather simple scripts (e.g. perl) would be helpful!
> (Please don't recommend scripts splitting on white space only though ... )

I got some really helpful answers:

Oliver Mason sent me a sed script that was used for the Penn Treebank.
Daniel Ridings sent me two lex files, one for English, one for French.

All these scripts do a good job, but I did not use them, since a colleague
turned out to have a nice Perl script with more or less the same
functionality, so I decided to give preference to "our own product".
I very much appreciated your help though! (Thanks also to Mike Scott, who
suggested using WordSmith Tools.)

Noemi

noemi@coli.uni-sb.de
