Corpora: LT TTT: tokeniser software

grover@cogsci.ed.ac.uk
Wed, 31 Mar 1999 23:10:20 +0100

[Apologies if you see this message more than once]

The Language Technology Group of the Human Communication Research
Centre, University of Edinburgh, is pleased to announce the release of
version 1.0 of LT TTT, a Text Tokenisation Tool. LT TTT is available
for free to individuals, researchers and development teams, provided
its usage is restricted to non-commercial purposes.

LT TTT is a text tokenisation system and toolset which enables users
to produce a swift and individually tailored tokenisation of text. It
will be of interest to computational linguists and to linguists who
need to annotate corpora. The LT TTT tools are fully compatible with
our LT XML tools and use their XML-handling API. The version now
available runs on Solaris 2.5.

The main component of the LT TTT system is a program called
fsgmatch. This is a general purpose cascaded transducer which
processes an input stream deterministically and rewrites it according
to a set of rules provided in a grammar file. It can be used to alter
the input in a variety of ways, although the grammars provided with
the LT TTT system are all used simply to add mark-up information.
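
As a rough illustration of the cascaded, mark-up-adding approach
described above, here is a minimal sketch in Python. It is not fsgmatch
and does not use the LT TTT grammar file format: the rule
representation, the function name run_cascade and the tag names W and
NUM are all assumptions made for the example. Each stage rewrites the
whole input left to right, and each later stage works on the output of
the stage before it.

```python
import re

# Hypothetical rule cascade: a list of stages, each a list of
# (pattern, replacement) pairs. This is illustrative only and is not
# the LT TTT grammar syntax.
CASCADE = [
    # Stage 1: wrap every run of non-whitespace characters in <W> tags.
    [(re.compile(r"\S+"), r"<W>\g<0></W>")],
    # Stage 2: relabel word elements made up solely of digits as numbers.
    [(re.compile(r"<W>(\d+)</W>"), r"<NUM>\1</NUM>")],
]


def run_cascade(text, cascade=CASCADE):
    """Apply each stage in order; later stages see earlier stages' output."""
    for stage in cascade:
        for pattern, replacement in stage:
            # re.sub scans left to right and rewrites non-overlapping matches,
            # so the result is deterministic for a given rule order.
            text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(run_cascade("pay 20 pounds"))
```

In this sketch the rules only add or relabel mark-up and leave the
underlying text untouched, which mirrors how the supplied grammars are
said to be used.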

With LT TTT come grammars to segment texts into paragraphs, segment
paragraphs into words, recognise numerical expressions, mark up money,
date and time expressions in newspaper texts, and mark up
bibliographical information in academic texts. These grammars are
accompanied by detailed documentation which allows you to alter
grammars to suit your own needs or develop new rule sets for
particular purposes.

The LT TTT system contains two statistical components: the first is a
part-of-speech tagger which assigns syntactic category labels to
words; the second is a sentence boundary disambiguator which
determines whether a full stop is part of an abbreviation or a marker
of a sentence boundary. These components are also distributed
separately from our web pages as LT POS.
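
To make the sentence boundary decision concrete, here is a toy sketch
in Python. Note that it is a hand-written heuristic, not the
statistical disambiguator shipped with LT TTT: the abbreviation list,
the capitalisation test, the labels and the function name
classify_full_stops are all assumptions made for the example.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}


def classify_full_stops(tokens):
    """Label each token ending in a full stop as abbreviation or boundary.

    A simple heuristic stand-in for a trained model: known abbreviations
    are never boundaries; otherwise a full stop is a boundary if it ends
    the text or is followed by a capitalised token.
    """
    results = []
    for i, token in enumerate(tokens):
        if not token.endswith("."):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if token.lower() in ABBREVIATIONS:
            label = "abbreviation"
        elif next_token is None or next_token[0].isupper():
            label = "sentence_boundary"
        else:
            label = "abbreviation"
        results.append((token, label))
    return results


if __name__ == "__main__":
    for token, label in classify_full_stops("Dr. Smith arrived. He left.".split()):
        print(token, label)
```

A statistical component would replace the fixed list and the
capitalisation test with evidence learned from data, but the input
and the two possible decisions are the same as in this sketch.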

For more information see

http://www.ltg.ed.ac.uk/software/ttt/

Part of the development work on LT TTT was supported by the
Engineering and Physical Sciences Research Council (EPSRC), grant
reference number GR/L21952.