Helsinki Corpus of Swahili released
The Helsinki Corpus of Swahili (HCS) has been released and is
available at the Language Bank of Finland for academic research
purposes on an interactive Linux server and via a web interface,
WWW-Lemmie. All usage requires a personal user account.
HCS is an annotated corpus of Standard Swahili text. It contains news
texts from several current Swahili newspapers as well as from the news
site of Deutsche Welle. It also contains extracts from a number of
prose books, covering fiction, education and the sciences. The total
size of the corpus is 12.5 million words in 25,000 XML documents. The
XML format used is a derivative of TEI.
HCS has been annotated with SALAMA (Swahili Language Manager), a
multi-purpose language management environment, developed at the
University of Helsinki by Arvi Hurskainen, Professor of African
languages. The corpus contains information on such features as the
base form of the word (lemma), part-of-speech, and morphology,
including noun class affiliation and verb morphology. It also contains
the etymology of loan words and glosses in English.
For more information about the corpus (and a link to the web-based
application form), go to:
http://www.csc.fi/kielipankki/aineistot/hcs/index.phtml.en
Note that commercial use of the corpus, including the interactive
use of SALAMA, is possible, but must be negotiated separately with
Professor Hurskainen (ahurskai AT ling DOT helsinki DOT fi).
Best regards,
Mickel Grönroos and Manne Miettinen
The Language Bank of Finland
at the Finnish IT center for science CSC
Arvi Hurskainen
Professor of African languages, University of Helsinki
--
Kielipankki | Språkbanken i Finland | The Language Bank of Finland
The Finnish IT center for science CSC
PL 405 (Tekniikantie 15 a D), 02101 Espoo, Finland
+358-9-4572237