Corpora: CLaRK System - an XML-based System for Corpora Development

From: Kiril Simov (
Date: Mon May 20 2002 - 18:53:03 MET DST

    Dear List members,

    I would like to announce the CLaRK System - an XML-based System
    for Corpora Development. It is available on the web page of
    the BulTreeBank Project:

    Please follow the "CLaRK System" link and then Download.

    Short description:

    CLaRK is an XML-based software system for corpora development.
    The main aim behind the design of the system is the minimization
    of human intervention during the creation of language resources.
    It incorporates several technologies: (1) XML technology;
    (2) Unicode; (3) Regular Cascade Grammars;
    (4) Constraints over XML Documents.

    For document management, storage and querying, we chose XML
    technology because it is widely used and easy to understand.
    The core of CLaRK is an XML editor, which is the main interface
    to the system. Besides the XML language itself, we implemented
    the XPath language for navigation in documents and the XSLT
    language for transformation of XML documents.

    For multilingual processing tasks, CLaRK is based on a
    Unicode encoding of the information inside the system.
    There is a mechanism for the creation of a hierarchy of
    tokenisers. They can be attached to the elements in the DTDs
    and in this way there are different tokenisers for different
    parts of the documents.
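
    As a rough illustration of tokenisers attached to elements, here is
    a minimal Python sketch. It is not CLaRK code: the element names
    ("s", "code"), the two tokenisers and the TOKENISERS mapping are all
    hypothetical, and a plain dictionary stands in for the attachment of
    tokenisers to elements in a DTD.

```python
import re
import xml.etree.ElementTree as ET

def word_tokeniser(text):
    """Hypothetical tokeniser: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokeniser(text):
    """Hypothetical tokeniser: every non-space character is a token."""
    return [c for c in text if not c.isspace()]

# Assumed mapping from element names to tokenisers (stands in for the DTD).
TOKENISERS = {"s": word_tokeniser, "code": char_tokeniser}

def tokenise(root, default=word_tokeniser):
    """Tokenise the text of each element with the tokeniser attached to it."""
    result = []
    for elem in root.iter():
        if elem.text and elem.text.strip():
            tokeniser = TOKENISERS.get(elem.tag, default)
            result.append((elem.tag, tokeniser(elem.text)))
    return result

doc = ET.fromstring("<doc><s>Hello, world!</s><code>a+b</code></doc>")
tokens = tokenise(doc)
```

    In this sketch the text of "s" is split into words and punctuation,
    while the text of "code" is split character by character.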

    The basic mechanism of CLaRK for linguistic processing of
    text corpora is the cascade regular grammar processor.
    The main challenge for these grammars is how to apply
    them to an XML encoding of the linguistic information. The system
    offers a solution that uses the XPath language to construct
    the input word for the grammar and an XML encoding of the
    categories of the recognised words.
    categories of the recognised words.
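
    The idea can be sketched in Python as follows. This is an
    illustration under stated assumptions, not the CLaRK processor:
    apply_rule and symbol are hypothetical helpers, the "pos" attribute
    and the categories NP and CL are invented for the example, and
    ElementTree supports only a small subset of XPath (the path "*"
    selects the children of an element). Each rule builds an input word
    from the selected nodes, matches a regular expression against it,
    and wraps the recognised span in a new element named after the
    category.

```python
import re
import xml.etree.ElementTree as ET

def symbol(elem):
    """Grammar symbol of a node: its 'pos' attribute, else its tag (assumed)."""
    return elem.get("pos", elem.tag)

def apply_rule(parent, pattern, category):
    """Apply one grammar level to the children of `parent`."""
    children = parent.findall("*")  # XPath-like selection of the input nodes
    # The input word: one "<symbol>" per selected node.
    word = "".join(f"<{symbol(c)}>" for c in children)
    for m in reversed(list(re.finditer(pattern, word))):
        if not m.group():
            continue  # ignore empty matches
        start = word[:m.start()].count("<")   # index of first matched node
        length = m.group().count("<")         # number of matched nodes
        group = children[start:start + length]
        node = ET.Element(category)           # XML encoding of the category
        position = list(parent).index(group[0])
        for child in group:
            parent.remove(child)
            node.append(child)
        parent.insert(position, node)

s = ET.fromstring(
    '<s><w pos="D">the</w><w pos="A">big</w>'
    '<w pos="N">dog</w><w pos="V">barks</w></s>'
)
# Cascade: the second level sees the categories produced by the first.
apply_rule(s, r"(<D>)?(<A>)*<N>", "NP")
apply_rule(s, r"<NP><V>", "CL")
```

    After both levels, "s" contains a single CL element whose children
    are the NP element and the remaining verb.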

    Several mechanisms for imposing constraints over XML
    documents are available. These constraints cannot be expressed
    with standard XML technology. The following types of constraints
    are implemented in CLaRK: (1) Regular expression constraints -
    additional constraints over the content of given elements based
    on a context; (2) Number restriction constraints - cardinality
    constraints over the content of a document; (3) Value constraints -
    restriction of the possible content or parent of an element in
    a document based on a context. The constraints are used in
    two modes: checking the validity of a document against a set
    of constraints, and supporting the linguist in his/her work during
    the building of a corpus. The first mode allows the creation of
    constraints for the validation of a corpus according to given
    requirements. The second mode supports the underlying strategy of
    minimising human labour.
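
    To make the validation mode concrete, here is a minimal Python
    sketch of a number restriction constraint. It only illustrates the
    idea and does not reflect how CLaRK represents constraints: the
    function check_number_restriction, its parameters and the element
    names are hypothetical, and the paths use the XPath subset of
    ElementTree.

```python
import xml.etree.ElementTree as ET

def check_number_restriction(root, context_path, target_path,
                             min_count, max_count):
    """Return (context node, count) pairs that violate the cardinality bound.

    For every node selected by `context_path`, the number of nodes selected
    by `target_path` relative to it must lie in [min_count, max_count].
    """
    violations = []
    for context in root.findall(context_path):
        count = len(context.findall(target_path))
        if not (min_count <= count <= max_count):
            violations.append((context, count))
    return violations

doc = ET.fromstring(
    "<text><s><w>a</w><v>is</v></s><s><w>b</w></s></text>"
)
# Assumed requirement: every <s> has exactly one <v> child.
bad = check_number_restriction(doc, ".//s", "v", 1, 1)
```

    Here the second "s" element is reported, because it has no "v"
    child.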

    With best regards,


    Kiril Simov
    BulTreeBank Project
    Linguistic Modelling Laboratory, CLPP,
    Bulgarian Academy of Sciences
    Acad. G.Bonchev St. 25A
    1113 Sofia, Bulgaria
