Re: [Corpora-List] XML annotation guidelines

From: Chris Brew (cbrew@ling.ohio-state.edu)
Date: Fri Jun 06 2003 - 17:52:20 MET DST


    On Fri, Jun 06, 2003 at 09:35:15AM -0400, Simpson, Rita wrote:
    >
    > Dear Corporist Colleagues,
    >
    > We are in the process of converting our corpus of transcribed
    > academic speech from SGML to XML, and adding additional annotation.
    > Can anyone point us to some standards or (preferably) precedents
    > for XML-ized annotation of:
    >
    > 1) POS tagging
    > and
    > 2) pragmatic markup (e.g., text segments manually identified as
    > 'narrative', 'disagreement', 'request', etc.)
    >
    > Within the TEI guidelines (P4), we've found some suggestions for the
    > POS tagging (but nothing yet for something like our pragmatic
    > categories), e.g.
    >
    > <s type="sentence">
    > <w ana="at">The</w>
    > <w ana="nn1">victim</w>
    > <m ana="gen">'s</m>
    > <w ana="nn2">friends</w>
    > ...
    > </s>
    >
    > But somehow this seems a bit more verbose than it needs to be.
    > Is this format standard, or are there other XML-style annotation
    > formats in use?

    1) Yes, it is standard. Why is verbosity a problem? If you want a
    compact format, you might choose to define your own. But if you do
    that, it is a good idea to also define a systematic,
    information-preserving automatic mapping from your compact format
    to a specific XML format and back. That way you get the benefit of
    XML's tools for transformation and validation, as well as whatever
    other benefits you obtain from working with your compact format.
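
    As a rough illustration of such a round-trip mapping, here is a
    minimal Python sketch. The compact format is invented for this
    example and is not any standard: each token is written form/tag,
    and a trailing * on the tag marks an <m> element rather than a
    <w>. It assumes that tags contain no slash, asterisk or
    whitespace, that word forms contain no whitespace, and that
    tokens are separated by single spaces. The function names are
    hypothetical too.

```python
import xml.etree.ElementTree as ET


def compact_to_xml(line):
    """Build an <s> element from one line of the hypothetical compact format."""
    s = ET.Element("s", {"type": "sentence"})
    for token in line.split():
        # Split on the LAST slash so that word forms may contain slashes.
        form, ana = token.rsplit("/", 1)
        name = "w"
        if ana.endswith("*"):  # trailing * marks an <m> element (assumption)
            name, ana = "m", ana[:-1]
        child = ET.SubElement(s, name, {"ana": ana})
        child.text = form
    return s


def xml_to_compact(s):
    """Inverse mapping: turn an <s> element back into a compact line."""
    tokens = []
    for child in s:
        marker = "*" if child.tag == "m" else ""
        tokens.append(f"{child.text}/{child.get('ana')}{marker}")
    return " ".join(tokens)


compact = "The/at victim/nn1 's/gen* friends/nn2"
xml_text = ET.tostring(compact_to_xml(compact), encoding="unicode")

# Round trip: compact -> XML text -> parsed XML -> compact.
assert xml_to_compact(ET.fromstring(xml_text)) == compact
```

    The round-trip assertion at the end is the point of the exercise:
    if it holds for every sentence in the corpus, nothing is lost by
    working in the compact form day to day.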

    2) You may want to look at the choices made in part-of-speech tagging
    the British National Corpus. One thing I noticed in your format is
    that applying a tool like Edinburgh's textonly to it would yield
    either

    Thevictim'sfriends...

    or

    The
    victim
    's
    friends
    ...

    with the difference arising from the choice of whether to interpret the
    newlines after the </w> tags as part of the document or not.
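
    The two outcomes can be reproduced with a small Python sketch.
    This is not the Edinburgh textonly tool, only an imitation of
    the behaviour described above using the standard library; the
    function name and the keep_layout_whitespace switch are made up
    for the example, and the elided "..." line is left out of the
    sample.

```python
import xml.etree.ElementTree as ET

XML = """<s type="sentence">
<w ana="at">The</w>
<w ana="nn1">victim</w>
<m ana="gen">'s</m>
<w ana="nn2">friends</w>
</s>"""


def text_only(xml_source, keep_layout_whitespace):
    """Concatenate all character data, optionally dropping
    whitespace-only text between elements."""
    root = ET.fromstring(xml_source)
    pieces = []
    for piece in root.itertext():
        # The newlines between tags arrive as whitespace-only pieces.
        if keep_layout_whitespace or piece.strip():
            pieces.append(piece)
    return "".join(pieces)


run_together = text_only(XML, keep_layout_whitespace=False)
one_per_line = text_only(XML, keep_layout_whitespace=True)

print(run_together)
print(one_per_line)
```

    Neither result gives back the string a reader would expect, since
    the markup by itself does not record where the spaces belong.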

    If you care about document layout, you may need to do something rather more
    complex. The BNC has a plausible solution to this problem (they include
    whitespace in the <W> </W> elements), but this complicates the problem of
    counting words. Whether that matters to you depends, I suppose, on what
    kinds of document you want to represent and why.
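
    To make the trade-off concrete, here is a small Python sketch.
    The markup below is not copied from the BNC; it simply reuses
    the element names from the example earlier in this thread and
    puts the trailing space inside each element, which is an
    assumption made for illustration. The helper names are
    hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence with layout whitespace stored inside the elements.
XML = (
    '<s><w ana="at">The </w><w ana="nn1">victim</w>'
    '<m ana="gen">\'s </m><w ana="nn2">friends</w></s>'
)


def layout_text(root):
    """Recover the running text, spaces included."""
    return "".join(root.itertext())


def token_forms(root):
    """Token forms with the layout whitespace stripped off again."""
    return [(child.text or "").strip() for child in root]


root = ET.fromstring(XML)
text = layout_text(root)
forms = token_forms(root)

print(text)               # running text with natural spacing
print(len(text.split()))  # whitespace-delimited words
print(len(forms))         # annotated tokens
```

    The layout comes out right, but the two counts disagree (three
    whitespace-delimited words against four annotated tokens), and
    any frequency count has to strip the element content first so
    that "The " and "The" are not treated as different forms.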

    Chris


