Re: Corpora: Collaborative venture

From: Tadeusz Piotrowski (tadpiotr@ii.uni.wroc.pl)
Date: Thu Jun 15 2000 - 12:58:57 MET DST

    Well, linguistics is certainly loads of fun, that is why I am doing it!!
    Lexicography is even better...

    That discussion has been one of the most informative and, err, amusing. The
    river is meandering a lot... Thanks a lot!
    I think the idea is great, and I am trying to persuade people in Poland to
    do the same, because corpora are thinly distributed here, and nobody is
    willing to share their precious collections. This way we can have a nice
    corpus.
    I am afraid, though, that the conclusion will be that linguistics is such
    fun...

    By the way, people who are and were at Cobuild and who worked hard on
    development of the unique format: you might be interested to know that there
    is a dictionary of Polish now that tries to do exactly what you did, and it
    uses a VERY similar format of description:
    Inny slownik jezyka polskiego, Warszawa 2000, PWN, editor-in-chief Miroslaw
    Banko, in two volumes.
    I say "tries" because the headword list is a compilation from previous
    dictionaries rather than derived from frequency lists.
    Quite interesting...
    Apologies for this aside.

    Regards

    Tadeusz Piotrowski
    ***************************************************************
    Department of English
    Opole University
    Oleska 48
    Opole
    POLAND

    mailing address:
    Zielinskiego 47/11
    PL-53-533 Wroclaw

    phone/fax (+48)71-3382664

    ----- Original Message -----
    From: Jem Clear <jem@cobuild.collins.co.uk>
    To: <corpora@hd.uib.no>
    Sent: Tuesday, June 13, 2000 1:23 PM
    Subject: Corpora: Collaborative venture

    > Re: the points raised by Eric Atwell (et al.) (see snippet below).
    >
    >
    > > >I agree if the sense tags have completely different meanings. However,
    > > >the differences in meaning between tags may be in shades of meaning
    > > >rather than a crisp decision that they are or are not the same....
    >
    > > ... I don't believe there is a clear, "self-evident" set of semantic
    > > tags. Semantic tagging could instead aim to annotate each word with
    > > a SET of semantic features, and "disambiguation" could aim to
    > > eliminate semantic features incompatible with context; this would
    > > allow for overlap and indeterminate sense-tagging. The set of
    > > semantic features for a word could be a bundle of semantic
    > > information, for example the lemma/root, subject-category code,
    > > selection restrictions, and meaning definition from LDOCE; instead
    > > of sense-tagging, if the aim was to eliminate features which were
    > > incompatible with context, you should get more inter-annotator
    > > agreement.
    >
    >
    > Oh dear! No, no, no. OK. Maybe I was being a little naive in
    > thinking that a large group of corpus linguists could even begin
    > to agree on a simple, but potentially useful, collaborative
    > scheme. A project in "semantic tagging" seems to my way of
    > thinking precisely what we do *not* need -- or rather we have
    > plenty of such projects going on at the moment anyway so there's
    > no widespread benefit to the linguistic community in having
    > a few more people sitting round discussing what exactly *are*
    > the set of primitive semantic components or how a semantic "entry"
    > should be structured or whatever.
    >
    > I was feeling reckless last Friday afternoon so thought I'd float
    > an extremely simple idea based on the assumption that speakers
    > of English (native or non-native) have some ability to pick from
    > a number of offered citations those which in their opinion match
    > a given dictionary definition. I am not so foolish as to believe
    >
    > a) that all respondents would select the same citations if offered the
    > same source set (this is the Consensus Issue)
    >
    > b) that the dictionary definition is "true" or "correct" or clearly
    > defines the boundaries of a word sense (this is the Which Tagset? Issue)
    >
    > c) that all citations selected by respondents would be "correct" (this
    > is the Quality Control Issue: aka the Noise Problem)
    >
    > Suppose in primitive times, when the only routes connecting towns and
    > villages were rough, muddy tracks, that someone proposes that the
    > community build a road by bringing bucketloads of rubble, stones, ash,
    > whatever and pack it down to make a hard flat surface. As soon as this
    > idea is proposed, one group of villagers get very excited because
    > no-one has told them how wide the proposed road should be (just wide
    > enough for one cart -- or wide enough for two carts to pass?). A wise
    > man from another town questions whether straw should be added to the
    > stones being thrown down -- straw may disintegrate and not last
    > through winter rains. Others get into fierce arguments about whether
    > the road should go straight from one village to another or should wind
    > around avoiding hills, deep valleys, marshland, etc.
    >
    > You get the idea! Just a few people bring along a few bucketloads of
    > stones and rubble and the road extends for no more than 5 metres,
    > despite the fact that almost everyone agrees that a road of some sort
    > would be much better than the rutted, filthy, muddy track along which
    > they have to walk, ride, or drive their livestock.
    >
    > Linguistics is such fun, isn't it?
    >
    > Jem Clear
    >
    > Electronic Development Director    phone: +44 (0)121-414-3926
    > Collins Dictionaries               fax:   +44 (0)121-414-6203
    > Westmere, 50 Edgbaston Park Road   email: jem@cobuild.collins.co.uk
    > Birmingham, B15 2RX, UK            WWW:   www.cobuild.collins.co.uk
    >
    >
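
The citation-picking idea in the quoted message (respondents choose, from a set of offered citations, those that match a given dictionary definition) could be tallied roughly as follows. This is only a sketch under assumptions that are not in the original proposal: the function name `aggregate_selections`, the respondent and citation identifiers, and the `min_votes` threshold are all hypothetical. Keeping only citations picked by several respondents is one simple way to approach the Consensus Issue and the Noise Problem. It does not address the Which Tagset? Issue.

```python
from collections import Counter


def aggregate_selections(selections, min_votes=2):
    """Tally which citations respondents matched to one dictionary definition.

    selections: dict mapping a respondent id to the set of citation ids
                that respondent judged to match the definition (hypothetical format).
    min_votes:  assumed noise threshold; a citation is kept only if at least
                this many respondents selected it.
    Returns a dict mapping each kept citation id to its vote count.
    """
    votes = Counter()
    for chosen in selections.values():
        # set() guards against a respondent listing the same citation twice
        votes.update(set(chosen))
    return {cid: n for cid, n in votes.items() if n >= min_votes}


# Hypothetical responses for a single definition
selections = {
    "r1": {"c1", "c2", "c4"},
    "r2": {"c1", "c2"},
    "r3": {"c1", "c3"},
}
kept = aggregate_selections(selections, min_votes=2)
print(sorted(kept.items()))
```

Citations selected by only one respondent (here `c3` and `c4`) are dropped rather than declared wrong; a lower threshold would keep them at the cost of more noise.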


