Re: [Corpora-List] neologism finder tools

From: Antoinette Renouf (ant@rdues.liv.ac.uk)
Date: Thu Jun 12 2003 - 17:28:46 MET DST


    Dear Eric, Sylvana
    As Eric says, `new' has to be defined with respect to `old'. We define
    this in three different ways. In each case, I use `corpus' to refer to
    a chronological flow of electronic text (e.g. newspaper text) that is
    marked for approximate authorial date.

    One method is to bootstrap: start with the first time-chunk of text
    (day, week, month or quarter one), assume all its words are new and
    store them; run that list against time-chunk 2, find the differences
    and call those `time-chunk 2 new words'; run the cumulative list
    against chunk 3, and so on. This method means that you will have a
    graph of neologistic occurrence which registers 100% for time-chunk 1,
    but which evens out after a few months, as the normal rate of coinage
    is allowed to show through. At that point, you can disregard the data
    for the first few months statistically; linguistically, you observe
    those so-called new items with a pinch of salt and a lot of intuition.
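
    A minimal sketch of this bootstrap in Python, under stated assumptions:
    the function names, the crude regex tokenizer and the toy chunks are
    illustrative only and are not the AVIATOR/APRIL software. It reports,
    for each time-chunk, the word types not seen in any earlier chunk and
    the share of that chunk's types that are new.

```python
import re

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def bootstrap_new_words(chunks):
    """Yield (chunk_number, new_words, new_type_rate) for each time-chunk in order."""
    seen = set()  # cumulative list of every word type met so far
    for number, text in enumerate(chunks, start=1):
        types = set(tokenize(text))
        new = types - seen
        rate = len(new) / len(types) if types else 0.0
        seen |= types  # the cumulative list is what the next chunk is run against
        yield number, sorted(new), rate

# Toy time-chunks, invented for illustration.
chunks = [
    "the minister spoke to the press",
    "the minister spoke about the euro",
    "the press spoke about the euro and the internet",
]
results = list(bootstrap_new_words(chunks))
for number, new, rate in results:
    print(number, new, f"{rate:.0%}")
```

    The first chunk registers 100% by construction, which is why the
    early figures are best disregarded statistically.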

    The second method is related: you take the first 3 months or so of
    text flow and make a wordlist of it, which becomes your master
    wordlist, the one you deem for convenience to represent `the
    established lexicon' at start of play. You run the 4th month of text
    against it, and so on.
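
    The second method might be sketched as follows. This is only an
    illustration: `new_against_seed` and `seed_size` are hypothetical
    names, whitespace splitting stands in for real tokenization, and I
    assume that `and so on' means each later month is added to the master
    wordlist once it has been checked.

```python
def new_against_seed(chunks, seed_size=3):
    """Use the first `seed_size` chunks as the established lexicon and
    report the new word types found in each later chunk."""
    lexicon = set()
    for text in chunks[:seed_size]:
        lexicon.update(text.lower().split())

    report = {}
    for number, text in enumerate(chunks[seed_size:], start=seed_size + 1):
        types = set(text.lower().split())
        report[number] = sorted(types - lexicon)
        lexicon |= types  # assumption: the master wordlist keeps growing
    return report

# Toy months, with letters standing in for words.
months = ["a b c", "b c d", "c d e", "e f g", "g h a"]
report = new_against_seed(months)
print(report)
```

    Unlike the bootstrap, nothing is reported for the seed months, so the
    graph starts at month 4 without the artificial 100% spike.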

    The third method is to take a large external corpus or wordlist,
    authored prior in time to your data flow, as your established lexicon.
    The similarity or difference between this earlier text and your own
    will affect your results - for instance if you run your newspaper
    corpus against a masterlist of novels - or of a different newspaper.
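
    To illustrate how the choice of external wordlist affects the
    results, here is a small sketch with invented data. The function name
    and both reference lists are hypothetical; a real run would load a
    wordlist derived from an earlier corpus.

```python
from collections import Counter

def new_relative_to(reference_words, corpus_tokens):
    """Return {word: frequency} for corpus words absent from the reference list."""
    reference = {w.lower() for w in reference_words}
    counts = Counter(t.lower() for t in corpus_tokens)
    return {w: n for w, n in counts.items() if w not in reference}

# Toy newspaper text and two toy reference lists (not real corpus data).
tokens = "the chancellor raised the euro question and the euro fell".split()
novel_list = ["the", "and", "raised", "question", "fell"]
newspaper_list = novel_list + ["chancellor"]

vs_novels = new_relative_to(novel_list, tokens)
vs_newspaper = new_relative_to(newspaper_list, tokens)
print(vs_novels)
print(vs_newspaper)
```

    The same text yields a longer list of candidates against the
    novel-based list than against the newspaper-based one, simply because
    of the difference in text type.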

    All this raises many questions, as you can see - not least, whether
    the words that newly appear are new or just first-time occurrences in
    that arbitrary data sample (and so possibly not neologisms but
    revivals, highly technical terms, etc). But it is an approach that is
    automatable.

    With static corpora, such as the 1960s LOB/Brown and the 1990s
    FLOB/Frown, you can compare the wordlists of the earlier and later
    comparable corpora to find differences. Many linguists study static corpora but
    talk about change in them by using their own native-speaking intuition
    as the implicit source of authority as to established use.
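
    A comparison of two static wordlists can be sketched like this. The
    names are hypothetical, the token lists are invented and are not
    drawn from LOB or FLOB, and the `min_freq` threshold is my own
    addition to filter out one-off occurrences.

```python
from collections import Counter

def compare_wordlists(earlier_tokens, later_tokens, min_freq=1):
    """Return (only_earlier, only_later): words found in just one of the
    two corpora at least `min_freq` times."""
    earlier = Counter(earlier_tokens)
    later = Counter(later_tokens)
    only_earlier = sorted(w for w, n in earlier.items()
                          if w not in later and n >= min_freq)
    only_later = sorted(w for w, n in later.items()
                        if w not in earlier and n >= min_freq)
    return only_earlier, only_later

# Toy token lists standing in for an earlier and a later corpus.
earlier = "wireless telegram wireless radio".split()
later = "radio email email internet".split()
only_earlier, only_later = compare_wordlists(earlier, later, min_freq=2)
print(only_earlier, only_later)
```

    Words appearing only in the later list are candidates for new usage;
    those appearing only in the earlier list are candidates for
    obsolescence. Both still need checking by hand.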

    We have made one month's worth of neologistic data (April 1998)
    available online, and our neologism detection software can under some
    circumstances be licensed and applied to any corpus.
    See http://www.rdues.liv.ac.uk/aprdemo/

    Antoinette Renouf

    --------------------------------------------------
    Research and Development Unit for English Studies
    University of Liverpool
    19 Abercromby Square
    Liverpool
    L69 7ZG

    tel sec unit: +44 (0)151 794 2289
    tel: +44 (0)151 794 2286
    fax: +44 (0)151 794 2298
    email: ajrenouf@liv.ac.uk
    url: http://www.rdues.liv.ac.uk

    > Date: Thu, 12 Jun 2003 15:07:59 +0100 (BST)
    > From: Eric Atwell <eric@comp.leeds.ac.uk>
    > X-X-Sender: eric@cslin148.csunix.comp.leeds.ac.uk
    > To: krausse <krausse@fh-nordhausen.de>
    > cc: corpora@hd.uib.no
    > Subject: Re: [Corpora-List] neologism finder tools

    > Sylvana,
    > A problem with "retrieving new words in a corpus" is: "new" with respect
    > to what? You can easily find all words in a corpus with only one (or
    > two..) occurrences, which makes them "rare"; but "new" implies
    > your corpus builds on a larger monitor corpus tracking the language over
    > time. As I understand it, AVIATOR/APRIL is not just software for a
    > static corpus but infrastructure for processing a (large) monitor corpus.
    > Is this what you have?
    >
    > Eric Atwell
    >
    >
    > On Thu, 12 Jun 2003, krausse wrote:
    >
    > > Dear colleagues,
    > >
    > > In Lynne Bowker's and Jennifer Pearson's book "Working with Specialized
    > > Corpora" neologism finder tools like the ones used in the AVIATOR/APRIL
    > > project are mentioned.
    > >
    > > I wonder whether there are any free or commercial programs available or
    > > how other people go about retrieving new words in a corpus.
    > >
    > > Many thanks in advance,
    > >
    > > Sylvana Krausse
    > >
    >
    > --
    > Eric Atwell, CVL: Computer Vision and Language research group
    > Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
    > School of Computing, University of Leeds, LEEDS LS2 9JT
    > TEL: 0113-3435761 MOBILE: 0775-1039104 FAX: 0113-3435468
    > WWW: http://www.comp.leeds.ac.uk/eric EMAIL: eric@comp.leeds.ac.uk
    > Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
    >
    >


