Re: [Corpora-List] labels of COLT files in BNC spoken

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Thu Nov 13 2003 - 14:00:26 MET


    Lou,
    thanks for this expert clarification.
    Demo chatbots trained with a variety of BNC files are now on my web-page
    http://www.comp.leeds.ac.uk/eric/ and we can add more ....

    - I have a follow-up question: can you suggest any specific BNC spoken
      files which illustrate particularly "interesting" / idiosyncratic
      language use? For example, the BNC file with the most swearing? :)
      We want to identify a selection of "unusual" files, to train
      a collection of noticeably different chatbots.
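One rough way to shortlist such files is to rank them by how often words from a chosen list occur per 1,000 tokens. The sketch below is only an illustration under stated assumptions: the word list, the file IDs and the plain-text input (markup already stripped) are all hypothetical, and the tokeniser is a simple regular expression rather than the BNC's own word segmentation.

```python
import re
from collections import Counter

# Crude tokeniser: runs of letters and apostrophes (an assumption, not BNC tokenisation).
TOKEN_RE = re.compile(r"[a-z']+")


def rate_per_thousand(text, target_words):
    """Return occurrences of the target words per 1,000 tokens of one text."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in target_words)
    return 1000.0 * hits / len(tokens)


def rank_files(texts, target_words):
    """Rank file IDs by target-word rate, highest first.

    texts: dict mapping a file ID to its plain text (markup already removed).
    target_words: iterable of lower-case words to count.
    """
    rates = {fid: rate_per_thousand(text, target_words) for fid, text in texts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data; real input would be the text of each spoken file.
    sample = {"X1": "damn it all damn", "X2": "hello there friend"}
    for file_id, rate in rank_files(sample, {"damn"}):
        print(file_id, rate)
```

Normalising by file length matters here because the spoken files differ widely in size, so raw counts would mostly favour the longest files.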

    thanks

    Eric

    On 13 Nov 2003, Lou Burnard wrote:

    > Apologies for not contributing to this enquiry sooner. A number of
    > different issues seem to be confused here:
    >
    > 1. Which bits of COLT also appear in the BNC?
    > 2. How do I find out which bits of the BNC contain London teenage
    > speech?
    > 3. Is "ain't" characteristic of spoken London teenage language?
    >
    >
    > Here's what *I* think on each of these (see also
    > http://www.hf.uib.no/i/Engelsk/colt/COLTinfo.html):
    >
    > 1. None! COLT is the brainchild of Anna Brita Stenstrom and colleagues
    > at Bergen. With funding from Longman and others, they collected the
    > audio material which is the "fons et origo" of this material. Longman
    > made a transcription of (most of) this audio material and contributed it
    > to the BNC. Bergen made a *different* transcription of (most of) the
    > same audio, using different conventions, and different markup, and also
    > substantially revised the part of speech tagging. The result was
    > eventually published as COLT. They did not include any way of linking
    > their transcription to the older transcription in the BNC, in particular
    > they did not specify which files correspond with which. The BNC files of
    > course combine all conversations collected by a single respondent into
    > one file, whereas COLT has them in separate files.
    >
    > 2. Easy. Look at the <catRef> element in the header of each text and
    > select those which have appropriate values: (sdeage1 sdeage2 sporeg1 to
    > be exact). This gives 43 texts thus classified. You could further refine
    > this by looking for words like London in the header, of course, but it
    > probably isn't worth the effort.
    >
    > 3. Hmm. The problem is in the transcription. As Ylva Berglund found in
    > her study of "innit", any pronouncements about relative rates of these
    > quasi-lexicalized words in speech and writing have to be hedged around
    > with all sorts of caution. The BNC speech transcriptions went through at
    > least two normalization stages -- one using the transcriber's judgment
    > as to what was intended, and the other using an automatic spelling
    > correction tool. Paradoxically, I would expect "aint" or "ent" or
    > "innit" to get tidied up into "isn't" disproportionately more often in
    > the spoken transcripts than in the written texts, precisely for that
    > reason. You can't argue with "ain't" when it's there in black and white
    > on the page. The COLT speech transcription, however, was made by people
    > with a different agenda, and so I would expect them to be both more
    > sensitive to and more likely to wish to record such variation than the
    > BNC speech transcribers.
    >
    > Lou Burnard
    >
    > On Thu, 2003-11-13 at 07:38, Ute Römer wrote:
    > > Dear Eric, Bayan, and others,
    > >
    > >
    > > > but as far as I know there isn't anything in BNC documentation equivalent
    > > > to a list of filenames of files from COLT
    > >
    > > That's too bad. I was sure there had to exist such a list somewhere but
    > > apparently it doesn't (or nobody knows about it).
    > >
    > > I'm not 100% sure yet (more concordance checks required), but I think I've
    > > found the 377 COLT files. Last night I scrolled through the list of BNC
    > > texts (in SARA; unfortunately, it's not possible to copy and paste this list
    > > to search it automatically) and checked the bibliographic reference for
    > > quite a number of those labelled "n conversations recorded by X" in the
    > > list. It looks as if files KNR to KR2 and KSN to KSW (51 files, consisting
    > > of 1 to 39 conversations each) are COLT files, or most of them at least. You
    > > get information like
    > >
    > > "<hi>7 conversations recorded by `Robin' (PS58K) [dates unknown] with 6
    > > interlocutors, totalling 1126 s-units, 5165 words (duration not
    > > recorded).</hi>
    > >
    > > PS58K `Robin', 14, student, AB, male
    > >
    > > PS58L `Jones', teacher, male
    > >
    > > PS58M `Zoe', 13, student, female
    > >
    > > PS58N `Ben', 14, student, male
    > >
    > > PS58P `Oliver', 13, student, male
    > >
    > > PS5AV `Jenny', 13, student, female"
    > >
    > > -- sounds very COLTish to me.
    > >
    > > Also, I had a look at some headers of these files (checked the BNC texts in
    > > version 1.0 though) and spotted lots of COLT key items like "Hackney" or
    > > "Greater London". I then saved these 51 BNC files as a subcorpus and did a
    > > concordance check of "ai" in this collection (using SARA2) and of "ain"
    > > ("ai" didn't work here) in the real COLT (using WST). I found 307
    > > occurrences in my supposed COLT and 293 in the real one - not 100%
    > > convincing but not too bad either.
    > >
    > > However, if these files (my saved "COLT?" BNC subcorpus) really make up
    > > COLT, then most of my occurrences of "ain't" are not from teenage language.
    > > So, unfortunately, all that searching, browsing, and alerting you hasn't
    > > really solved my problem. Anyway, I guess I know a bit more about the BNC
    > > and COLT contents now (and about the importance of knowing exactly what's in
    > > your corpus - and, ideally, where it is).
    > >
    > > Thanks to Eric and to Linda Bawcom (who contacted me off the list).
    > >
    > > Best from Hanover... Ute
    > >
    > >
    > > ************************************************************
    > >
    > > Ute Römer
    > > English Department
    > > University of Hanover
    > > Königsworther Platz 1
    > > 30167 Hannover
    > > Germany
    > >
    > > Phone: +49 (0)511 762 2997
    > > Fax: +49 (0)511 762 2996
    > > E-mail: ute.roemer@anglistik.uni-hannover.de
    > > http://www.fbls.uni-hannover.de/angli/
    > >
    > >
    > > > Bayan ended up searching all
    > > > spoken transcript files including teenager speakers (speaker age is in
    > > > the header info).
    > > >
    > > > If you (or someone else) discover a solution, do please let us know...
    > > >
    > > > and in the meantime, feel free to try out the chatbots we have trained
    > > > on various BNC files at http://www.comp.leeds.ac.uk/eric/
    > > >
    > > > - we have to demo these at the BCS Machine Intelligence contest at
    > > > Cambridge Univ, December 16th, as an example of Machine Learning used
    > > > to visualise sublanguage ... so feedback to help us carry off the
    > > > trophy and GBP1000 cash prize is welcome!!!
    > > >
    > > > cheers
    > > >
    > > > eric atwell
    > > >
    > > >
    > > > On Tue, 11 Nov 2003, Ute Römer wrote:
    > > >
    > > > > Dear all,
    > > > >
    > > > > I was wondering if any of you could tell me which text files in the
    > > > > BNC are COLT files. I checked David Lee's Excel spreadsheet and the BNC
    > > > > World list of texts (on the SARA2 start page) but didn't find the
    > > > > information I was hoping to get (maybe I didn't search long enough
    > > > > though).
    > > > > The thing is that I'm trying to nail down repeated occurrences of
    > > > > "ai n't" plus progressive form (and missing form of TO BE plus
    > > > > progressive form) in BNC (spoken) data which I don't get in my Bank of
    > > > > English (brspok) data. I thought that the amount of teenage and
    > > > > adolescent language in the BNC might be a possible explanation for
    > > > > fragmentary constructions. It's not a big thing, really, and I suppose
    > > > > I could check the headers of all the BNC files my concordance examples
    > > > > come from (to see how old the participants are), but maybe there is an
    > > > > easier/faster option.
    > > > >
    > > > > Thanks in advance and best wishes. Ute
    > > > >
    > > > >
    > > > >
    > > > >
    > > >
    > > > --
    > > > Eric Atwell, Senior Lecturer, Computer Vision and Language research group
    > > > Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
    > > > School of Computing, University of Leeds, LEEDS LS2 9JT
    > > > TEL: 0113-3435761 MOBILE: 0775-1039104 FAX: 0113-3435468
    > > > WWW: http://www.comp.leeds.ac.uk/eric EMAIL: eric@comp.leeds.ac.uk
    > > > Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
    > > >
    > > >
    > > >
    > >
    > >
    > >
    >
    >
    >
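Lou's point 2 above, selecting texts by the codes in the <catRef> element of each header, could be sketched roughly as follows. This is a minimal illustration, not the official procedure. It assumes the codes sit in a `target` or `targets` attribute as a space-separated list (the attribute name may differ between BNC releases), and it reads the list "sdeage1 sdeage2 sporeg1" as "either age code together with the region code", which is an interpretation rather than something stated in the message. The function names are hypothetical.

```python
import re

# Matches <catRef target="..."> or <catRef targets="...">.
# The attribute name is an assumption; check it against your BNC release.
CATREF_RE = re.compile(r'<catRef\s+targets?\s*=\s*"([^"]*)"', re.IGNORECASE)

# Codes taken from the message; how they combine is an assumption.
AGE_CODES = {"sdeage1", "sdeage2"}
REGION_CODE = "sporeg1"


def catref_codes(header_text):
    """Return the lower-cased classification codes found in catRef elements."""
    codes = set()
    for value in CATREF_RE.findall(header_text):
        codes.update(code.lower() for code in value.split())
    return codes


def is_candidate(header_text):
    """True if the header carries one of the age codes and the region code."""
    codes = catref_codes(header_text)
    return bool(codes & AGE_CODES) and REGION_CODE in codes


if __name__ == "__main__":
    # Hypothetical header fragment, for illustration only.
    example = '<catRef target="spo sdeage1 sporeg1">'
    print(is_candidate(example))
```

A regular expression is used instead of an XML parser because the headers may be SGML rather than well-formed XML. To apply it to a corpus, read the header portion of each file and keep the file IDs for which `is_candidate` returns true.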

    -- 
    Eric Atwell, Senior Lecturer, Computer Vision and Language research group
    Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
    School of Computing, University of Leeds, LEEDS LS2 9JT
    TEL: 0113-3435761  MOBILE: 0775-1039104 FAX: 0113-3435468
    WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric@comp.leeds.ac.uk
    Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
    


