Re: [Corpora-List] labels of COLT files in BNC spoken

From: Eric Atwell (eric@comp.leeds.ac.uk)
Date: Wed Nov 12 2003 - 12:04:12 MET


    Ute,
    Leeds PhD student Bayan AbuShawar faced exactly the same problem
    - she is researching "chatbots", programs you can chat with,
    and has a program to "train" a chatbot with a BNC spoken dialogue
    transcript. We thought it would be interesting to use this
    chatbot-learner to "visualise" the language of London teenagers
    by training with some COLT files, but it wasn't easy for Bayan to track
    these down. David Lee's Excel spreadsheet doesn't explicitly include
    "from COLT" as a field; Knut Hoffland supplied a list of first lines of
    COLT files to match against BNC files, so we matched up some this way;
    but as far as I know there isn't anything in the BNC documentation equivalent
    to a list of filenames of files from COLT - Bayan ended up searching all
    spoken transcript files including teenager speakers (speaker age is in
    the header info).
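
    For anyone wanting to repeat the first-line matching, here is a rough
    Python sketch of the idea, not the script we actually used. It assumes
    you have already extracted the first line of each BNC spoken file into
    a dictionary yourself; the function names, labels and sample lines are
    all made up for illustration, and the normalisation (lowercase, strip
    punctuation) is only a guess at what is needed to make the two
    transcriptions comparable.

```python
import re


def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def match_first_lines(colt_first_lines, bnc_first_lines):
    """Return {bnc_file_id: colt_label} where the normalised first lines agree.

    Both arguments are dictionaries mapping an identifier to the first
    line of that file (hypothetical input format).
    """
    # Index the COLT lines by their normalised form for direct lookup.
    index = {normalise(line): label for label, line in colt_first_lines.items()}
    matches = {}
    for file_id, line in bnc_first_lines.items():
        label = index.get(normalise(line))
        if label is not None:
            matches[file_id] = label
    return matches


# Invented sample data, not real corpus lines or file names.
colt = {"colt_01": "Hello, what are you doing?"}
bnc = {
    "FILE_A": "hello what are you  doing",
    "FILE_B": "Good morning everyone.",
}
print(match_first_lines(colt, bnc))
```

    Exact matching after normalisation will miss files whose first lines
    were transcribed differently in the two corpora, which may be one
    reason we only matched up some of the files this way.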

    If you (or someone else) discover a solution, do please let us know...

    and in the meantime, feel free to try out the chatbots we have trained
    on various BNC files at http://www.comp.leeds.ac.uk/eric/

    - we have to demo these at the BCS Machine Intelligence contest at
      Cambridge Univ, December 16th, as an example of Machine Learning used
      to visualise sublanguage ... so feedback to help us carry off the
      trophy and GBP1000 cash prize is welcome!!!

    cheers

    eric atwell

    On Tue, 11 Nov 2003, Ute Römer wrote:

    > Dear all,
    >
    > I was wondering if any of you could tell me which text files in the BNC are COLT files. I checked David Lee's Excel spreadsheet and the BNC World list of texts (on the SARA2 start page) but didn't find the information I was hoping to get (maybe I didn't search long enough though).
    > The thing is that I'm trying to nail down repeated occurrences of "ai n't" plus progressive form (and missing form of TO BE plus progressive form) in BNC (spoken) data which I don't get in my Bank of English (brspok) data. I thought that the amount of teenage and adolescent language in the BNC might be a possible explanation for fragmentary constructions. It's not a big thing, really, and I suppose I could check the headers of all the BNC files my concordance examples come from (to see how old the participants are), but maybe there is an easier/faster option.
    >
    > Thanks in advance and best wishes. Ute
    >
    >
    > ************************************************************
    >
    > Ute Römer
    > English Department
    > University of Hanover
    > Königsworther Platz 1
    > 30167 Hannover
    > Germany
    >
    > Phone: +49 (0)511 762 2997
    > Fax: +49 (0)511 762 2996
    > E-mail: ute.roemer@anglistik.uni-hannover.de
    > http://www.fbls.uni-hannover.de/angli/
    >
    >

    -- 
    Eric Atwell, Senior Lecturer, Computer Vision and Language research group
    Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
    School of Computing, University of Leeds, LEEDS LS2 9JT
    TEL: 0113-3435761  MOBILE: 0775-1039104 FAX: 0113-3435468
    WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric@comp.leeds.ac.uk
    Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
    



    This archive was generated by hypermail 2b29 : Wed Nov 12 2003 - 12:27:51 MET