Corpora: CL course - THANK YOU!

From: Ute Römer
Date: Tue Apr 16 2002 - 13:13:58 MET DST

    Dear list members,

    I'd like to thank all of you who responded to my query on 'How to
    organise a corpus linguistics course'! I knew that there were some nice
    corpus linguists out there, but I didn't know there were so many of
    you! A big "THANK YOU" goes to:
    Petra Maier
    Tylman Ule
    Nadja Nesselhauf
    Tony Berber Sardinha
    Damon Allen Davison
    Bilge Say
    V. V. Rykov
    Francois Maniez
    Frank H. Müller
    Oliver Mason
    Geoffrey Williams
    Anke Lüdeling
    Detmar Meurers.

    The comments you made helped me to make decisions about texts to choose
    and about how to structure the course. Attached to this email

    [From listadm: removed, but available at:]

    you will find the tentative schedule I handed out in the first session
    yesterday (which was fun actually: a nice small group of students who had
    no idea what corpus linguistics might be and didn't know why they had
    chosen the course but who seemed to be eager to learn everything about it
    and who all volunteered to do oral presentations).

    As for the advice list members gave me, I've added or paraphrased parts
    of their emails below.

    Thanks very much for your help again!
    All the best from Cologne,

    Petra Maier:
    I have been giving a CL1 course (4 hours/week, 2 h/week reserved for
    oral talks) for several semesters now. Jurafsky/Martin's "Speech and
    Language Processing" turned out to be a good source for short oral
    talks. We started with 10-20 students, which was very good, but now we
    have more than 60 students and oral talks are no longer feasible!

    Tylman Ule:
    May I recommend the TIGERSearch engine as a tool for querying corpora?
    It is, admittedly, a query tool that mainly targets highly annotated
    corpora (Penn Treebank, NEGRA, SUSANNE), but then, it comes for free and
    has a powerful query language designed with the linguist in mind. It is
    also available for a number of platforms (including Mac and Windows).
    There are corpus samplers that come with the tool, and any of the
    supported corpora may be imported if you decide to buy them.
    (It is a kind of plug, because it was partly developed in the DEREKO
    project, and I was in the DEREKO team.)
    As for high-volume data, I think the BNC still has no competitor with
    respect to the fine-grained categories that let you do research on
    differences in, e.g., gender/age/origin of speaker/writer, and, of
    course, text type. The SARA search engine that comes with it is
    definitely not the only way to access it, although I guess it should be
    simpler to install than any other solution. (I installed the BNC a long
    while ago, and decided to extract the data immediately to use it in
    Xlex for, e.g., concordancing. Sorry, this is another shameless plug.)

    Nadja Nesselhauf recommended choosing chapters from introductory
    textbooks (Biber/Conrad/Reppen 1998, Sinclair 1991), as well as Charles
    Fillmore's 1991 article and Inge de Mönnink's 1999 article, for student
    presentations.

    Tony Berber Sardinha:
    I've been giving Corpus Linguistics courses here in Brazil to non-native
    speakers of English for three years now, in a postgraduate department of
    Applied Linguistics. Most students are EFL teachers and teacher trainers.
    I stick to Windows - teaching students how to use Linux would just take
    too long. What I'd recommend in terms of software is WordSmith Tools
    (about 60 pounds for an individual license) and MicroConcord (free).
    WordSmith is powerful and relatively easy to use, although some would
    object to this and say that it has a steep learning curve, which is true
    only if you want to do the most 'advanced' stuff, such as key key words,
    indexing, clumps, etc. A tagger such as QTAG or WinBrill is also helpful
    (both free). As far as equipment goes, you might want to get hold of a
    computer projector for your computer room, so that students can follow
    you as you click along in WordSmith Tools or any other software. Some
    students do tend to get lost in the many windows that WS Tools opens. As
    far as content goes, one of the things that struck me over the years is
    how hard students find it to analyze concordances, and so I devote at
    least four 3-hour sessions to concordance analysis workshops, so that
    students begin to get a grip on how to identify patterns in
    concordances, represent these patterns consistently and evaluate their
    importance.

    You can see some of my course materials at

    I apologize in advance for bad links on that page, since this website is
    being transferred from another location.

    Damon Allen Davison:
    You should send Prof. Dr. Achim Stein an e-mail. He held a course on
    French corpus linguistics a few years ago in the Romanisches Seminar in
    Cologne. I thought his organization was quite good. (I am not entirely
    unbiased, though, because I was a tutor for this course...) He had a lot
    of materials in electronic form (everything, I think). But with students
    of English (Anglisten), you might just use McEnery/Wilson as your text.
    They are online, of
    BTW, we did use Cygwin/GNU/Linux under Windows because Achim's tools
    were actually bash scripts. Michael Barlow's MonoConc Pro is really
    good, though. I know that he sometimes offers special licenses for his

    V. V. Rykov:
    As for me, I plan to follow McEnery's CL book and Catherine Ball's
    course - both are on the WWW.
    My only problem is: I consider the BUC the best corpus for studying, and
    I had it for free on mainframe tape. But I cannot get it now, because it
    is only available for a fee.

    Bilge Say sent me his own course schedule and wrote:
    I am attaching the course outline of the course "Using Corpora for
    Language Research", hoping that it might help somewhat with the
    organization. Since my interest is in NLP and my students are cognitive
    science students, this is not strictly a Corpus Linguistics course,
    though. Some chapters of Biber's and McEnery's books might make
    presentation materials.

    Francois Maniez:
    I would also add the TACT concordancing software to the list, as well as
    the Amalgam POS-tagger, to which you can e-mail texts to be tagged (there
    is a choice of eight different tagsets). It is available at

    Frank H. Müller recommended choosing chapters from introductory
    textbooks rather than specific research articles. He also mentioned the
    books "Working with German Corpora" and "Computerlinguistik und
    Sprachtechnologie", edited by Kai-Uwe Carstensen et al., and an online
    course written by one of his colleagues and accessible via (link:

    Oliver Mason wrote:
    I wouldn't spend too much time on technical issues, as most UGs will
    probably not have to deal with that a lot. Geoff's "Language and
    Computers" gives a good intro to how to get your data. Annotation is also
    something that I feel is a bit overrated, as most people want to do
    analysis, not annotation, and you also have technical issues to deal with
    > Then I still need to find some more short articles (not longer than
    > 10-12 pages) I could give to my students to prepare short oral
    > presentations. Will selected sections from (introductory) CL textbooks
    > be more useful for this purpose than descriptions of specific research
    > projects?
    Not sure about CL textbooks; I'd rather use more applied texts, which show
    the benefit of corpus methods. I did a couple of sessions on lexicography
    recently, and Looking Up was a useful source which the students felt gave
    them good reasons why you would want to use a corpus.

    Geoffrey Williams:
    I teach a course in corpus linguistics in Nantes for students following
    the licence Sciences du Langage programme. They are studying general
    linguistics and will have had a course in English applied linguistics
    taught by myself in the first semester. I use the latter to give them the
    necessary background to contextualism. The CL course is optional in the
    second semester and consists of twelve 2-hour blocks in a computer room.
    I find that these students need a rapid hands-on approach whilst being
    given some background to humanities computing. The level of computer
    knowledge is highly variable, often nil, so it is necessary to be clear
    as to the difference between a Word doc and a text file, as they do not
    always see the difference. This is all done through a mixture of self
    discovery and teaching. Like Tony I do not use Linux, much as I would
    like to, as it is not readily available to these students. I quickly get
    onto text and use the concordancer generously provided free by
    Darmstadt, Wincord. This runs under Windows and does all the basic tasks
    needed in discovering language in context. I show them WordSmith as this
    is the best, but the fac is too stingy to buy anything. I do not go into
    POS tagging due to lack of time, but concentrate on what a concordancer
    can show when working on plain text. Once they have seen a concordancer
    at work I go into more detail as to what constitutes a corpus etc. For
    background reading I recommend:
    Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam: John
    Benjamins.
    Partington, A. 1998. Patterns and Meanings. Amsterdam: John Benjamins.
    Kennedy, G. 1998. An Introduction to Corpus Linguistics. Longman.
    and of course
    Sinclair, J. McH. 1991. Corpus, Concordance, Collocation. Oxford: Oxford
    University Press.

    Anke Lüdeling:
    If you want to work on German in your corpus class: there is a very nice
    tool for tagging and morphological analysis called MORPHY which you
    could use. I like it a lot for teaching purposes because it is easy to
    understand and to use - you can change a number of parameters and
    directly see the consequences. Another bonus is that the documentation
    on MORPHY is very well written. The two texts given below are so clear
    that students will be able to understand them without too much prior
    MORPHY can be downloaded from
    The documentation can be downloaded from
    For students I would especially recommend Rapp and Lezius (2001) and
    Lezius, Rapp & Wettler (1998).
    If you are still looking for (short) research papers, look at the
    proceedings of the Corpus Linguistics Conference in Lancaster, 2001:
    there are a number of very interesting topics.

    Detmar Meurers stressed the importance of focussing on
    theoretical aspects within corpus linguistics.
    As for short articles for student presentations he recommended "Corpus
    Annotation" by Roger Garside, Geoffrey Leech and Tony McEnery. He also
    recommended McEnery/Wilson's "Corpus Linguistics" as an introductory
