[Corpora-List] seeking knorpora advice

From: Marco Baroni (baroni@sslmit.unibo.it)
Date: Sun Oct 12 2003 - 00:56:27 MET DST

    Dear all,

    I am in the process of creating a modified version of the Knoppix live
    CD for computational/corpus-based linguistics students.

    As the Knoppix page (www.knoppix.net) says, "KNOPPIX is a bootable CD
    with a collection of GNU/Linux software, automatic hardware detection,
    and support for many graphics cards, sound cards, SCSI and USB devices
    and other peripherals. KNOPPIX can be used as a Linux demo, educational
    CD, rescue system, or adapted and used as a platform for commercial
    software product demos. It is not necessary to install anything on a
    hard disk. Due to on-the-fly decompression, the CD can have up to 2 GB
    of executable software installed on it."

    Knoppix can be extremely useful for people who want to test or learn
    Linux but do not want to, or cannot, install it.

    The modified version I am preparing will have a set of tools and data
    that are specifically geared towards computational/corpus-based
    linguists who want to try Linux. I will make the ISO image available on
    my site.

    What I would like to ask you is:

    - what kinds of programs/tools would you recommend for the CD (of
    course, they must compile on Linux)?
    - what kinds of data (corpora, word lists...) would you include on the
    CD (I am particularly interested in freely distributable corpora)?

    I am looking for things that are released under the GPL or a similar
    license, so that I will not have problems putting them on the CD.

    In general, I would prefer easy-to-use, not-too-specialized programs
    that work on the command line.

    Some things I am planning to include:

    - N-gram Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html)
    - K-vec++ (http://www.d.umn.edu/~tpederse/parallel.html)
    - WordNet (http://www.cogsci.princeton.edu/~wn/)
    - ACOPOST collection of POS taggers (http://acopost.sourceforge.net/)
    - bow toolkit (http://www-2.cs.cmu.edu/~mccallum/bow/)
    - parts of the OPUS corpus (http://logos.uio.no/opus/)
    - various Perl modules that are useful for corpus/NLP work
    - my own scripts for term extraction and downloading corpora from the
    web

    What else???

    Of course, if somebody has already done something like this, I would be
    very curious to hear about it.

    If you are receiving this through the Corpora list, please reply to me
    directly. If there is interest, I will post a summary of the replies to
    the list. And I will definitely let you know when the CD is ready.

    Thanks in advance!

    Regards,

    Marco

    ---
    Marco Baroni
    University of Bologna
    http://sslmit.unibo.it/~baroni
    


