Re: Corpora: Corpus Junk mail

From: Jose Maria Gomez Hidalgo (
Date: Tue Mar 12 2002 - 10:30:55 MET

    At 16:18 11/03/2002 +0100, you wrote:

    >I'm planning to write a program that uses statistical methods to identify
    >junk e-mail. Does anyone know of a corpus of junk mail that I could use?

    A number of collections of spam and legitimate messages can be accessed
    from my page on Machine Learning for spam detection, at, including:

    * Ling-spam (
    and PU1 ( by
    Androutsopoulos and colleagues

    * Spambase (
    hosted at the UCI Machine Learning Repository
    ( and built by George Forman
    and colleagues

    These are relatively standard collections used for evaluating spam
    detection approaches, as you can see in my bibliography

    Koltz and colleagues comment in their paper (available at that they
    plan to make their spam collection public at their corporation website
    ( This collection may be very interesting.

    Alternatively, you can build a spam vs. legitimate collection using widely
    known spam repositories. The problem is legitimate email, which is not
    usually public. Like Androutsopoulos, you may use messages from a public
    list, but the most sensible approach is to use some publicly donated
    personal email, in order to reflect personal email usage.
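
    A minimal sketch of assembling such a collection, assuming the spam and
    the legitimate messages have already been saved as one raw message per
    file in two separate directories. The directory layout, the function
    names (load_messages, build_collection) and the "spam"/"legit" labels
    are illustrative assumptions, not part of any collection mentioned above.

```python
import email
import random
from email import policy
from pathlib import Path


def load_messages(directory, label):
    """Read every file in `directory` as one raw e-mail and pair its text with `label`."""
    examples = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        # Parse from bytes and let the email package decode headers and body.
        msg = email.message_from_bytes(path.read_bytes(), policy=policy.default)
        # Prefer the plain-text part; messages without one contribute only a subject.
        body = msg.get_body(preferencelist=("plain",))
        text = body.get_content() if body is not None else ""
        examples.append((msg.get("Subject", "") + "\n" + text, label))
    return examples


def build_collection(spam_dir, legit_dir, seed=0):
    """Merge both sources into one shuffled list of (text, label) pairs."""
    examples = load_messages(spam_dir, "spam") + load_messages(legit_dir, "legit")
    # A fixed seed keeps the ordering reproducible across runs.
    random.Random(seed).shuffle(examples)
    return examples
```

    The resulting list of (text, label) pairs can then be split into training
    and test portions for whatever statistical classifier is being evaluated.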

    Hope this helps

    >Cormac O'Brien
    >Department of Linguistics
    >University of Gothenburg
    >Box 200
    >S-405 30 Gothenburg
    >0046 (0)31 773 5234


    Jose Maria Gomez Hidalgo
    Departamento de Inteligencia Artificial
    Universidad Europea de Madrid - CEES
    28670 - Villaviciosa de Odon - MADRID
    (+34) 912115670

    Spanish law guarantees privacy in electronic communications. This
    electronic transmission is strictly confidential and intended solely for
    the addressee. If you are not the intended addressee, you are kindly
    requested not to disclose nor to copy this transmission and to notify us as
    soon as possible.
