[Corpora-List] Gross language detection

From: Jose Maria Gomez Hidalgo (jmgomez@dinar.esi.uem.es)
Date: Wed Jan 08 2003 - 11:27:44 MET


    Dear all

    As part of a classified ads posting system, a group of natural language
    processing students under my supervision has to develop a gross language
    detection system for Spanish. I do not know whether there is any prior
    work in this area (except maybe [1]).

    Do you have any ideas on how to do this?

    It seems rather heuristic, but my basic idea is:

    1. To build a dictionary of forbidden words (f**k, etc.).
    2. To develop a set of regular expressions that detect variations of the
    forbidden words (e.g. if "xyzt" is a forbidden word, then we have to
    detect "XyZt", "X_Y_Z_T" or small letter changes used in slang, such as
    a "k" instead of a "c").
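
    A minimal sketch of the two steps above, in Python. The word list, the
    substitution table and the function names are hypothetical placeholders
    chosen for illustration; a real system would need a proper Spanish
    lexicon and a more carefully tuned set of substitutions.

```python
import re

# Step 1: dictionary of forbidden words.
# Hypothetical placeholders; "xyzt" is the stand-in used in the message.
FORBIDDEN_WORDS = ["xyzt", "caxy"]

# Assumed slang substitutions: each letter maps to the set of characters
# that may stand in for it (e.g. a "k" written instead of a "c").
SUBSTITUTIONS = {"c": "ck", "k": "kc"}

# Punctuation or underscores that may be inserted between letters
# (e.g. "X_Y_Z_T"). Whitespace is deliberately excluded to limit
# false positives across word boundaries.
SEPARATOR = r"(?:[^\w\s]|_)*"

# "[^\W_]" matches a letter or a digit, so these lookarounds require that
# the match is not glued to a longer alphanumeric token on either side.
NOT_AFTER_ALNUM = r"(?<![^\W_])"
NOT_BEFORE_ALNUM = r"(?![^\W_])"


def build_pattern(word):
    """Step 2: compile a regex that matches simple variations of `word`."""
    letters = []
    for ch in word:
        alternatives = SUBSTITUTIONS.get(ch, ch)
        letters.append("[" + re.escape(alternatives) + "]")
    body = SEPARATOR.join(letters)
    # IGNORECASE covers case variations such as "XyZt".
    return re.compile(NOT_AFTER_ALNUM + body + NOT_BEFORE_ALNUM, re.IGNORECASE)


PATTERNS = {word: build_pattern(word) for word in FORBIDDEN_WORDS}


def find_forbidden(text):
    """Return the dictionary words for which a variation occurs in `text`."""
    return [word for word, pattern in PATTERNS.items() if pattern.search(text)]
```

    With these placeholders, find_forbidden("X_Y_Z_T") reports "xyzt" and
    find_forbidden("kaxy") reports "caxy", while a longer token such as
    "taxyzt" is not flagged. Whether to flag forbidden words embedded in
    longer tokens is a design choice; this sketch does not.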

    Thank you for your help

            Jose Maria

    _______________________________________________________________________________

    Jose Maria Gomez Hidalgo
    Departamento de Inteligencia Artificial
    Universidad Europea de Madrid
    28670 - Villaviciosa de Odon - MADRID
    (+34) 912115670
    jmgomez@dinar.esi.uem.es
    http://www.esi.uem.es/~jmgomez/
    _______________________________________________________________________________

    Spanish law guarantees privacy in electronic communications. This
    electronic transmission is strictly confidential and intended solely for
    the addressee. If you are not the intended addressee, you are kindly
    requested not to disclose nor to copy this transmission and to notify us as
    soon as possible.



    This archive was generated by hypermail 2b29 : Wed Jan 08 2003 - 11:31:17 MET