Re: [Corpora-List] re: pronunciation (caveat)

From: Gregor Erbach (gor@acm.org)
Date: Tue Jul 30 2002 - 15:48:19 MET DST


    Quoting Damon Allen Davison <linguist@socal.rr.com>:
    > A caveat to all about relying too much on Google (and other search
    > engines) for corpus research:
    >
    > Although Google allows you to define the page language for searches, it
    > looks at ISO tags in the HTML source to determine this.

    Not exclusively. Google also uses the document content for language
    identification. Basis Technology (http://www.basistech.com/) claim
    that Google is a user of their language identification software.
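
    To illustrate what content-based identification means in general,
    here is a toy sketch that guesses a language by counting common
    function words. This is not Google's or Basis Technology's method,
    which I have no details on; the word lists and the name
    guess_language are made up for the example.

```python
import re

# Tiny illustrative stopword lists (assumed for this sketch);
# a real identifier would be trained on far more data.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "for"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "für"},
    "fr": {"le", "la", "les", "et", "est", "des", "une", "pour"},
}

def guess_language(text):
    """Return the code of the language whose stopwords occur most often, or None."""
    # Split into lowercase runs of letters (Unicode-aware, no digits or underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # If no stopword matched at all, decline to guess.
    return best if scores[best] > 0 else None
```

    The point is only that the text itself carries enough signal to
    override a wrong declaration in the markup.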

    On the Web, the language can be specified in the HTML "lang" attribute
    and in the HTTP 1.1 "Content-Language" response header.
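
    As a rough sketch of how one might read both declarations for a page
    that has already been fetched, using only the Python standard library.
    The names LangAttrParser and declared_languages are hypothetical, and
    the headers are assumed to arrive as a plain dict.

```python
from html.parser import HTMLParser

class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

def declared_languages(html_text, headers):
    """Return (html_lang, header_langs) as declared by the page and the server."""
    parser = LangAttrParser()
    parser.feed(html_text)
    header_value = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "content-language":
            header_value = value
    # Content-Language may list several comma-separated language tags.
    header_langs = [t.strip() for t in header_value.split(",") if t.strip()]
    return parser.lang, header_langs
```

    Either declaration can be missing or wrong, so the result is best
    treated as a hint and compared against the document content.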

    > Many people who
    > have their own web sites use software that by default inserts an
    > English-language ISO tag into their source. Therefore, any spelling
    > that happens to be a word in another language may indeed be written in
    > another language, despite what the search engine claims.

    I haven't found this to cause significant problems for
    the Google language identifier.

    regards,

       Gregor Erbach

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Dr. Gregor Erbach http://purl.org/net/gregor/
    Saarland University http://www.uni-sb.de/
    Computational Linguistics Dept. http://www.coli.uni-sb.de/
    Project COLLATE http://collate.dfki.de/
    Tel. +49 (681) 302-5354 mailto:gor@acm.org


