Re: [Corpora-List] re: pronunciation (caveat)

From: Gregor Erbach (gor@acm.org)
Date: Tue Jul 30 2002 - 15:48:19 MET DST

Next message: Tony Berber Sardinha: "[Corpora-List] Words in Context CD-ROM"

Previous message: Sabine Stoll: "[Corpora-List] Pear story corpora?"
In reply to: Damon Allen Davison: "[Corpora-List] re: pronunciation (caveat)"
Next in thread: Antoinette Renouf: "Re: [Corpora-List] pronunciation"

Quoting Damon Allen Davison <linguist@socal.rr.com>:
> A caveat to all about relying too much on Google (and other search
> engines) for corpus research:
>
> Although Google allows you to define the page language for searches, it
> looks at ISO tags in the HTML source to determine this.

Not exclusively. Google also uses the document content for language
identification. Basis Technology (http://www.basistech.com/) claim
that Google is a user of their language identification software.

On the web, the language can be specified in the HTML "lang" attribute
and in the HTTP 1.1 "Content-Language" response header.
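
For anyone who wants to check what a given page actually declares, here is
a minimal sketch in Python (standard library only). It reads the two
declarations mentioned above from a response that has already been fetched.
The names declared_languages and LangAttrParser and the sample data are my
own illustration; this says nothing about how Google or any other search
engine processes these values.

```python
from html.parser import HTMLParser


class LangAttrParser(HTMLParser):
    """Collects the value of the lang attribute on the <html> element, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_languages(headers, html_text):
    """Return (header_langs, html_lang) as declared by the server and the page."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("content-language", "")
    # Content-Language may carry a comma-separated list of language tags.
    header_langs = [tag.strip() for tag in raw.split(",") if tag.strip()]

    parser = LangAttrParser()
    parser.feed(html_text)
    return header_langs, parser.lang


# Hypothetical sample: the server says English, the page itself says German.
sample_headers = {"Content-Language": "en"}
sample_html = '<html lang="de"><body><p>Guten Tag</p></body></html>'
print(declared_languages(sample_headers, sample_html))
```

A mismatch between the two values, as in the sample, is one reason a search
engine might prefer to look at the document content as well.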

> Many people who
> have their own web sites use software that by default inserts an
> English-language ISO tag into their source. Therefore, any spelling
> that happens to be a word in another language may indeed be written in
> another language, despite what the search engine claims.

I haven't found this to cause significant problems for
the Google language identifier.

regards,

Gregor Erbach

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Gregor Erbach http://purl.org/net/gregor/
Saarland University http://www.uni-sb.de/
Computational Linguistics Dept. http://www.coli.uni-sb.de/
Project COLLATE http://collate.dfki.de/
Tel. +49 (681) 302-5354 mailto:gor@acm.org


This archive was generated by hypermail 2b29 : Tue Jul 30 2002 - 14:52:16 MET DST