Re: Corpora: Web Spider.

Philip Resnik (resnik@umiacs.umd.edu)
Thu, 7 Oct 1999 13:06:52 -0400 (EDT)

>Would you advise where to find a good web spider that can
>determine the language (given some parameters, or even the alphabet)?
>Mainly, I am searching for Arabic corpora.

I'm not aware of any generally available spiders that have built-in
language identification. However, what you're interested in has been
done at New Mexico State University. See:

Jim Cowie, Evgeny Ludovik, and Ron Zacharski, "An Autonomous,
Web-based, Multilingual Corpus Collection Tool", Proceedings of the
International Conference on Natural Language Processing and Industrial
Applications. 1998. <http://crl.nmsu.edu/~raz/langrec/nlpia.htm>

Their work did include Arabic as one of the languages.
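
As a rough illustration of the alphabet-based filtering mentioned in the
question (this is not the method used by the NMSU tool, just a minimal
sketch), a spider could keep a fetched page only when most of its letters
are in the Arabic script. The function names and the 0.5 threshold below
are hypothetical, and the sketch assumes the page text has already been
decoded to Unicode. Note that script is not the same as language: Persian
and Urdu pages would also pass this test.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names for Arabic-script letters start with "ARABIC".
    arabic = sum(1 for ch in letters if "ARABIC" in unicodedata.name(ch, ""))
    return arabic / len(letters)


def looks_arabic(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: True if at least `threshold` of letters are Arabic-script."""
    return arabic_ratio(text) >= threshold
```

A spider would call looks_arabic() on the extracted text of each page and
discard pages for which it returns False.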

Best,

Philip
----------------------------------------------------------------
Philip Resnik, Assistant Professor
Department of Linguistics and Institute for Advanced Computer Studies

1401 Marie Mount Hall            UMIACS phone:      (301) 405-6760
University of Maryland           Linguistics phone: (301) 405-8903
College Park, MD 20742 USA       Fax:               (301) 405-7104
http://umiacs.umd.edu/~resnik    E-mail: resnik@umiacs.umd.edu