Re: Corpora: Corpora of scientific texts

John Milton (lcjohn@uxmail.ust.hk)
Fri, 23 Oct 1998 15:17:04 +0800 (HKT)

Most people on this list will be familiar with web sites such as 'Books on
Line' by John Mark Ockerbloom. You can search by subject --
http://www.cs.cmu.edu/booksubjects.html
For example, the 'medicine' listing looks like it ranges from popular to
specialized, so it should be possible to start a reasonably large and
specific corpus from this site alone.

If you want to build a corpus from the web, note that publications of
several gov'ns are in the public domain (e.g., agencies of the U.S. gov'n
-- but not the British gov'n: the queen owns the lot). A simple web search
in Alta Vista will retrieve thousands of potentially useful texts, though
of course a lot of judicious weeding is necessary to find materials that
are representative of the type & 'quality' you're interested in. With the
help of a student I was able to build a 45-million-word corpus more-or-less
accurately representative of a number of specific domains (unfortunately
not available to anyone in the EEC:). Using an off-line web browser saves
a lot of time, and it only takes a simple program to filter out HTML
codes, graphics etc. if you want a clean ASCII text.
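
As a rough illustration of the kind of filter meant here, the following is
a minimal sketch in Python using only the standard library. It is not the
program used for the corpus described above; the names TextExtractor and
html_to_text are made up for this example, and dropping every non-ASCII
character is an assumption about what 'clean' means that may be too crude
for some material.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies.

    Hypothetical helper for illustration only.
    """

    SKIP = {"script", "style"}  # elements whose content is not running text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tags themselves (including <img>) are simply never recorded.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string as whitespace-normalized ASCII."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so words in adjacent elements are not fused together.
    text = " ".join(parser.parts)
    # Assumption: non-ASCII characters are dropped rather than transliterated.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", text).strip()
```

Calling html_to_text() on the contents of each saved page gives one line of
plain text per document. Joining the pieces with spaces keeps words in
neighbouring elements apart, at the cost of occasionally splitting a word
that has markup in the middle of it.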
.............................................
John Milton
Hong Kong University of Science & Technology
lcjohn@usthk.ust.hk