If you want to build a corpus from the web, note that the publications of
several governments are in the public domain (e.g., agencies of the U.S.
government, but not the British government: the Queen owns the lot). A simple web search
in Alta Vista will retrieve thousands of potentially useful texts, though
of course a lot of judicious weeding is necessary to find materials that
are representative of the type & 'quality' you're interested in. With the
help of a student I was able to build a 45-million-word corpus more or less
accurately representative of a number of specific domains (unfortunately
not available to anyone in the EEC:). Using an off-line web browser saves
a lot of time, and it only takes a simple program to filter out HTML
codes, graphics, etc. if you want clean ASCII text.
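
For anyone who wants a starting point, here is a rough sketch of that kind of filter in Python, using only the standard library's html.parser module. It is not the program I used; the names TextExtractor and strip_html are made up for illustration, and throwing away non-ASCII characters at the end is an assumption about what "clean" means that you may well want to change.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring markup, scripts, and styles."""

    SKIP = {"script", "style"}  # elements whose contents are not running text

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Image tags and the like carry no text, so they simply drop out.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(page):
    """Return the visible text of an HTML page as plain ASCII."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    # Join the pieces with spaces, then collapse all runs of whitespace.
    text = " ".join(" ".join(parser.parts).split())
    # Assumption: characters outside ASCII are simply dropped.
    return text.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    sample = "<html><body><h1>Report</h1><p>Some <b>bold</b> text.</p></body></html>"
    print(strip_html(sample))
```

Joining the pieces with a space keeps headings and paragraphs from running together, at the cost of occasionally splitting a word that has a tag in the middle of it. For corpus work you would probably also want to keep paragraph breaks rather than flatten everything to one line.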
.............................................
John Milton
Hong Kong University of Science & Technology
lcjohn@usthk.ust.hk