Re: Corpora: Number of pages on the Internet

From: Patrick Corliss (patrick@quad.net.au)
Date: Mon Dec 03 2001 - 19:36:01 MET

    On Mon, 3 Dec 2001 14:48:26 +0000 (GMT), Hristo Tanev wrote:

    > The question is: approximately how many pages in English
    > exist in Internet?

    I see that you have received a good reply from Associate Professor William H.
    Fletcher. I would make particular mention of the so-called "deep web". I
    don't know the English-language percentage, but see the quotation and URL below:

    "BrightPlanet's search technology automates the process of making dozens of
    direct queries simultaneously using multiple-thread technology and thus is the
    only search technology, so far, that is capable of identifying, retrieving,
    qualifying, classifying, and organizing both "deep" and "surface" content."

    http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp

    One of the pages, "Deep Web Sites", indicates that the sixty largest known
    deep Web sites contain about 750 terabytes of data (on an HTML-included
    basis), or roughly forty times the size of the known surface Web. These sites
    appear in a broad array of domains, from science to law to images and
    commerce. The total number of records or documents within this group is about
    85 billion.

    Basically, the folks at BrightPlanet found that "Deep Web sources store their
    content in searchable databases that only produce results dynamically in
    response to a direct request." Ordinary "spider" indexing of "surface" web
    sites misses this content, which BrightPlanet says is truly vast:

    * Public information on the deep Web is currently 400 to 550 times larger
    than the commonly defined World Wide Web.
    * The deep Web contains 7,500 terabytes of information compared to nineteen
    terabytes of information in the surface Web.
    * The deep Web contains nearly 550 billion individual documents compared to
    the one billion of the surface Web.
    * More than 200,000 deep Web sites presently exist.
    * Sixty of the largest deep-Web sites collectively contain about 750
    terabytes of information -- sufficient by themselves to exceed the size of the
    surface Web forty times.
    * On average, deep Web sites receive fifty per cent greater monthly traffic
    than surface sites and are more highly linked to than surface sites; however,
    the typical (median) deep Web site is not well known to the Internet-searching
    public.
    * The deep Web is the fastest-growing category of new information on the
    Internet.
    * Deep Web sites tend to be narrower, with deeper content, than
    conventional surface sites.
    * Total quality content of the deep Web is 1,000 to 2,000 times greater
    than that of the surface Web.
    * Deep Web content is highly relevant to every information need, market,
    and domain.
    * More than half of the deep Web content resides in topic-specific
    databases.
    * A full ninety-five per cent of the deep Web is publicly accessible
    information -- not subject to fees or subscriptions.
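
    As a rough sanity check on the multipliers above, the short Python sketch
    below simply recomputes them from the figures quoted in the list; no numbers
    beyond those quoted are assumed.

        # Consistency check of the BrightPlanet figures quoted above.
        # All inputs are taken from the bullet list; nothing here is new data.
        surface_tb = 19        # surface Web, terabytes
        deep_tb = 7500         # deep Web, terabytes
        surface_docs = 1e9     # surface Web, individual documents
        deep_docs = 550e9      # deep Web, individual documents
        top60_tb = 750         # sixty largest deep Web sites, terabytes

        print(deep_tb / surface_tb)      # ~395, the low end of "400 to 550 times"
        print(deep_docs / surface_docs)  # 550, the document-count ratio
        print(top60_tb / surface_tb)     # ~39.5, the "forty times" figure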

    To put these findings in perspective, a study at the NEC Research Institute
    (1), published in Nature, estimated that the search engines with the largest
    number of Web pages indexed (such as Google or Northern Light) each index no
    more than sixteen per cent of the surface Web.
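
    Combining those two sets of figures gives a back-of-the-envelope sense of
    scale (my own arithmetic, not a figure from either study): sixteen per cent
    of a billion-document surface Web is roughly 160 million documents, a tiny
    fraction of the combined surface and deep total.

        # Back-of-the-envelope combination of the NEC estimate with the
        # BrightPlanet counts quoted above (not a figure from either study).
        surface_docs = 1e9        # surface Web documents (BrightPlanet)
        deep_docs = 550e9         # deep Web documents (BrightPlanet)
        best_engine_share = 0.16  # NEC: largest engines index <= 16% of surface Web

        indexed_docs = best_engine_share * surface_docs
        total_docs = surface_docs + deep_docs
        print(indexed_docs)                     # ~160 million documents indexed
        print(100 * indexed_docs / total_docs)  # ~0.03% of surface + deep documents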

    With thanks to Tony Barry of the Australian [LINK] mailing list for drawing it
    to my attention with the posting below, and to Jan Whitaker of JLWhitaker
    Associates, Melbourne, Victoria, Australia <jwhit@primenet.com> for most of
    the above expansion (which includes her commentary).
    http://www.primenet.com/~jwhit/whitentr.htm

    On Sat, 20 Jan 2001 14:10:50 +1100, Tony Barry <me@Tony-Barry.emu.id.au>
    wrote to: <link@www.anu.edu.au>
    Subject: [LINK] Deep web

    > Extracted item for information.
    >
    > Source: THE NET NEWS
    > From Alan Farrelly
    > January 20, 2001
    >
    > - - - - -
    > DEEPEST WEB
    > The Deep Web, "hidden" under the surface Web, is much bigger than originally
    thought. The Deep Web consists of those searchable databases that only produce
    results dynamically in response to a direct request.

    > Ordinary indexing of surface sites misses this vast content. Public
    information on the deep Web is currently 500 times larger than the commonly
    defined World Wide Web, with 7,500 terabytes of data, compared to 20
    terabytes on the surface Web. That's 550 billion individual documents - while
    Google today offers a search of just 1,326,920,000 web pages. More at
    http://www.completeplanet.com/tutorials/deepweb/index.asp
    >
    > DEEP NET NEWS!
    Net News has done its bit for the Deep Web. Four of those terabytes are in
    the huge newspaper text and picture databases we've built over the last year -
    searchable text at http://www.newstext.com.au and viewable pictures at
    http://www.newsphotos.com.au and http://www.newspix.com.au - tens of
    millions of articles and photos available to anyone.
    >
    > GREY LADY EXPANDS
    And the Deep Net gets deeper. The New York Times is expanding its archives
    to include digital images of every page published from 1851 to 1998. The 3.5
    million pages are being digitised as part of a licensing deal with Bell and
    Howell:
    http://biz.yahoo.com/prnews/010112/dc_bell_ho.html
    > --
    > phone +61 2 6241 7659
    > mailto:me@Tony-Barry.emu.id.au
    > http://purl.oclc.org/NET/Tony.Barry

    Best regards
    Patrick Corliss


