Hi:

The papers in Science and Nature that describe the work Dr. Monaco mentioned can be found at http://www.ist.psu.edu/faculty_pages/giles/publications/

Dynamically generated pages are hard to count and are usually not counted. Based on growth patterns, we now estimate that there are about 4 to 5 billion publicly indexable web pages. Google seems to have the largest database. Any index generated by a search engine is usually not counted as a web page.

Estimates of the dark web seem to be based on the capture/recapture method discussed in our Science paper and are probably underestimates due to the innate limitations of this approach.

Duplicate pages are hard to eliminate, though much algorithmic work has gone into recognizing them; this is still an open issue. Furthermore, one can argue that some near duplicates should be counted.

For an interesting study of how much data/information, both unique and duplicate, there is in the world and how much is being produced, see: http://www.sims.berkeley.edu/research/projects/how-much-info/

Best regards,

Lee Giles

air-l-request@aoir.org wrote:
Message: 10
From: "Ellis Godard" <ellisgodard@starband.net>
To: <air-l@aoir.org>
Subject: RE: [Air-l] Re: Air-l digest, Vol 1 #94 - 4 msgs
Date: Tue, 28 Aug 2001 19:26:52 -0700
Reply-To: air-l@aoir.org
And how does one count dynamically generated pages? Are there as many web pages as books available through Amazon? Is Google's index counted as only two web pages even though almost every instance of the second one is different?
-----Original Message-----
From: air-l-admin@aoir.org [mailto:air-l-admin@aoir.org] On Behalf Of monaco
Sent: Tuesday, August 28, 2001 11:38 AM
To: air-l@aoir.org
Subject: [Air-l] Re: Air-l digest, Vol 1 #94 - 4 msgs
number of web pages worldwide
Regarding the number of web pages worldwide, I was informed that C. Lee Giles at Penn State (www.ist.psy.edu/faculty/giles.html) has developed some tools to sample and estimate the number of pages. I also learned of the distinction between the dark web (pages inaccessible to crawlers) and the pages that are available and referenced via search engines. It seems that one estimate puts the dark web at 90% of total pages.
Greg Monaco
Gregory E. Monaco, Ph.D.
Program Director, Advanced Networking
National Science Foundation
703-292-8948
_--
Dr. C. Lee Giles, David Reese Professor
School of Information Sciences and Technology
and Computer Science and Engineering
The Pennsylvania State University
University Park, PA, 16801, USA
giles@ist.psu.edu - 814 865 4461
http://ist.psu.edu/giles
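
For readers unfamiliar with the capture/recapture idea mentioned in the reply at the top of this thread, here is a minimal sketch of the basic two-sample (Lincoln-Petersen) form of the estimator. It is only an illustration under stated assumptions, not the procedure used in the Science paper: the function name and all the counts are hypothetical, and the formula assumes the two samples (for example, the page sets indexed by two search engines) are independent.

```python
def lincoln_petersen(n_a: int, n_b: int, n_both: int) -> float:
    """Estimate a total population size from two overlapping samples.

    n_a    -- number of items seen in sample A
    n_b    -- number of items seen in sample B
    n_both -- number of items seen in both samples
    Assumes A and B are independent samples of the same population.
    """
    if n_both <= 0:
        raise ValueError("need at least one item seen in both samples")
    # If B covers a fraction n_both / n_a of A, assume it covers the same
    # fraction of the whole population: N ~= n_a * n_b / n_both.
    return n_a * n_b / n_both


# Hypothetical counts, for illustration only (not figures from the papers).
pages_in_engine_a = 100
pages_in_engine_b = 80
pages_in_both = 20

estimate = lincoln_petersen(pages_in_engine_a, pages_in_engine_b, pages_in_both)
print(estimate)  # 400.0
```

If the two samples are positively correlated, as when both engines tend to index the same popular pages, the overlap is inflated and the estimate comes out too low. That is one way an approach of this kind can under-estimate the true total.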