Google At How Many Billions? 9? 11?
Jean Véronis has done a statistical study showing that, across a random selection of popular words, Google’s index has grown by almost exactly 13% in the last two months. He studied the differences in search results for 16 words, comparing results from 11/22/2004 (just over a week after Google upped its public index number from 4 billion to 8 billion) to results from 1/22/2005. As shown in this graph, the result counts for all 16 queries grew in nearly the same proportion, falling almost perfectly on a straight line (coefficient of determination > 0.999), a fit too tight to ignore:
If all the queries fit a regression slope of 1.13, that means Google’s index has increased roughly 13% over the two-month period, which Jean says translates to an index of 9,105,590,456 pages, up from the 8,058,044,651 that has been reported over that period.
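For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. It only assumes the 13% growth factor applies uniformly to the reported front-page number; the function name is made up for illustration, and integer math is used so the rounding is exact.

```python
def scale_by_percent(count: int, percent_increase: int) -> int:
    """Scale a page count by a whole-number percentage, rounded to the nearest page."""
    return (count * (100 + percent_increase) + 50) // 100

reported_index = 8_058_044_651   # front-page figure quoted above
growth_percent = 13              # from the regression slope of 1.13

estimated_index = scale_by_percent(reported_index, growth_percent)
print(f"{estimated_index:,}")    # 9,105,590,456
```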
Of course, faithful readers know that Google is lying.
This screenshot, which I took on November 10th, shows that Google had at that point, at a minimum, 10,980,000,000 pages. See, that night, the new MSN Search entered beta, and Google felt the need to up its page count to make sure MSN couldn’t claim to have more pages. Since Google never reports the correct page count, all it had to do was recount its index, and post any random number higher than MSN’s 6 billion. Apparently, marketing decided 8 billion was a good number, but not before I snagged a screenshot of the results page in mid-count.
Let’s make a few assumptions. First, assume that 11 billion was as high as it got, and that had I come back an hour later, it wouldn’t have been 12. Second, when Google reports there being only 8 billion pages today for “the”, it’s lying. The actual number would have to be 12,407,400,000, or 10,980,000,000×1.13. Now, any number under 8 billion, we can trust. So, narrowing it down with my best list of 32 very popular terms on Google, excluding every previous term each time (i.e. searching for the, then a -the, then and -a -the…):
- the = 12,407,400,000
- a = 60,000,000
- and = 40,200,000
- of = 34,400,000
- i = 39,700,000
- this = 19,800,000
- that = 1,660,000
- or = 17,300,000
- you = 8,470,000
- e = 40,000,000
- s = 28,300,000
- your = 9,520,000
- page = 25,600,000
- not = 4,120,000
- d = 28,200,000
- t = 17,300,000
- us = 16,400,000
- l = 17,400,000
- c = 27,700,000
- can = 1,310,000
- http = 19,500,000
- if = 1,030,000
- do = 10,300,000
- other = 4,090,000
- m = 13,600,000
- o = 11,300,000
- but = 395,000
- n = 17,500,000
- y = 12,200,000
- my = 6,550,000
- news = 19,000,000
- b = 9,120,000
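The list above can be reproduced and totaled with a short sketch. Two things here are assumptions rather than anything Google confirms: that each exclusion query returns a slice that doesn’t overlap the earlier ones, and that the slices can therefore simply be added together. The function name is hypothetical, and the figure for “the” is the scaled estimate (10,980,000,000×1.13), not a number Google reported.

```python
def exclusion_queries(terms):
    """Build 'term -prev -prev2 ...' queries, excluding every earlier term."""
    queries = []
    for i, term in enumerate(terms):
        # Most recently searched term is excluded first, e.g. "and -a -the".
        excluded = " ".join(f"-{t}" for t in reversed(terms[:i]))
        queries.append(f"{term} {excluded}".strip())
    return queries

# Counts copied from the list above, in search order.
# "the" is the scaled estimate, not a reported result count.
counts = {
    "the": 12_407_400_000, "a": 60_000_000, "and": 40_200_000,
    "of": 34_400_000, "i": 39_700_000, "this": 19_800_000,
    "that": 1_660_000, "or": 17_300_000, "you": 8_470_000,
    "e": 40_000_000, "s": 28_300_000, "your": 9_520_000,
    "page": 25_600_000, "not": 4_120_000, "d": 28_200_000,
    "t": 17_300_000, "us": 16_400_000, "l": 17_400_000,
    "c": 27_700_000, "can": 1_310_000, "http": 19_500_000,
    "if": 1_030_000, "do": 10_300_000, "other": 4_090_000,
    "m": 13_600_000, "o": 11_300_000, "but": 395_000,
    "n": 17_500_000, "y": 12_200_000, "my": 6_550_000,
    "news": 19_000_000, "b": 9_120_000,
}

queries = exclusion_queries(list(counts))
total = sum(counts.values())

print(queries[:3])    # ['the', 'a -the', 'and -a -the']
print(f"{total:,}")   # 12,969,365,000
```

Under those assumptions, the 32 slices add up to 12,969,365,000 pages.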
Oh, and assuming a 13% trend for the rest of the year, Google on January 22, 2006 would have around 15.5 billion pages.
January 23rd, 2005 at 2:51 pm
Nathan: when you search ‘the’ on Google, the figure you get is the number of indexed *DOCUMENTS*. This includes HTML documents, but also PDF/DOC/.. files.
The number shown by Google on their main page is the number of indexed *WEBPAGES* (or HTML documents).
This is the reason for the difference.
January 23rd, 2005 at 2:56 pm
I wasn’t aware of that! It’s fascinating.
I am not sure that you can trust Google’s boolean operators too much, though. I checked recently, and the numbers just don’t add up:
http://aixtal.blogspot.com/2005/01/web-google-perd-la-boole.html
January 23rd, 2005 at 4:08 pm
I would think Google includes PDFs in their number of “web pages,” not that it would matter much, as these numbers don’t seem to be very accurate. In a past press release Google Inc. stated:
“Google Web Search: The company’s flagship search service now offers 4.28 billion web pages. Google’s powerful and scalable technology searches this information and delivers a list of relevant results in an instant. Google Web Search also enables users to search for numerous non-HTML files, including PDF, Microsoft Office, and Corel documents.”
http://www.google.com/intl/en/press/pressrel/6billion.html
January 23rd, 2005 at 4:17 pm
Google doesn’t really use anything to determine that front page number. As I said, it’s chosen by marketing, so they could be talking about the number of pages, html documents, unique web addresses, .com domains, pretty much anything. And there is no disparity between the “the” search and the front page; that, in a way, proves the “the” numbers are false, because it’s impossible.
June 5th, 2005 at 6:07 am
This is sort of silly. When you search, you get a VERY rough estimate because Google’s algorithms are just not good at coming up with a precise number. You can more or less ignore those figures. Sometimes that estimate is messed up even when there are only 40 results. When there are 5 to 10 billion, the figure is so rough you can’t trust it.
The 8 billion number is an estimate as well, but a bit more time went into that estimate. Perhaps a little marketing as well, but the estimate on the search results page doesn’t mean a thing.