Google stopped counting, or at least publicly displaying, the number of pages it had indexed in September of ’05, following a schoolyard “measuring contest” with rival Yahoo. That count topped out at around 8 billion pages before it was removed from the homepage. News broke recently through various search engine optimization forums that Google had suddenly, over the previous few weeks, added another few billion pages to its index. This might sound like a reason for celebration, but this “achievement” would not reflect well on the search engine that attained it.
What had the search engine optimization (SEO) community buzzing was the nature of those fresh few billion pages. They were blatant spam, containing Pay-Per-Click (PPC) ads and scraped content, and in many cases they were showing up well in the search results, pushing out far older, more established sites in the process. A Google representative responded to the problem on the forums by calling it a “bad data push,” an explanation that was met with groans throughout the SEO community.
How did someone manage to fool Google into indexing so many pages of junk in such a short period of time? I will provide a high-level summary of the process, but do not get too excited. Just as a diagram of a nuclear explosive is not going to teach you how to build the real thing, you’re not going to be able to run off and do it yourself after reading this article. Still, it makes for an intriguing story, one that illustrates the ugly problems cropping up with ever-increasing frequency in the world’s most popular search engine.
A Dark and Stormy Night
Our story begins deep in the heart of Moldova, sandwiched scenically between Romania and Ukraine. In between fending off local vampire attacks, an enterprising local had a brilliant idea and ran with it, presumably away from the vampires… His idea was to exploit how Google handled subdomains, and not just a little bit, but in a big way.
The heart of the issue is that, at present, Google treats subdomains much the same way as it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at some later point to do a “deep crawl.” A deep crawl is simply the spider following links from the domain’s homepage deeper into the site until it either finds everything or gives up and comes back later for more.
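To make the idea of a deep crawl concrete, here is a minimal sketch in Python. It is a toy model, not Google's actual crawler: the function names, the page budget, and the example.com hostnames are all assumptions made up for illustration, and a small in-memory link map stands in for fetching real pages. It shows the two behaviors described above: each hostname, subdomain included, counts as its own site, and the spider follows links from the homepage until it runs out of pages or gives up and leaves the rest for later.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link map standing in for real page fetches.
SAMPLE_LINKS = {
    "http://a.example.com/": [
        "http://a.example.com/p1",
        "http://a.example.com/p2",
        "http://b.example.com/",  # different subdomain, so a different site
    ],
    "http://a.example.com/p1": ["http://a.example.com/p3"],
}


def site_key(url):
    """Treat every hostname, subdomain included, as a separate site."""
    return urlsplit(url).hostname


def deep_crawl(homepage, links, budget):
    """Follow links breadth-first from a homepage, staying on one hostname.

    budget is an assumed cap on pages per visit. Returns the pages visited
    in order, plus the URLs left over for a later visit.
    """
    site = site_key(homepage)
    queue = deque([homepage])
    seen = {homepage}
    visited = []
    while queue and len(visited) < budget:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            # Only queue unseen links that stay on the same hostname.
            if target not in seen and site_key(target) == site:
                seen.add(target)
                queue.append(target)
    return visited, list(queue)
```

Calling `deep_crawl("http://a.example.com/", SAMPLE_LINKS, 2)` visits the homepage and one inner page, then stops and hands back the remaining URLs for a return trip. The link to `b.example.com` is never followed, because in this model it belongs to a separate site with its own homepage and its own crawl.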