Back in March 2007 Matt Cutts posted about a question he’d seen on the Google Webmaster forums, regarding what people should do about the search results pages within a site, and whether or not those should be open to search engine spiders for crawling.
Is there any official Google statement regarding that search result on one’s own site ought to be disallowed from indexing (e.g. via robots.txt)?
Matt then detailed how Vanessa Fox had resolved the question by adding a line to the Google Webmaster guidelines.
Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.
Now I’ve just checked, and that advice is still there. Which is strange, because Google now seems to be ignoring its own advice, disregarding noindex, and indexing search results pages anyway.
Have a look at this page from Google. What you’ll find is over 100,000 pages from Computer Weekly, all of them search results pages. And yet if you look at Computer Weekly’s robots.txt file, you’ll find an instruction disallowing those search results pages from being crawled.
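In case it’s useful, a robots.txt rule blocking a site’s search results typically looks something like this (the /search/ path here is purely illustrative, not a copy of Computer Weekly’s actual file):

User-agent: *
Disallow: /search/

Any compliant crawler reading that is supposed to skip every URL under /search/ entirely.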
And if you actually check the code on any of the search results pages which have been indexed, you’ll also find the following instructions:
<meta name="robots" content="noindex,follow">
<meta name="googlebot" content="noindex,follow">
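For anyone unfamiliar with these tags, noindex,follow asks crawlers to keep the page out of their index while still following the links on it, and the second tag repeats the same instruction specifically for Googlebot.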
Seems pretty obvious what the owners of this site want Google to do with the search results pages, doesn’t it? It certainly seems that they don’t want them indexed (if only because they’re keen to stay within Google’s guidelines).
So what exactly is going on here? Well, to me it seems pretty clear, and it’s not anything that makes Google look good.
The news that Google was planning to start spidering the ‘deep web’ was pretty well publicised. However, it was also reported that Google would still obey robots.txt and nofollow. So why aren’t they?
I think it all comes back to the recent addition of the ‘search within a site’ functionality, which I wrote about a while back. As I said at the time, one of the many things I disliked about this was that in many cases Google might actually not be able to search a site as well as an internal search engine, because if a site wasn’t optimised, content behind search boxes would be inaccessible to Google.
Well, it seems Google wasn’t going to let this stop them, and so is now spidering the very pages it has told webmasters to block. I’m not a big one for conspiracy theories, but the only reason I can think of for them doing this is money: they want to be able to serve more pages within those ‘search within a site’ results, and therefore show more ads. Including, of course, the brand ads which were, until a remarkably short time after search within a site was released, forbidden in the UK (vive Lastminute!).
Google has now moved, in my eyes at least, from being a service that helps users get to where they want to go as quickly as possible, to being a publisher determined to hold onto them for a couple of seconds more, so that it can serve up a few more ads.
Because even if they do spider previously inaccessible pages, they still don’t allow users to search with the sort of advanced filters that many internal search engines offer. Which means that, at the end of the day, this is the sort of decision that can only have been made by accountants rather than engineers.
Dollar bills image: yomanimus on flickr