Back in March of 2007, Matt Cutts posted about a question he’d seen on the Google Webmaster forums, regarding what people should do about the search results pages within a site, and whether or not those should be open to search engine spiders for crawling.
Is there any official Google statement regarding that search result on
one’s own site ought to be disallowed from indexing (e.g. via robots.txt)?
Matt then detailed how Vanessa Fox had resolved the question by adding a line to the Google Webmaster guidelines.
Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.
Now I’ve just checked, and that advice is still there. Which is strange, as it now seems that Google is ignoring both its own advice and the use of noindex, and is indexing search results pages.
Have a look at this page from Google. What you’ll find is over 100,000 pages from Computer Weekly, all of them search results pages. And yet if you look at Computer Weekly’s robots.txt file, you’ll find the following instruction:
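The quoted robots.txt line is missing from this copy of the post; a typical rule blocking internal search results looks something like the following (the path here is purely illustrative, not necessarily Computer Weekly’s actual rule):

```
User-agent: *
Disallow: /Search
```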
And if you actually check the code on any of the search results pages which have been indexed, you’ll also find the following instructions:
<meta name="robots" content="noindex,follow">
<meta name="googlebot" content="noindex,follow">
Seems pretty obvious what the owners of this site want Google to do with the search results pages, doesn’t it? It certainly seems that they don’t want them indexed (if only because they’re keen to stay within Google’s guidelines).
So what exactly is going on here? Well, to me it seems pretty clear, and it’s not anything that makes Google look good.
The news that Google was planning to start spidering the ‘deep web’ was pretty well publicised. However, it was also reported that Google would still obey robots.txt and nofollow. So why aren’t they?
I think it all comes back to the recent addition of the ‘search within a site’ functionality that I wrote about a while back. As I said at the time, one of the many things I disliked about it was that in many cases Google might actually not be able to search a site as well as an internal search engine: if a site was not optimised, content behind search boxes would be inaccessible to Google’s crawler.
Well, it seems Google wasn’t going to allow this to stop them, and so it is now spidering the very pages it has told webmasters to block. I’m not a big one for conspiracy theories, but the only reason I can think of for them doing this is money: they want to be able to serve more pages within those ‘search within a site’ results, and therefore show more ads. Including, of course, the brand ads which were, until a remarkably short time after search within a site was released, forbidden in the UK (vive Lastminute!)
Google has now moved, in my eyes at least, from a service, helping users to get to where they want to go as quickly as possible, to a publisher, determined to hold onto them for a couple of seconds more, so that they can serve up a few more ads.
Because even if they do spider previously inaccessible pages, they still don’t allow users to search with the sort of advanced filters that many internal search engines utilise. Which means that at the end of the day, this is the sort of decision that can only have been made by accountants rather than engineers.
Dollar bills image: yomanimus on flickr
It is really annoying when Google changes tack mid-session. I have had clients who insist on allowing their search results to be indexed, especially since this gives them a whole new range of content and opportunities.
It’s also funny when you are a Google CSE publisher: Google encourages you to link to your popular queries’ results from your site. They kindly even provide the code. Admittedly it’s in JS, but it wouldn’t be hard to set your own custom results, tag them and create a popular-queries tag cloud, with results on specially created pages all indexable by Google. The CSE doesn’t have any rules that would disallow such behaviour.
On-the-spot notice: robots.txt does not prevent Google from indexing (pages might still be indexed from external links, for example), but it does prevent crawling. Thus the pages are being indexed because Google is not able to enter the page and see the ‘noindex’.
Anyway, I need to look into the case to be sure…
@rishil Do as I say, not as I do……
@Ann – sounds interesting, but I just don’t buy it. So Google enters the page from an external link but doesn’t see the noindex? Right. And even if that is the case, it still doesn’t change the fact that all of this goes against Google’s own advice, and no amount of wordplay will get round that.
I get the idea that nofollow is a very temporary measure: a quick-fix solution to the problem of a link-chasing culture. I think we’ll look back in a few years’ time, when an alternative system has been developed, and laugh about the days we used to sculpt via nofollow. Oh, those crazy times, hoho.
noindex, however, should be different: it’s almost a request for privacy, and as you say, we also used to make sure we kept within ‘the guidelines’. I can foresee a third-party indexation blocker emerging if it doesn’t do what it says on the tin.
Here is a link to a related discussion at Google Groups with a Google employee explaining things very well (well, better than me, I guess :))
Thanks Ann – that’s very interesting (if completely screwed) and I’ll pass it on to my old colleague.
What it still doesn’t answer is why Google’s now indexing the deep web, when they have explicitly told webmasters not to allow them to do exactly that.
Answers on a piece of paper showing Google’s share price to the normal address please…
Cheers for the link Ann, that’s a good way of putting it.
Ok Ann, I’ve been thinking some more about this.
The search results pages in question are not being indexed because they have been linked to from another site (at least according to a quick look at Yahoo Site Explorer); they are being indexed because Google has decided to index the deep web. Therefore any semantics about crawling, indexing or anything like that are just so much rubbish: they’re indexed because Google wants them to be, and nothing the site owner says is likely to make the slightest bit of difference.
What a huge black hat opportunity. More or less, what Google’s Webmaster Forum is stating (linked to above by Ann Smarty) is that if you disallow a page, it can still be indexed based on the inbound anchor text and the relevance of the pages pointing to it. Googlebot is not allowed to view the page at all (and in this example can’t determine it has a noindex tag). If Google can’t view the page at all, then you could theoretically rank really well for something that has absolutely nothing to do with the page. Imagine the possibilities: linkbait widgets with anchor text of xyz that point to an abc page that is disallowed, so Googlebot has no clue the page is about abc rather than xyz.
I miss the good ole days . . . too bad my hat has to be white now. This could’ve been a lot of fun. 😉
Brent, get your mind out of the gutter! 😀
Ciaran, nice find & well-written piece.
The solution is pretty darn easy: get rid of that robots.txt line… Doing both is not double protection, it just shows that you don’t know how it works 🙂 Google has ALWAYS indexed links that it wasn’t allowed to spider, for various reasons, and though I don’t agree with them doing that either, it’s how things work.
And even though you’re throwing the blame at Google now, I would seriously wonder how come 105,000 INTERNAL search result pages on that site are being linked to, apparently without a nofollow condom, thus losing loads of link juice… What benefit is it to ANYONE to link to 105,000 internal SERPs?
Joost – it does seem like we (although I don’t work there any more) were rather naive to believe that following Google’s guidelines would mean that Google did what we asked.
I have to disagree with your second statement though; they’re not being linked to – they don’t exist unless you use the search box. And this is something that Google has only just started doing, and goes completely against their own guidelines.
btw the fix is pretty simple, change the search query from a get to a post, and it’s all done 🙂
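For anyone wondering what that fix looks like in practice, here’s a minimal sketch (the form action and field name are illustrative, not Computer Weekly’s actual markup). A GET form exposes each query as a crawlable URL, while a POST form sends the query in the request body, so no results URL exists for a spider to discover:

```html
<!-- GET: each search produces a linkable, crawlable URL like /Search?q=foo -->
<form action="/Search" method="get">
  <input type="text" name="q">
</form>

<!-- POST: the query travels in the request body, so no crawlable
     results URL is ever generated for a spider to follow -->
<form action="/Search" method="post">
  <input type="text" name="q">
</form>
```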
You’re right (obviously!) but why should they have to?
Google says “don’t let us find search results” and then finds a way to do just that.
You can’t usefully use a robots.txt block AND a meta noindex together.
If you block a page with robots.txt, Googlebot will never crawl the page and will never read any meta tags on the page (so it never reads the meta noindex).
If you allow a page via robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.
If you have both, the robots.txt trumps the meta noindex, as the bot won’t visit or parse the page and so can never read the meta noindex… so you get a ‘thin’ result.
i.e. exactly what you are seeing here – 105,000 thin results
Robots.txt = thin result
Meta noindex = no thin result – nothing
Use both = thin result
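Chris’s summary is easy to check with Python’s standard-library robots.txt parser. This sketch (using a made-up `/Search` path as in the examples above) shows why a compliant crawler blocked by robots.txt never fetches the page, and therefore never sees any meta noindex inside it:

```python
from urllib import robotparser

# A hypothetical robots.txt blocking internal search results
rules = [
    "User-agent: *",
    "Disallow: /Search",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The search results URL is disallowed, so a compliant crawler never
# fetches it and never reads the meta noindex in its HTML.
print(rp.can_fetch("Googlebot", "http://example.com/Search?q=foo"))  # False

# Ordinary article pages remain crawlable.
print(rp.can_fetch("Googlebot", "http://example.com/article/123"))   # True
```

The URL can still end up in the index as a ‘thin’ result via inbound links, which is exactly the behaviour described above: robots.txt governs crawling, not indexing.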
I don’t quite get the logic there, Chris, but I think I’m going to run some specificity tests of my own with different robots.txt rules, meta instructions, in-site SERPs and inline nofollows.
Man, do I know how to party!