Google respond … pt#1
08:39 Friday 1 Jul 05
I’m not sure I understand this:
Thank you for your note. Please be assured that our robots do obey robots.txt files. We’d like to investigate further, but you have blocked us from being able to access your robots.txt file. Please note that when our robots see a 403 forbidden error, they interpret this to mean that the site is safe to crawl. For more information on this, please visit http://searchenginewatch.com/sereport/article.php/2164941
So a 403 means what exactly ?
Also, the full robots.txt is actually in the comments to the post they were told to look at. But they did not.

1
Well…. it’s a start right?
10:18 Friday 1 Jul 05
2
True, it is.
Reading that article though - and bear in mind I’ve not seen it before and it is certainly not mentioned in G’s help pages - googlebot sees a site ban and promptly ignores it. It then looks for robots.txt.
If in robots.txt there is no reference to G-bot, it will crawl.
So I ban g-bot with a .htaccess rule. I knowingly use this rule. I do it because I want to ban g-bot. Thinking that g-bot is covered, I write a robots.txt that does not refer to it - after all, it’s banned, right ? But along comes g-bot, sees a 403 and ignores it.
There’s a page in G’s help pages that says they see web pages as a browser such as L.ynx does. L.ynx does not work round .htaccess bans.
Isn’t this wrong ?
(L.ynx used to allow posting. I know the real spelling)
12:25 Friday 1 Jul 05
3
Yes, that is disturbing. I can understand the behavior from a programmer’s point of view tho. It’s like “Look for robots.txt… hmmm get a 403, ok, I can’t get a robots.txt to process, priority is to scrape sites so… assume site is safe to crawl as no information to contrary.”
That’s dumb, but in some respects logical.
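That reasoning can be sketched in a few lines of Python. This is only an illustration of the behaviour described in Google’s reply and paraphrased above, not Google’s actual code. The function name `crawl_decision` and its parameters are made up for the example, and treating every non-200 status the same way as the 403 is an assumption.

```python
def crawl_decision(robots_status: int, disallows_agent: bool = False) -> bool:
    """Return True if this hypothetical crawler would go ahead and crawl.

    robots_status   -- HTTP status received when requesting /robots.txt
    disallows_agent -- whether a readable robots.txt disallows this crawler
    """
    if robots_status == 200:
        # robots.txt was readable, so obey what it says for this agent.
        return not disallows_agent
    # 403 (and, by assumption, any other failure): no rules could be read,
    # so the crawler falls back to "safe to crawl".
    return True


blocked_by_htaccess = crawl_decision(403)
blocked_by_robots = crawl_decision(200, disallows_agent=True)
```

Under this logic a server-level ban that also hides robots.txt ends up being read as permission, which is exactly the complaint in the post.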
It seems that the only way to disallow GoogleBot would be to let your robots.txt be readable and give googlebot a blanket disallow rule (Disallow: /).
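For reference, a blanket block in robots.txt is normally written as `Disallow: /` under the bot’s `User-agent` line. A quick way to check how such a file reads, without touching the network, is Python’s standard `urllib.robotparser`. The rules below are a made-up example, not the robots.txt from this site, and the parser only shows how a well-behaved client would interpret them, not what Googlebot really does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut Googlebot out completely, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask the parser whether each agent may fetch a page on the site.
googlebot_allowed = parser.can_fetch("Googlebot", "/index.html")
other_allowed = parser.can_fetch("msnbot", "/index.html")

print("Googlebot allowed:", googlebot_allowed)
print("msnbot allowed:", other_allowed)
```

Of course this only helps if the file can be fetched in the first place, which is the whole problem with the 403.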
Annoying tho…
12:54 Friday 1 Jul 05
4
The annoyance extends though.
Look at my robots.txt:
So not only have I banned the bot, I have also told it specifically not to go anywhere but it did because it somehow got hold of 100meg of data.
This is no longer so much about Google using code to annoy me; it’s about how we DO keep a private company from crawling our sites when on the one hand they state they obey robots.txt, yet on the other say that robots.txt is a standard, not a rule.
It seems to me that if anyone else can crawl, they will too. I just wish they would clearly explain their actions. The explanation about “we can’t do this, we can’t do that” is frankly crap. Any and all Google pages are instantly PR 10 and during that WP episode, the WP PR was slashed, then restored. That is human intervention and Google is code - and people control code.
Check this too …. I hit it last night:
http://www.tamba2.org.uk/google.png
13:12 Friday 1 Jul 05
5
I thought their point was tho that as you’ve banned them from your site (with a 403) they can’t read the robots.txt to know it’s disallowed? Or did I misinterpret?
And yeah… they take action when they feel motivated to. You are totally right: people control code.
re that image: interesting, very interesting. I assume you double checked for spyware? Just in case they were correct?
15:35 Friday 1 Jul 05
6
I read it as they’ve seen the 403 and “they interpret this to mean that the site is safe to crawl” and THEN they see the robots.txt which they also ignore - after all, if googlebot is banned, it gives the competition an edge.
And that page ? Nah … I was playing with anonymity programs ;)
15:40 Friday 1 Jul 05
7
From Google:
Okaaaay…..but that was already there. So how did it grab 100 megabytes of data ?
17:48 Friday 1 Jul 05
8
Is your robots.txt file CHMOD 644?
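(Mode 644 means the owner can read and write and everyone else can read. A minimal way to check that from Python, assuming a Unix-style filesystem, is below. The helper names `is_world_readable` and `mode_string` are invented for this sketch, and the path you pass in is whatever your own robots.txt path is.)

```python
import os
import stat


def is_world_readable(path: str) -> bool:
    """True if 'others' have read permission on the file, as with mode 644."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def mode_string(path: str) -> str:
    """Return only the permission bits in octal, e.g. '0o644'."""
    return oct(stat.S_IMODE(os.stat(path).st_mode))
```

Calling `is_world_readable` on the robots.txt in your web root should return True if the web server’s user is able to read the file.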
17:48 Friday 1 Jul 05
9
So would the solution be to unban them in your .htaccess (which will always be incomplete, they have hundreds of thousands of servers) and just wait until they catch up with the robots.txt?
07:24 Saturday 2 Jul 05
10
James - yea, it was readable.
Matt - I’m pondering what to do:
1 - Do as you say then watch the stats closely to check it is obeyed
2 - Allow g-bot back in and see if I figure in any results.
09:47 Saturday 2 Jul 05
11
[...] Owen offers some advice on keeping WordPress v1.5.x up to date via SVN. Michael Heilemann prefers to store his ideas via an archaic analog system. Khaled releases Rin v1.1. Michael Hampton foresees the end of free speech. Jon switches to WordPress. Orson discusses the importance of understanding animals. Angsuman releases an automated WordPress v1.5.1.2 to v1.5.1.3 patch upgrade. Mark debunks yet another asinine statement about the U.S. military. Tom reports that business blogging “more than pays for itself.” And, Mark receives a confusing response from Google. [...]
08:59 Monday 4 Jul 05
12
Did you try changing the order of your rules in robots.txt?
09:21 Monday 4 Jul 05
13
I’m actually going back to my original problem.
I banned Google because they were crawling my site but not returning me in any results. Google now have exactly the same access as MSNBot and Inktomi - so I wait to see if they return me too. Position in results is unimportant - just being there would make a change.
09:47 Monday 4 Jul 05