What makes you happy ?

"Karma man, just remember Karma. Treat things nice and nice things happen to you." © Claire

Google ? Lying ? Noooo……

19:10 Saturday 25 Jun 05

Checking the stats for my WP pages, I see this:

I’ve included MSNBot there for comparison. Note the amount of data and the last visit.

Now look what is in my robots.txt:

User-agent: Googlebot
Disallow: /

Now let’s check what the Google help pages say:
“robots.txt is a standard document that can tell Googlebot not to download some or all information from your web server.” ( http://www.google.com/bot.html#robots )

Two separate robots.txt validators BOTH say my robots.txt is valid. So what gives ? Google say they obey it. Would you believe that they are lying ?
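
For anyone who wants to check how a standards-following parser reads those two lines, here is a minimal sketch using Python's standard-library `urllib.robotparser`. It only shows what the rules ask for, not what Googlebot actually did; the agent string `msnbot` and the path `/` are illustrative choices, not taken from my logs.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above from robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# "msnbot" and "/" are illustrative values (assumptions for this sketch).
for agent in ("Googlebot", "msnbot"):
    verdict = "allowed" if parser.can_fetch(agent, "/") else "blocked"
    print(f"{agent}: {verdict}")
```

By this parser's reading, a crawler that identifies itself as Googlebot is asked to stay out entirely, while any other crawler is unaffected by these two lines.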

More: Google
  1. Rob Mientjes
    1
    • Wow, that’s very wrong. Mail the bastards. Or won’t they listen?

    19:23 Saturday 25 Jun 05


  2. MacManX
    2
    • I’ve noticed the same. I have robots.txt set to disallow all robots access to the cover images from my library, yet I still find them listed on Google.

    20:33 Saturday 25 Jun 05


  3. Mark
    3
    • My full robots.txt

      User-agent: HenryTheMiragoRobot
      Disallow: /

      User-agent: Googlebot
      Disallow: /

      User-agent: *
      Disallow: /gallery
      Disallow: /games
      Disallow: /images
      Disallow: /nota
      Disallow: /stats
      Disallow: /upb
      Disallow: /getout.php

      Their docs say the bot obeys the FIRST rule it finds specific to it.
      Obviously it does not.

      I will be asking them what’s up, yes.

    20:35 Saturday 25 Jun 05


  4. Mark
    4
    • I’ve submitted a question asking for an explanation.
      If you are the person from Google looking at this, you can post here with the answer if you want ..

    20:47 Saturday 25 Jun 05


  5. Cameron aka desk003
    5
    • this is mine:
      User-agent: *
      Disallow: /cgi-bin/
      Disallow: /private/
      Disallow: /scgi-bin/
      Disallow: /old/
      Disallow: /new/
      Disallow: /backup/
      Disallow: /_images/
      Disallow: /webalizer/
      Disallow: /willoway/
      Disallow: /stuff/
      Disallow: /images/

      and I think that keeps googlebot at bay.

    02:17 Sunday 26 Jun 05


  6. Mark
    6
    • From the help pages:

      The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule.

      In my robots.txt, the longest does not apply, the second is as specific as you can get.
      Robots.txt is created to allow flexibility and that is an option I wish to exercise. MSNBot is welcome here, Googlebot is not.

      I want the Google person to tell me how to exclude their bot because even though I am following what I think are the rules, Googlebot is disobeying them.
      If I have not heard a decent reply in a few days I will post this over at Webmasterworld and other such forums to both publicise it and get more information.

      Fact is that even when Google did crawl my site they refused to return me in results. They’ve taken 100 MB of my data - go search for ‘tamba2’ - you will not find a single direct link. Not one. Hardly fair is it ?

    09:01 Sunday 26 Jun 05


  7. TigerDE2
    7
    • Well, I’m not a native English speaker, but to me, your last quote says

      Googlebot is taking the longest rule applicable it can find.

      So, basically, Googlebot sees:

      User-agent: Googlebot
      Disallow: /

      which is very specific but two lines long, and then it finds this:

      User-agent: *
      Disallow: /gallery
      Disallow: /games
      Disallow: /images
      Disallow: /nota
      Disallow: /stats
      Disallow: /upb
      Disallow: /getout.php

      And that rule includes Googlebot and is way longer than the first one, so it’s picked…
      At least that’s what I think… :wink:

    17:16 Sunday 26 Jun 05
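
There is another possible reading of the help-page quote, sketched here as an assumption rather than as Google's actual implementation. The crawler first picks the group whose User-agent line names it (falling back to `*`). "Longest" then means the longest matching Disallow/Allow path inside that group, not the group with the most lines. The helper names `parse_groups` and `is_allowed` are hypothetical, the test paths are made up, and wildcards in paths are ignored.

```python
ROBOTS_TXT = """\
User-agent: HenryTheMiragoRobot
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /gallery
Disallow: /games
Disallow: /images
Disallow: /nota
Disallow: /stats
Disallow: /upb
Disallow: /getout.php
"""


def parse_groups(text):
    """Map each lowercased User-agent token to its list of (field, path) rules."""
    groups = {}
    current_agents = []
    in_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups, agent, path):
    """Assumed logic: most specific group first, then longest matching path."""
    agent = agent.lower()
    named = [name for name in groups if name != "*" and name in agent]
    rules = groups[max(named, key=len)] if named else groups.get("*", [])
    best = ("allow", "")  # no matching rule means the path is allowed
    for field, rule_path in rules:
        if rule_path and path.startswith(rule_path) and len(rule_path) > len(best[1]):
            best = (field, rule_path)
    return best[0] == "allow"


groups = parse_groups(ROBOTS_TXT)
for agent, path in [("Googlebot", "/index.php"),
                    ("msnbot", "/index.php"),
                    ("msnbot", "/gallery/pic.jpg")]:
    print(agent, path, "allowed" if is_allowed(groups, agent, path) else "blocked")
```

Under this reading the `User-agent: *` group is never consulted for Googlebot, however many lines it has, because a group naming Googlebot exists.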


  8. Cameron aka desk003
    8
    • Mark, you were just owned by TigerDE2. :razz:

      I think (s)he’s correct.

    17:21 Sunday 26 Jun 05


  9. Mark
    9
    • TigerDE2 - that could well be true ….
      And this leads to what made me ban Googlebot in the first place - go to Google and look for “Mark tamba2 wordpress”. Now if Googlebot has been taking my data, why will it not return that data in a search ?

      I’m going to look again.

    17:29 Sunday 26 Jun 05


  10. MacManX
    10
    • The standard says we should obey the first applicable rule, whereas Googlebot obeys the longest (that is, the most specific) applicable rule.

      Now that’s just plain idiotic. Googlebot should obey any rules set to “User-agent: Googlebot” and any rules set to “User-agent: *”, not just the longest. Basically, Google’s help pages say two things:

      1. “Use robots.txt to block the Googlebot.”

      2. “The Googlebot will not obey the standard syntax of robots.txt file.”

      This reminds me of the average American:

      1. The average American will not vote for the President that’s best for him/her.

      2. He/she will vote for the President with the biggest smile.

    01:54 Monday 27 Jun 05


  11. MacManX.com » Blogroll Dive: 6/27/05
    11
    • [...] nt of view on the upcoming 9/11 memorial. Tom ruminates on RSS and its possible uses. And, Mark discovers that the Googlebot is disobeying his robots.txt file.

      [...]

    08:08 Monday 27 Jun 05


  12. Angsuman Chakraborty
    12
    • Why do you ban Googlebot in the first place? Is it because you weren’t getting high ranking as you seem to indicate in comments?

      I think you should let them know. They do respond fairly quickly.

    12:06 Tuesday 28 Jun 05


  13. Mark
    13
    • I had a PR of 6 or 7 - I forget.

      I pointed this post out to them on the day I made it. I have heard nothing from them.

    12:13 Tuesday 28 Jun 05

