# Welcome to www.subsume.com. We like most robots, but some are just greedy bastards.
# Contact the webmaster if you're listed here but have wisely addressed the issue of concern.

# Crawls from a number of different NXDOMAIN IPs, and always with a falsified referer.
User-agent: sosospider
Disallow: /

# There is nothing right about this spider.
User-agent: GingerCrawler
Disallow: /

# Doesn't properly parse certain URL schemas, resulting in more 404 madness.
# A lot of IPs have been banned for that behavior, but this is the first "named" spider
# to be so poorly written.
User-agent: FindLinks
Disallow: /

# Another one that 404s with non-http links. If I start seeing many more like this,
# I'm just going to say "fuck it" for entries in this file and go straight to IP
# banning of even the named bots.
User-agent: Gaisbot
Disallow: /

# Is there some shit library that all these bots are using that explains why
# they can't parse a decent URL?
# Oh, and bonus points for having the most generic name here, zoominfo.com dolts.
User-agent: NextGenSearchBot
Disallow: /

# Last one I'm going to list that 404s with non-http links. Expect this file to get shorter soon. :-/
# Also, how many fucking IPs (with no rDNS on top of it) does this thing need to come in on?
# Bonus points for being in SPEWS! I should ban on that alone, but this is their one chance to obey exclusion.
User-agent: Gigabot
Disallow: /

# Inexplicably, Yahoo! Slurp intentionally 404s using a generated URL that starts with
# /SlurpConfirm404/
# We're not here to feed up pages that confirm or deny your spider's internal state.
# By fishing for 404s you get nothing, or 403s if you ignore this.
# Update: We just saw 209.191.87.215 spider a 404 page and 68.142.249.17 spider robots.txt.
# Update: Clearly they lie about exclusion support. Welcome to an IP ban, Yahoo.
User-agent: Slurp
Disallow: /

# These fuckers are just like Slurp, only they're using
# /this_is_a_test_of_404_response
# This is a test of your exclusion support, assholes.
User-agent: BecomeBot
Disallow: /

# Any set of morons that generate intentional 404s like
# /randome2bcf4ef13b2ce4e1d77c2abc8df4315
# I'm getting so sick of this that I'm jumping right to an IP ban, too.
User-agent: OutfoxBot
Disallow: /

# Another intentionally 404ing spider. Another exclusion, another IP ban.
User-agent: e-SocietyRobot
Disallow: /

# Ah, MSN, why I've allowed your bad behavior for so long is a good question.
# Your crimes are: crawling at a rate 100 times greater than the traffic you actually refer,
# tons of 404s for URLs we *never* had on our server (e.g., /Gen_2002/images/BookbagButton1.gif),
# and putting *any* focus on search when you can't even ship your core OS as promised.
User-agent: msnbot
Disallow: /

# FAST Enterprise Crawler/6 comes in from an IP with no reverse DNS.
# It does a crapload of crawling, but I'm not seeing any referred traffic.
# It does not have a proper bot page that I could find, so this User-agent is a guess.
# It may not even honor this file at all, which will lead to an IP block.
User-agent: FAST
Disallow: /

# ZyBorg/1.0 Dead Link Checker is another intolerably bad bot.
# Everything that applies to the previous agent applies to this piece of crap, too.
# The connection (WiseNut/LookSmart) with the shitty grub-client (listed next) doesn't surprise me.
User-agent: ZyBorg
Disallow: /

# You fuckers aren't honoring the * disallows, so you don't get to see anything.
# And if you don't honor this, we'll go to blocking specific hosts.
# Update: We are now blocking host IPs. Die!
User-agent: grub-client
Disallow: /

# Another bot that ignores * disallows, even though they claim to follow the protocol.
# And what the hell is with Yahoo-VerticalCrawler-FormerWebCrawler in the agent? Pick a name!
# This may be the same bot that was listed as FAST above, but it gets a special list.
# Dirty, dirty bot. I kind of hope this is ignored so I get to block by IP.
# Update: It is! I do!
User-agent: fast
Disallow: /

# More * ignorance.
User-agent: NaverBot
Disallow: /

# Intentionally generates 404s by changing the case of a known good URL it just spidered.
# We don't know if it's testing case sensitivity or what, but we don't really care.
# Use the bloody URL you're given!
User-agent: baiduspider
Disallow: /

# Also 404s URLs by changing case.
User-agent: LNSpiderguy
Disallow: /

# QuepasaCreep is an unknown spider that screws up every link it tries.
User-agent: QuepasaCreep
Disallow: /

# VoilaBot is another spider that seems to 404 all the time.
# And despite a complete / ban, it still bugs us multiple times a day.
# Say hello to a 195.101.94.0/24 block, you greedy French fucks!
# The bastards are coming in on 193.252.148.0/24, too. s/are/were/
User-agent: VoilaBot
Disallow: /

# This agent charges a fee for its "services" but provides sites with no compensation.
# You want to make money by leeching content from our site? Pay us.
# Plus, we think stupid people should be allowed to copy in lieu of learning.
# More idiots in the market make us look like absolute geniuses by comparison.
User-agent: TurnitinBot
Disallow: /

# And now for some universal blocks.
#
# Site changes moved things around, so even if old WebObjects links work, they shouldn't be indexed.
# Nobody should be searching directly in the pub for binaries. All the good stuff has pages.
User-agent: *
Disallow: /static/WebObjects/
Disallow: /pub/