What are good ways to stop search engines from indexing my site? I’ve tried excluding the entire site using “robots.txt”, but this seems to be ignored. A more permanent solution would be to find the IP ranges being used by the crawlers and ban those, but I cannot find them anywhere. Any ideas?
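For reference, excluding the entire site just means the blanket robots.txt form, something like this:

    # hypothetical robots.txt - asks every well-behaved agent to stay out of everything
    User-agent: *
    Disallow: /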
I swore there was a <META> tag you could use to have spiders avoid specific pages… I’ll see if I can dig it up.
Try this out, m’dear. [tips hat]
Thank you very much, erislover, those are very good suggestions. Now, just to make things really nice, does anyone know the IPs of the common webcrawlers, so I can ban all of them too if they ignore my exclusions (which some of them do already)?
I don’t know of a general list of agents and corresponding IPs, but you’re not the first to face this so I bet there’s one available somewhere.
One easy way to get a list would be to parse your server logs to remove common browser agents, and then review the remainder to see which ones you want to exclude. Many identify themselves by user-agent, and it would be better to block them on that than on IP, since IPs may change as networks grow. The logs would also give you IPs, but not necessarily the entire network blocks that may be hosting agents. Some agents may be difficult to identify (and it’s likely that those that ignore robot exclusion will also obfuscate their identity), so at some point you run the risk of blocking some obscure browser rather than an ill-behaved spider.
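As a rough sketch (the log path and the Apache “combined” log format are assumptions - adjust them to whatever your server actually writes), something like this would tally the user-agent strings so the non-browsers stand out:

    # Tally user-agent strings from an Apache combined-format access log.
    # LOG_FILE and the regex are assumptions; adapt them to your own logs.
    import re
    from collections import Counter

    LOG_FILE = "access.log"
    # A combined-format line ends with: "referer" "user-agent"
    AGENT_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

    counts = Counter()
    with open(LOG_FILE) as log:
        for line in log:
            match = AGENT_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    # The most frequent agents; anything that isn't a browser is a candidate to block.
    for agent, hits in counts.most_common(30):
        print(f"{hits:6d}  {agent}")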
Possibly nosy question, but why do you want to do this?
Ah, I KNEW there was a META tag!!
May I ask exactly why you are trying to block spiders?
Ethilrist makes a good point. There’s very little reason to put something on a public server and then exclude indexing. One reason I can think of is dynamically generated pages from a very large database where you simply want to limit load. There was an early case of spiders crashing a large medical database because of fast-and-furious page requests, but many spiders now exclude URLs which include query information.
One solution might be access control (htaccess on Unix/Apache, IUSR permissions on NT, etc.) which would require a user login to access the content. This would restrict the content to authorized users which serves the function of blocking spiders.
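On Apache, the usual sketch is a .htaccess in the protected directory plus a password file built with htpasswd; the path and realm name here are just placeholders:

    AuthType Basic
    AuthName "Members only"
    AuthUserFile /path/to/.htpasswd
    Require valid-user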
From http://www.robotstxt.org/wc/exclusion-admin.html:
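The short version is the robots META tag, which goes in the <HEAD> of any page you want left alone:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">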
Hope that helps the OP.
Overall, the point of putting things on a public server and excluding indexing is that I simply don’t want them indexed. My server is not public, but semi-private. I want people to be able to get to it if they know it is there, but not for it to be attracting anyone who enters “lesbian erotica” into a search engine. But there are practical reasons behind it as well.
The main reason is that I am running a vBulletin board - (the UnaBoard) - and I have very limited bandwidth. The crawlers are not supposed to be trying to index down into all of the threads, but some of them sure as hell are. This causes a huge server hit, and I am tired of it. In parsing the logs, I have found that the crawlers seem to use several different IP ranges, even from the same company.
Certainly a valid reason, directly analogous to the medical database I cited. Well-behaved spiders are not supposed to do that. I’d be curious to know which spiders are doing it, and I would join you in complaining to their admin. If the spider identifies itself as such and still does this, it’s likely an oversight by the designer rather than anything malicious or self-centered.
If the log identifies the owner (in user-agent?), you can block based on that rather than relying on IP. IP blocks could change as the company adds new indexing resources, but it’s less likely they will rename their agent.
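If you’re on Apache, one way to block on the agent string is mod_setenvif plus a deny rule - something like this in .htaccess, where “BadBot” is just a stand-in for whatever name shows up in your logs:

    SetEnvIfNoCase User-Agent "BadBot" block_me
    Order Allow,Deny
    Allow from all
    Deny from env=block_me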
I didn’t mean to offend or get OT with my public/private comments. I just run into a lot of people who want to put what they consider “private” content on a public server and then get indignant when someone else sees it. This is amusing in some cases, but downright irresponsible in the case of some businesses I’ve seen that have exposed private LAN content to the web and assumed it would be secure as long as it wasn’t linked.
Quick question: are you sure they are spiders from known search engines? Spider programs are terribly easy to code and are only as memory-intensive as the programmer cares to make them, meaning they can scan stuff all day long - and in a bulletin board program, that is just about everything! :eek:
And for them to ignore a page, they’d have to code in methods to ignore pages, knowwhatImean? The biggest part of writing a spider program (at least in Java) is coding the damn window. The actual spider code is pretty small. (obligatory window-coding remark - UGH!)
I think you will have a hard time blocking IPs, personally. It seems like it might be possible only in theory, not in practice.
I’d stick with the robots.txt with the appropriate denial arguments, and in pages that aren’t supposed to be indexed plop the META tag in, too. (it should be easy enough to put that inside the vB template, no?)
If that doesn’t work, contact the search engine company and inform them that they can pay the part of the bill corresponding to their spider’s searches if they won’t tell you how to keep their spiders out.
I wish I knew what it actually was, as it doesn’t seem to identify itself. If I do find out, I will be doing something to complain.
No offence taken, I knew what you meant.
I should note there are email-address-harvesting robots on the web that ignore the robots protocol. The best way to combat this would be to set up some sort of monitoring of incoming HTTP requests and ban abusers from the site. But like all “best solutions,” this is the hardest to implement.
Good luck.
Banning crawlers isn’t really so tough. Set a threshold for read requests; it will vary based on the number of files per web page. It takes some fine-tuning, but 20-40 read requests in a minute is a pretty good indication that a program is crawling your site and not a person behind a keyboard. How you record the requests will depend on the technology you’re using and your server’s capabilities. I store this kind of info in a database, but that isn’t strictly necessary. When an abuser is detected, return HTTP error code 503 “Service Unavailable” for all subsequent requests from that IP. This will prevent your server from being overloaded with requests in the short term. When you periodically review your logs, the 503s should stand out and you can decide whether or not to permanently ban them by host name or IP.
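A bare-bones sketch of the counting part (the names and thresholds are placeholders, and you’d wire should_block() into whatever actually serves your pages):

    # Per-IP rate limiting: more than MAX_HITS requests inside WINDOW seconds
    # flags the address, and every later request from it gets the 503.
    import time
    from collections import defaultdict, deque

    WINDOW = 60        # seconds
    MAX_HITS = 30      # 20-40 requests a minute usually means a crawler, not a person

    hits = defaultdict(deque)   # ip -> timestamps of recent requests
    banned = set()

    def should_block(ip):
        """Return True if this IP should get a 503 from now on."""
        if ip in banned:
            return True
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:   # drop hits that fell out of the window
            q.popleft()
        if len(q) > MAX_HITS:
            banned.add(ip)   # the 503s will stand out when you review the logs
            return True
        return False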
Incidentally, instead of banning troublesome users who spam or troll your messageboards, it’s easier to simply return unusual HTTP error codes. That way, it doesn’t become a battle with the banned user out to seek revenge on the tyrannical moderators. They will think there is something wrong with your site, become pleased, and likely leave you alone. I prefer error code 410 “Gone”.
Now those are some very interesting suggestions, evilhanz. Thanks!
I ran across this and thought it apropos. It looks like it has a lot of resources for just what you need:
http://www.spiderhunter.com/