It’s certainly not in the Rules. There is this rule:
But spidering the board obviously does not fall under this rule, as even folks who have not registered as members (such as BoardReader) have been permitted to do this for years. Blocking them is just one line of code (Deny from 74.204.5.205) in .htaccess and yet it has never been done.
“Running scripts” is also an extremely dubious definition. If your browser is configured for prefetching, you are running scripts. Browsers themselves, especially in the case of Firefox, are largely built using Javascript.
BoardReader are forbidden from spidering the site in the robots.txt file - they just choose not to respect that convention (well, actually I believe they’re just looking in the wrong place for the file, but never mind). It’s not practical to expect the admins to go through and IP ban every spider that doesn’t respect the standards.
Anyway: regardless of the rules, or the fact that other people are allowed to do it (or do it even when they’re not allowed), it’s really, really silly to advise individual users to effectively download the entire database with wget. You’re essentially advocating a DDoS attack on the board when it’s already plainly at its limits. Please, please, don’t do this.
No offence meant, but let’s keep a little perspective, eh? Search has been disabled for a matter of days. It’s plain that the admins don’t regard this as a permanent measure. Can’t we survive for a week or so before we all start caching our own copies of ten years of posts? It doesn’t matter whether you can parse the rules to justify your actions; you’ll bring the board to its knees. Think these things through, eh?
SIAM SAM, your little friends are wrong. They have been affected by the skepticism of a skeptical age. They do not believe except [what] they receive in their email as glurge. They think that nothing can be trusted which is comprehensible by rational consensus or by anyone who places their trust in critical thinking and in fighting ignorance. All minds, Siam Sam, whether they be men’s or children’s, are little. In this great universe of ours man is a mere insect, an ant, in his intellect, as compared with the boundless world about him, as measured by the intelligence capable of grasping the whole of truth and knowledge. All the same, we can observe, study and learn, and we can yet still have room for the occasional leap of faith.
Yes, SIAM SAM, there is a Cecil Adams. He exists as certainly as rational thought and edification and curmudgeonly resistance to ignorance exist, and you know that they abound and give to your life its highest beauty and joy. Alas! how dreary would be the world if there were no Cecil Adams. It would be as dreary as if there were no SIAM SAMS. There would be no learning then, no verification, no analysis to model this existence. We should have no scientific learning, except the type that is forwarded to ten of our closest friends so that something lucky will happen to us. The eternal light with which knowledge fills the world would be extinguished.
Not believe in Cecil Adams! You might as well not believe that cornflakes is my real name! You might get your papa to hire men to watch all the doors of the Chicago Reader to catch Cecil Adams, but even if they did not see the Perfect Master walking into the building, what would that prove? Nobody (except Ed Zotti) sees Cecil Adams, but that is no sign that there is no Cecil Adams. The most real things in the world are those that neither children nor men can see. Did you ever see quarks dancing inside an atom? Of course not, but that’s no proof that they are not there. Nobody can conceive or imagine all the wonders there are unseen and unseeable in the world.
It’s of the order of 5-10 gigs, but do you not realise what that would do to the server if you downloaded it all in one go, one page at a time through the board software? No-one gives a crap how much of your own hard disk you waste, but if more than a few people try to download the entire board at once, it will absolutely kill it. The admins have just disabled searching precisely because the server can’t take the load. Does this not tell you anything?
I can’t make it plain enough: this is a spectacularly stupid thing to do. DO NOT DO IT. You will KNACKER THE BOARD.
Jeez. I’m no fan of how the board is run, but seriously; search on a database the size of ours is intrinsically problematic in vBulletin (see here). For the love of God give the admins some time to sort it out. I know it’s belated on their part, and I know it’s annoying, but for God’s sake let’s have a little perspective.
Right, you would want to automate that process. Trust me, the sysadmins at the Reader wouldn’t be the first to have to tackle this problem. There are plenty of open source solutions. Despite the slowness of the boards over the years, and despite the fact that it’s been known for years that BoardReader has been spidering the Dope, sucking up precious Hamster seconds and profiting off it to the chagrin of subscribers, no sysadmin ever bothered to go in and put a stop to it. That puts it in the class of sanctioned activities. I point this out and suddenly there is an outcry. Ban alterego! He pointed out the obvious! If anything, downloading a copy of the Dope for personal perusal is legally protected (especially with nary a mention of it in the TOS), whereas a commercial website violating robots.txt has been found illegal in the past.
Oh, for crying out loud. Who called for your banning, hmm? (And no, ultrafilter didn’t.) All I said was that regardless of whatever rubbish you choose to spout about external spiders and ridiculous parsings of non-existent rules, advising individual users to download entire copies of the board for personal use is really fricking stupid, for the sole reason that it will completely clobber the board at a time when you already know that it’s at its limits. How you can possibly quibble on this point is beyond me. The “outcry” (and I don’t see how one person pointing out a perfectly sensible technical objection constitutes an “outcry”) is entirely pragmatic.
Furthermore, the current obsession with BoardReader “profiting off the boards” bewilders me, given that it’s currently the only method of searching the bloody thing. Use BoardReader’s search for the time being and give it a rest, would you? Why on earth are you “chagrined” that someone else is providing you with a service whose very absence from these boards you simultaneously lament? It’s utterly perverse.
I wasn’t proposing knackering the board by downloading the whole thing through vBulletin. Jesus. I just find it damned unbelievable that, woah!, 10! -20! gigs!! of text!!! is even a problem at ALL, of any proportion. I mean, jeez, I could download every message ever written on the board in like 4 hours (yes, I do have a 10 Mbps connection), from my home computer, if someone cared to zip it up and put it in an FTP-accessible area. Searching a database of that size is a problem that was solved many years ago.
Yes, it does. A great deal.
I’ll stop now, as I’m heading (or have headed) into Pit territory. Carry on.
The number of users who can run the command I posted above is very small and there are easier methods using GUIs that allow for more configuration. I was simply making a point - the gates are already open and the policies are backwards. When the search engine is not disabled, the boards are slow. This is partly due to the search engine, and partly due to external sites that have been explicitly blocked from spidering the boards doing so anyway, and profiting off of it. If the board were instead to allow Google to search the SDMB the board would no longer be slower due to searches being performed and the SDMB would profit from it. Allowing Google to do what BoardReader is already doing solves all of the problems and costs nothing, especially if you then begin blocking BoardReader, which should have been done long ago.
Yes, but no-one’s going to do that, and the command given above is not the same thing at all. The bandwidth is irrelevant. The point is the vast number of page generations and database requests that you’re putting the server through. We’ve got about 350,000 threads on here, many with far more than one page, each of which has to be dynamically generated by the board. You’d be making something like a million database requests all in one go. Proper web spiders have rate-limiting techniques that prevent them from overloading servers. The command above gets as much as it possibly can, as fast as it possibly can. This is Not Good. I repeat: pointy server death will ensue, as surely as the wild shits follow a dodgy curry.
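To put rough numbers on that, here is a minimal Python sketch of the sort of rate limiting a well-behaved spider applies, plus the arithmetic for how long a polite crawl would take. The names `throttled` and `crawl_hours` and the one-second delay are my own assumptions for illustration, not anything BoardReader, Google or wget is known to use; the million-request figure is just the estimate above.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(items: Iterable[str], min_interval: float,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield items no faster than one per min_interval seconds.

    Hypothetical helper: a real spider would fetch each yielded URL,
    and would usually back off further when the server is slow.
    """
    next_allowed = clock()
    for item in items:
        wait = next_allowed - clock()
        if wait > 0:
            sleep(wait)  # pause instead of hammering the server
        next_allowed = max(clock(), next_allowed) + min_interval
        yield item


def crawl_hours(page_count: int, min_interval: float) -> float:
    """Lower bound, in hours, on a crawl at one request per min_interval."""
    return page_count * min_interval / 3600.0


if __name__ == "__main__":
    # Assumed figures: about a million page requests, one per second.
    print(f"{crawl_hours(1_000_000, 1.0):.0f} hours")
```

At one request per second, a million page requests comes to roughly 278 hours, or about eleven and a half days. That is why a proper spider takes its time, and why grabbing the lot as fast as the server will answer is a different thing entirely.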
What? I don’t understand why Google is fundamentally any different to BoardReader; what “problem” does Google solve that BoardReader does not? Both spider relatively slowly so as not to put excessive load on either the boards they spider or their own servers. This is clearly preferable to a bunch of users nabbing their own copy of the board in one go, whether it annoys you that other people are spidering the board or not. Your argument seems to be that because the admins haven’t specifically banned other people who spider the board, it’s fine for you to. No matter what I think of this reasoning, it’s irrelevant. I don’t care what implicit rules you think you can read into the actions or non-actions of the admins (although you’d think that the “Disallow: /” line in robots.txt is pretty explicit); your advice will knacker the board and will benefit almost no-one.
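For anyone unsure what respecting robots.txt actually involves, here is a small sketch using Python’s standard `urllib.robotparser`. The file contents, the bot name and the URLs are made up for illustration; I am only assuming the board’s file carries a blanket rule forbidding all spidering, which is what the posts above suggest.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a blanket rule that forbids every spider everything.
BLANKET_BAN = """\
User-agent: *
Disallow: /
"""

# For contrast, a hypothetical file that only fences off one directory.
PARTIAL_BAN = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.com/showthread.php?t=1"  # placeholder URL
    print(is_allowed(BLANKET_BAN, "ExampleBot", url))
    print(is_allowed(PARTIAL_BAN, "ExampleBot", url))
```

A polite spider runs a check like this before every fetch and simply skips anything that comes back disallowed. The file is purely advisory, though: nothing on the server enforces it, which is how a spider that ignores it (or looks for it in the wrong place) gets through.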
I don’t think either Google or BoardReader are a satisfactory replacement for the board’s own search in any case, since both lack features like searching by forum and user, searching only thread titles, sorting by date, etc. and so forth. Google is even less messageboard-aware than BoardReader is, which at least appears to understand the concept of a thread.
It’s been six days, people. The way some of you are acting you’d think it was Chinese water torture.
Why all the outrage? The announcement was made only a week ago. And even though big meetings were planned, and Ed has logged in several times since (including yesterday), perhaps you are asking too much.
I never jump into these threads. I’ve generally been satisfied with the way the boards run - albeit a bit frustrated at times - and I do appreciate the work the admins and mods put in.
But count me in as one of the (no longer) silent Teeming Millions who would just like a f*cking straight answer as to when we can reasonably expect this situation to be resolved.
All this “We’ll say more when we can” and “good news is on the horizon” bullsh*t has gotten really old.
I will happily renew my subscription if CL can get off their collective butts and make a decision - (I am aware that it’s not Jerry, Ed, etc) - but it doesn’t matter if you’re implementing a ‘two-tier system’, upgrading the DB release, or sticking electrodes up the hamster’s *sses.
We paid for a service that includes certain base functionality - search being one of the big ones. Fix it, or at least give us some information and a firm date of when we can expect to see it fixed. Renewal season for a large number of us is merely weeks away, and I for one will NOT be renewing without some straight answers on what we’re looking at and when we can expect to see some improvement or change.