As in, what do they actually do? I get this whimsical image of gnomes and fairies gluing things back together and sprinkling fairy dust over things, but these are just rows and rows of servers. What are they doing that requires hours of down-time? Defragging? Running a virus scanner? Making sure none of the programmers have secretly installed nude skin patches for the Night Elves as a quest reward?
It could be a long list of different things:
Upgrading hardware (CPU, disk drives, etc.)
Upgrading software (OS, game software, virus scanner, etc.)
Upgrading network components
Something to do with electricity in the server room, having to cut the power
Perhaps also allowing the servers’ OS to cycle (like you’re supposed to restart Windows periodically to clear variable tables or something esoteric like that)?
Ahh a fellow WoW addict. Tuesday maintenance was the bane of my existence until I got a full time job XD
I assume they turn the servers off and on to boost performance (they will do rolling restarts occasionally, but they always restart the servers at least once a week, almost always on Tuesday). They also apply hotfixes to abilities or bugfixes to some bosses, little stuff like that. Beyond that I don’t know… not a Blizzard employee myself.
Add to that, system diagnostics and (likely) scans to detect some varieties of cheating (for instance, checking for duped items).
Since you mention the pain of Tuesday mornings, I’ll assume you are a Warcraft guy. What is going on:
1: Database maintenance. This is the biggie. Databases like this, hit thousands of times a minute by thousands of users who may drop their connections mid-transaction or may be madly clicking on the latest exploit, eventually accumulate lost records, bad index entries, orphaned transactions, whatever. You may have seen this sort of thing when you try to get an item out of your mail only to be told you can’t because you already have one in inventory. The databases need to be cleaned up and rebuilt periodically. This takes a good chunk of time and is usually the main time-suck for the weekly downtime.
2: Applying server-side patches. Usually - but not always - these patches have to be loaded, tested, etc. while the database is in a sleeping state. No questing allowed while the code monkeys at Blizzard are wiring in the next wing of Icecrown or whatever.
3: Semi-intrusive reports. A lot of the reporting that they need to decide how to proceed with 1 and 2 works best if no one is playing.
4: Server maintenance. Running a check of file permissions, disk structure, file integrity, etc etc. Again, works best if no one is playing.
5: Additions to the databases. New stuff means new items in the database. Best to run those scripts while the data is NOT in flux or being accessed.
6: Restarts. These don’t take a long time, but are often necessary to kick items 1 through 5 into gear. They usually only take 15 minutes at the end of the maintenance, but you may have to do them a few times depending on what you see when the server comes back.
Critters stuck in terrain get fixed by reloading from a clean start, too.
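For the curious, item 1 can be sketched in miniature (a toy schema, nothing like Blizzard’s real one): find records whose owner no longer exists, purge them, and rebuild the indexes while nobody is logged in.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE mail (id INTEGER PRIMARY KEY, owner_id INTEGER, item TEXT);
    INSERT INTO characters VALUES (1, 'Thrall');
    INSERT INTO mail VALUES (10, 1, 'Sword');   -- valid
    INSERT INTO mail VALUES (11, 99, 'Shield'); -- orphan: character 99 is gone
""")

# Find mail whose owning character no longer exists -- the "lost records"
# that accumulate when connections drop mid-transaction.
orphans = db.execute("""
    SELECT mail.id FROM mail
    LEFT JOIN characters ON mail.owner_id = characters.id
    WHERE characters.id IS NULL
""").fetchall()
print(orphans)  # [(11,)]

# Purge the orphans and rebuild the indexes while nobody is logged in.
db.execute("DELETE FROM mail WHERE owner_id NOT IN (SELECT id FROM characters)")
db.execute("REINDEX")
db.commit()
```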
I don’t know anything about MMOs, but here’s an educated guess based on my experience with web services:
It’s software updates. Nothing else. MMOs (as with other high availability services) run on multiple redundant servers. Preventative hardware maintenance, replacing failing hardware, network upgrades, scans and diagnostics, filesystem and database maintenance are all done routinely without any need for taking the entire system down. If a server or piece of network equipment needs to be powered off or rebooted, that server is taken out of rotation first, and the other servers take over its job until it’s back online.
A software update is the only thing that might require all servers to be offline simultaneously, so as to avoid compatibility problems when some servers have been updated and some have not.
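The “take it out of rotation” idea can be sketched like this (made-up realm names, just illustrating the pattern): each server is drained from the load-balancer pool, rebooted, and put back before the next one goes down, so the service as a whole never stops.

```python
# Toy illustration of a rolling restart: the pool of servers currently
# taking players never empties, so players never see a full outage.
pool = {"realm-1", "realm-2", "realm-3"}

def rolling_restart(servers):
    for s in sorted(servers):
        pool.discard(s)           # stop routing players to this server
        assert pool, "never drain the last server in the pool"
        print(f"rebooting {s}; still serving: {sorted(pool)}")
        pool.add(s)               # reboot finished, back in rotation

rolling_restart(pool.copy())
```

A simultaneous software update breaks this pattern: if the new version can’t coexist with the old one, you can’t leave part of the pool serving while the rest is upgraded, hence the full outage.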
Servers are not rebooted as a matter of course to clear or reset things.
NB I’m talking about scheduled outages here. A power failure is a major disaster that’s likely to lead to customers leaving a data center (example).
ETA: actually there is one other thing that can require downtime: database schema changes.
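Here’s a toy illustration of why a schema change can force full downtime (hypothetical table, not Blizzard’s): code written against the old layout breaks the instant the layout changes, so you can’t have half the cluster on each version.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, gold INTEGER)")
db.execute("INSERT INTO players VALUES (1, 100)")

# The migration: rename a column that every running server still queries.
db.execute("ALTER TABLE players RENAME COLUMN gold TO copper")

# "Old" server code running against the new schema now fails outright:
try:
    db.execute("SELECT gold FROM players")
    broke = False
except sqlite3.OperationalError:
    broke = True
print(broke)  # True: old queries fail until every server is updated
```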
Something else which may be done during those periods is backups: for other database systems I’ve worked with, the admins preferred to be able to perform backups with nobody “in the system”, so the backup would be “cleaner”: you wouldn’t copy Table A, then by the time you’d gotten to Table Y which makes use of the data in Table A have Y refer to a register which your own copy of A did not have, as it had been entered during the backup.
Try telling Windows to copy a huge directory from one location to another, and then editing files in the original directory: depending on the version you have and the program you’re using, you may get some sort of error message when the system gets to the file you’ve got open (in other words, I’ve seen it happen). Now picture one of Blizzard’s servers getting a fried hard disk. Now picture thousands of extremely-pissed-off customers whose characters have gone to the big data vault in the sky… not pretty! So, backups, and depending on the admins’ preferences, during downtimes. Not because the backup needs to be done during a downtime, but because, since you’re having a downtime anyway, it’s a good time to run a backup.
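The “clean copy” problem described above is exactly why database engines ship snapshot/backup facilities: instead of copying table by table while players keep writing, the engine copies the whole database at one consistent point in time. A minimal sketch with SQLite (just standing in for whatever Blizzard actually runs):

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
src.execute("INSERT INTO items VALUES (1, 'Hat of Disguise')")
src.commit()

dst = sqlite3.connect(":memory:")
src.backup(dst)   # one consistent snapshot, not a table-by-table crawl

print(dst.execute("SELECT name FROM items").fetchall())  # [('Hat of Disguise',)]
```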
For a high availability system, backups would normally be done with replication, which continually duplicates a database to slave servers in a way that maintains integrity.
This is not unusual even in small-business setups; larger or higher-throughput operations like WoW do things like this as a matter of course.
It is not horribly unusual to see this in places like large restaurants (I have seen a four-machine cluster at a Cheesecake Factory restaurant). With 120 full tables on a Saturday night, you do not want a dead power supply stopping $5,000-$10,000 an hour in business.
Since it’s the second time someone has felt the need to point that out… that explanation wasn’t intended as an exact description of “backups in a server-heavy environment”. I know about “constant backing up” and had skipped over it because, frankly, it seems to go over the heads of most users (yeah, yeah, me bad). But I’ve also seen places which had both the constant duplication to a second server in the same room and “cold backups” (hey, maybe the admin was just old-fashioned; what he liked to say was “professionally paranoid”), which took advantage of downtimes scheduled for other reasons to perform backup copies onto physical media which would then be taken to a different location.
The ones that kill me are those which do the physical backup during downtime and then store the copies right there. It’s going to be a bitch if there’s a fire in that room, you know… if you’re going to be professionally paranoid, doesn’t it make more sense to go all the way?
Moved from GQ to the Game Room.
Greasing the servos, dusting the vacuum tubes, tightening bolts, that kind of thing.
Feeding the hamsters and oiling their wheels.
Also - this
The elves, gnomes and orcs take off the big papier-mache heads, mop their brows and head out to the closest bar. A grizzled supervisor yells at them to get some sleep as they go out the door. You may have noticed on a Wednesday morning that your character is sluggish and laggy? Just take him for a run through the Barrens a couple of times, sweat it out.
Heh…this reminds me of a “Blackwing Lair Employee Lounge” thing someone once posted (way, way back) on the official forums.
It’s a mandated time of rest for the hamsters operating the servers.
Based on an actual case of a hamster dropping dead after a particularly tough raid session. It went all the way to the Supreme Court, where Mr Biggles cast the deciding vote.
It’s Blackrock Depths and not Blackwing Lair, but I always thought of the Grim Guzzler as something like that, a place the NPC’s go to chill out when they aren’t being attacked by raid groups.