Why Do PCs Lock Up?

What is happening internally when a PC locks up? Has it hit an endless loop? And why is it so often true that nothing can break a PC out of this stupid trance?

Most operating systems these days have two different operating modes, user mode and system mode (often called kernel mode). Generally speaking, system mode is for the operating system, and user mode is for running programs. One important exception is that things like device drivers also run in system mode.

If a program in user mode goes into an endless loop, it’s usually no biggie. The system is still running, so you can use the operating system to kill the offending program.
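To make that concrete, here is a small sketch of the "kill the offending program" case. It assumes a Python interpreter is available to launch as the child, and the function name run_and_kill_runaway is made up for this illustration. The child spins in an endless loop in user mode, and the parent, which keeps running the whole time, asks the OS to end it.

```python
import subprocess
import sys
import time


def run_and_kill_runaway(grace_seconds=0.5):
    """Start a child that loops forever, then have the OS terminate it.

    Hypothetical helper for illustration only.
    """
    # The child is an ordinary user-mode process stuck in an endless loop.
    child = subprocess.Popen([sys.executable, "-c", "while True: pass"])

    # The parent (and the rest of the system) keeps running in the meantime.
    time.sleep(grace_seconds)
    still_looping = child.poll() is None

    # Ask the operating system to end the runaway process, then reap it.
    child.kill()
    child.wait(timeout=5)
    return still_looping, child.returncode


if __name__ == "__main__":
    print(run_and_kill_runaway())
```

The exact return code differs between operating systems, but it is non-zero because the process was killed rather than exiting on its own.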

If there’s a problem with a device driver or part of the operating system, though, that’s when you tend to lock up the entire PC. If a bug like an endless loop happens while the CPU is in system mode, control never returns to the rest of the OS kernel, so you can’t use the operating system to recover from it.

There are certain types of hardware problems that can cause the PC to lock up as well.

As for why nothing can break the PC out of it: usually whatever has caused the endless loop has too high a priority for anything else to take charge, or the system is in an inconsistent state and cannot be recovered to a running state anyhow.

There are ways of preventing some of these lockups with appropriate design. A CPU has a number of ways that things can get its attention. These are interrupts, and they can be raised by hardware or software. A CPU receiving an interrupt should stop what it is doing, service the interrupt, and then return to its normally scheduled processing.

If you attach a regular hardware timer to a hardware interrupt (preferably a non-maskable interrupt, or NMI, one that cannot be disabled), you can regularly take control of the CPU. This is called a watchdog timer. During your watchdog routine, you can check whether the CPU is scheduling other processes normally, and kill, restart, or reboot any subsystem that is detectably not operating correctly. A decent microkernel architecture with subsystem isolation helps, too.

Of course, this adds overhead and complexity, and the system can still be corrupted to the point that the watchdog routine fails, or failure conditions are not correctly detected. Desktop PCs and their operating systems generally do not use such a watchdog by default, but many embedded systems rely on them as a way of ensuring correctable behaviour.
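The watchdog idea can be sketched in software. This is only a simulation under stated assumptions, not how any particular hardware or OS implements it: the Watchdog class, its kick and check methods, and the simulate helper are all hypothetical names, and a simple tick counter stands in for the hardware timer interrupt. Healthy code "kicks" the watchdog regularly, and the periodic check fires a recovery action when the kicks stop.

```python
import time


class Watchdog:
    """Software stand-in for a hardware watchdog timer (illustrative only)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        # Healthy code calls this regularly to show it is still making progress.
        self.last_kick = self.clock()

    def check(self, on_expire):
        # Stands in for the periodic timer interrupt handler: if nobody has
        # kicked the watchdog within the timeout, run the recovery action.
        if self.clock() - self.last_kick > self.timeout:
            on_expire()
            self.last_kick = self.clock()  # re-arm after recovery
            return True
        return False


def simulate(hangs_after_tick, total_ticks, timeout=3):
    """Run a fake system that stops kicking the watchdog after a given tick.

    Returns the ticks at which the watchdog fired its recovery action.
    """
    now = [0]      # fake clock, advanced one tick per loop iteration
    resets = []
    dog = Watchdog(timeout, clock=lambda: now[0])
    for tick in range(1, total_ticks + 1):
        now[0] = tick
        if tick <= hangs_after_tick:
            dog.kick()  # system is healthy up to this tick
        dog.check(lambda: resets.append(tick))
    return resets


if __name__ == "__main__":
    print(simulate(hangs_after_tick=5, total_ticks=12))   # [9]
    print(simulate(hangs_after_tick=12, total_ticks=12))  # []
```

In the first run the simulated system hangs after tick 5, and the watchdog fires at tick 9, the first check where more than three ticks have passed without a kick. A real watchdog would reset or restart something at that point rather than append to a list.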

Si