Crowdstrike failure - How? Why?

Massive outage across lots of platforms worldwide. Did they not adequately test the update? Something else?

Explain as you would a child! Bonus point for the movie reference.

Two related threads exist, but this one can stay open. Remember, people: keep this factual, and use the other two for Breaking News and open discussion.

Please see excellent summary added by @Cervaise in this thread: Post #11

Best succinct explanation I’ve seen:

As for the reference: “Look, I have one job on this lousy ship. It’s stupid, but I’m going to do it, okay?”

Stranger

I think it’s pretty well known by now what happened after the update was released, but I don’t think it’s known yet what the process was that led to the update being released without being tested first.

Someone will be busy gussying up their CV, I don’t doubt. And I imagine it won’t be the boss.

That was a really fast reply that ignores the Modnote in the first post.

Flagged for FQ mods attention.

Fair enough.

FQ mod here:
I think a mod note is sufficient, so let’s move on and just not do it again.

We don’t know, but it seems likely there was a lack of quality control in verification; the patch that shipped quite obviously wasn’t tested, because the corruption would have been evident at any useful level of testing. Unfortunately, this is an increasingly common problem: software companies respond to cyberattacks by patching discovered vulnerabilities with all haste, often breaking their own established QC policies, which allows weak patches or the wrong branch of code to slip through. This one was just catastrophic instead of merely ineffective.

Or it could have been deliberate sabotage, either by a disgruntled employee or by infiltration, where again the rush to issue a patch can allow single points of vulnerability in the verification chain. That the commercial desktop operating systems are now part of our critical economic and national security infrastructure is quite concerning, as they were never designed for security from the ground up; the security features they have are mostly tacked on as an afterthought, as is the security on the development side.

Stranger

This is the first I’ve heard that the update that was sent was all nulls—i.e. basically an empty file.

That suggests a problem with the process that sends out updates, rather than a bug in the driver itself. Any test would need to have been on the process of sending out updates, and not just on the driver.

In other words, to catch this, they would need to send out the update on some testing network, and then either try to boot them or examine the resulting file.
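For what it’s worth, here is a minimal sketch, in plain C and purely for illustration (it is not Crowdstrike’s actual release tooling, which isn’t public), of the kind of last-mile sanity check a distribution pipeline could run before pushing a content file: refuse to ship anything that is empty or consists entirely of null bytes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if the file is non-empty and contains at least one non-zero byte,
   0 otherwise. A release pipeline could refuse to ship a file that fails this. */
static int looks_sane(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;

    long size = 0;
    int nonzero_seen = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        size++;
        if (c != 0)
            nonzero_seen = 1;
    }
    fclose(f);
    return size > 0 && nonzero_seen;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <update-file>\n", argv[0]);
        return EXIT_FAILURE;
    }
    if (!looks_sane(argv[1])) {
        fprintf(stderr, "refusing to ship %s: empty or all null bytes\n", argv[1]);
        return EXIT_FAILURE;
    }
    printf("%s passes the basic sanity check\n", argv[1]);
    return EXIT_SUCCESS;
}
```

A real pipeline would also verify signatures and actually load the file on a bank of test machines, but even a check this crude would have caught an all-nulls file before it left the building.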

The video posted by Stranger is excellent, but you have to have an appetite for some hairy technical details or your eyes will glaze over. It’s really clearly explained, but you do need some tolerance for references like “buffer overflow” rushing past with an assumption that you know what he’s talking about.

If that’s too technical, here’s the very very simple version.

What is this software? Crowdstrike — or more specifically, Crowdstrike Falcon, but I’ll just call it Crowdstrike for simplicity — is security software that defends computers from attack. It’s like a guard dog at your gate. If something bad tries to get into the computer, it’s supposed to protect the machine. Lots and lots of companies around the world use this software to defend their computers. It’s one of the standards in the field, and prior to this was (generally) well regarded.

What went wrong? Defensive programs like this get updates all the time. Security is not a static effort, like a big fence. The capabilities of the bad guys are constantly evolving so the protective software needs to be regularly improved to match. In this case, an update was sent out, which is normal. However, it turns out that one of the update files for Windows (and only Windows — Linux, for example, was safe) was broken in a very weird way. (The video linked above explains exactly how weird.) So when Crowdstrike tried to load the file and update the computer, the computer was unable to start up.

Why would this prevent the computer from starting up? Defensive programs like this have to run at a very deep layer on the computer; you will hear technical terms like “privileged” and “kernel,” but what this means in simple terms is that the software lives and operates at a fundamental level in order to protect the machine. This is because viruses and other bad things have become very good at exploiting holes in computer design to insert themselves at the same very deep levels, which makes them very hard to remove. So the defensive software has to run at the same deep layer if it’s going to protect the computer, and that means it’s one of the first things that gets launched when a computer is started. Unfortunately, if the defensive software locks up with a stupid error like this during startup, it blocks the computer from proceeding to anything else.

Why did this break so much of the internet? Because computer systems have become so heavily reliant on one another. Even if a specific computer system isn’t using Crowdstrike, if it’s critically dependent on another system that does use Crowdstrike, then that first system becomes useless. Consider the point-of-sale experience, where you swipe your credit card at a restaurant or grocery store or gas station. Maybe that local system doesn’t use Crowdstrike, and is still working fine. But if, after you swipe your card, the restaurant’s computer tries to contact a card-processing system that does use Crowdstrike and is broken, the restaurant can’t clear your payment. Multiply this by the thousands of systems that lean on one another to do stuff, and anywhere you get a broken link in the chain, the whole thing locks up.

How did this bad update file get sent out? Didn’t they test it? Why don’t they do phased rollouts by region? Those are the million-dollar questions right now, and lots and lots of angry people are standing at Crowdstrike’s front door demanding answers.

Outstanding post. Thank you.

Seconded! Thank you @Stranger_On_A_Train for the video and @Cervaise for the details written in 5th-grade English. I don’t understand everything, but I can now piece together, generally, what occurred.

Someone asked in another thread how it’s possible for such a mishap to occur: as in, why was this rolled out globally (leaving aside the QC question for the moment), rather than in phases, or waves? Are these security updates always deployed this way?

It varies. With Microsoft itself, updates usually propagate out over days, so the rollout isn’t instantly global. Big updates, like service packs, used to stretch out over a month.

It does seem strange to me that the roll-out was global instead of incremental, but I’m guessing server capacity and bandwidth speeds have made companies want to get their security patches out there fast, since overwhelming the servers is no longer a big hurdle.

I think it’s safe to say that anyone who’s doing it like this will stop doing it in very short order.

I have not read deeply in this, but one of the very first things I saw this morning was speculation that they were responding to a new type of attack or vulnerability. Something where speed to response would’ve been a priority.

It’s possible Crowdstrike didn’t test correctly, but it’s notoriously hard in software to test every single permutation of every configuration of every customer. As a vendor I cannot possibly do this, but customers know their specific setups well enough to keep a test lab maintained.

Maybe there is scope for Crowdstrike to tighten up their distribution procedures. But to be clear, the real culprits are the customers’ lazy IT managers who apparently configured all their systems to do immediate rollouts of whatever Crowdstrike pushed. That’s simply moronic. It should be a conscious choice predicated upon the precondition “did this change brick our test lab”.

All of these CTOs should be fired, I have no sympathy for them whatsoever.

Sure, but this didn’t affect just a few oddly-configured systems. It affected almost every computer using their software on the most popular operating system.

They probably didn’t have a choice (at least, given that they were using Crowdstrike at all). Most apps nowadays automatically accept all updates.

I don’t know whether Crowdstrike is hard-configured such that customers can’t be given control over their own update schedule. If that’s how it is, we now see how abundantly stupid that decision is. Security posture is important, but not at the expense of operating continuity.

Thinking back to Y2K, we got stacks of software CDs from vendors plastered with “Y2K tested and compliant.” We didn’t trust any of it; we performed all our own date-roll testing, because blaming the vendor would’ve done little if anything to offset our losses if anything went wrong. That’s the only responsible way to run an IT shop.

There is a good breakdown here:

Though IMO it’s overselling slightly how much of an expert analysis it is. This isn’t some obscure issue that needs a l33t haxor to get to the bottom of. This is a very basic C programming error…

  • Pointers are central to C/C++ programming. A pointer is just a number that says “go to this number of bytes past the start of memory and do something with the memory you find there.”
  • Pointers are also the reason C and C++ are considered unsafe languages, as there is often no way to ensure that the bit of memory being pointed to is actually a valid bit of memory containing what you expect.
  • But that is not what’s happening here. Here you have an invalid pointer that is very easy to verify: the null pointer (it’s just the number zero). It’s common to end up with a null pointer in lots of circumstances (e.g. “give me a pointer to a Foo!” “No Foos available, I’m giving you a null pointer”).
  • As a programmer you should always check for a null pointer whenever you are in a situation where you could be given one.
  • Crowdstrike did not do that. They were given a null pointer and just went ahead and tried to read from that bit of memory (so rather than location 10000156 they read from 156). That’s not a valid location, so the computer just barfed. (There’s a minimal sketch of this pattern after the list.)
  • Normally that would be fine: the program would crash, the user would swear and restart it, just a typical day. But this is system software running at the lowest level, without any OS protections, and when that crashes it blue-screens the computer.
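To make that concrete, here is a minimal C sketch of the pattern being described. The struct and function names are made up for illustration; Crowdstrike’s actual code isn’t public. It shows a lookup that can hand back a null pointer, the unchecked dereference that crashes, and the one-line check that prevents it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a record parsed out of a content update file. */
struct rule {
    const char *pattern;
};

/* A lookup that can fail: “no rule available, here is a null pointer”. */
static struct rule *find_rule(int id)
{
    static struct rule r = { "example-pattern" };
    if (id < 0)
        return NULL;          /* the caller is expected to handle this */
    return &r;
}

int main(void)
{
    struct rule *r = find_rule(-1);   /* we get back NULL */

    /* The bug: dereferencing without checking. Uncommenting the next line
       makes the program read from an invalid address and crash. */
    /* printf("%s\n", r->pattern); */

    /* The fix: always check for NULL before using the pointer. */
    if (r == NULL) {
        fprintf(stderr, "no rule found, skipping\n");
        return EXIT_FAILURE;
    }
    printf("%s\n", r->pattern);
    return EXIT_SUCCESS;
}
```

In an ordinary program that bad read just kills the one process; in kernel-mode code it takes the whole machine down, which is the blue screen everyone saw.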

That’s really bad. There are commonly used tools that detect this kind of very obvious error (it’s easy for a static analyzer to work out that you are dereferencing a pointer that could be null). I don’t write driver software, so nothing is going to blue-screen if I get this wrong, but I am still not allowed to check in code without this kind of check.
The fact that they apparently were not, on driver software that’s being installed on half the PCs on the planet, is crazy.