Crowdstrike failure - How? Why?

I only watched a little of that video, but I think the video just shows a screenshot of the start of the hexdump; that is, it’s just showing that the first 400 bytes of the file are all zeros. Does he somewhere confirm that the whole file is actually all zeros? Or maybe the whole file is intentionally all zeros and the “logic error” is a failure to handle that case. Although admittedly it’s hard to see how a file containing all zeros would convey useful information to the code.
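For anyone who wants to check that claim against the file itself, here is a minimal C++ sketch that reads a file and reports whether every byte is zero. The filename is just a placeholder for whichever channel file you copied off an affected machine.

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Placeholder name; substitute the actual .sys channel file.
    std::ifstream in("update.sys", std::ios::binary);
    if (!in) { std::cerr << "cannot open file\n"; return 1; }

    // Slurp the whole file into memory.
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());

    bool allZero = true;
    for (unsigned char b : bytes) {
        if (b != 0) { allZero = false; break; }
    }
    std::cout << bytes.size() << " bytes, "
              << (allZero ? "all zero" : "not all zero") << "\n";
    return 0;
}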

Starting at about 2:50 in the video he describes the file and says “the .SYS file that acted as an update to CrowdStrike somehow got shipped as all nulls” and then says basically the same thing again twice in the next 15 seconds.

So he’s pretty much on the record as claiming that every byte in the file is zero.

It’s very odd and it would be nice to get some clarification from CrowdStrike.

I’ve never used Crowdstrike, but yeah, from experience ‘realtime’ updates are a specific selling point for some corporate security software (the notion being that the earlier you get the update, the lower the risk of falling victim to an attack) - and it was built right into the client software, with no option for staging or managing it.

From a serious IT standpoint, giving up control is undesirable and risky.

However, from a management standpoint, vendor-driven updates are a big plus, because you can eliminate the IT staff that would otherwise be in charge of staging, testing, and releasing them. And even better, you have an outside party to point the finger at when things break.

I work at a fintech providing analytical and reporting services to major investment institutions. I’m in charge of our contractual-compliance posture, ensuring we can deliver everything we say we can, from business product to resiliency. On the client side, the vendor-management teams could not possibly care less about the operational details on whether our services actually work. They care only that they can freely blame us in the event something goes wrong. Their technical people do care about how our platform is designed, how our technology integrates with the systems on their side, but contract ownership is a third-party oversight matter. We play nice with their tech folks, but contractually, we are not obligated to. Their management doesn’t care. Their management only wants a club to beat us with if we fail.

I mean, I do agree with the principle expressed above. But short-sighted, incompetent management with misplaced priorities and malformed incentives is a major obstacle to healthy technology oversight in the modern marketplace, and appears to have contributed, in a broad sense, to this debacle.

Was this bug really so simple?
Basically, I’m asking: is it really possible for just one person to crash the entire internet?

Sure, it’s possible for a single thing to break a lot of stuff. The software wasn’t on every part of the Internet, but it was present on enough of the component parts to break anything that is composed of multiple parts, and that’s most things.

As for one person doing it, that really comes down to process within the software house, and the number of checks and balances that are in place, but even with all the right procedures in place, accidents can still happen because nothing in this universe can be perfect. Even supposing the update was fully tested and signed off and approved to go live, there are still ways it could go wrong - some of them exotic, others as simple as the person ultimately responsible for pressing the button selecting the wrong thing.

My money is on someone in middle management yelling ‘why isn’t this release live? I don’t want to hear excuses! Do it NOW!’ and someone in release administration either caving to that pressure, or maliciously complying, knowing it could go wrong.

Except in a case like this, no matter how the lawsuits and arbitrations and so on go, it’s not possible that the party to blame will be able to even come close to paying reparations on all of the damages they caused.

Totally agree. Now explain that to the people in management, who seem unaware that they’ll be in a queue of other claimants when a major failure condition occurs.

Wow! Thank you! That was so clear. I’ve been trying all weekend to understand this.

So this is in fact not a good breakdown; a clue would have been if I had read to the end, where he goes on racist anti-DEI and anti-Rust* rants. A non-racist and actually correct breakdown is here:

It’s not missing a null check; there is a null check in the disassembly. It’s actually a table of pointers with some uninitialized entries, so the 156 (0x9c) value could be any value that happens to be in memory.

* - I seriously don’t get this idea that the Rust programming language is somehow “woke”. Wtf is actually wrong with these idiots? I mean clearly a lot, and this is not the worst of them, but still…

Maybe a bit of a nitpick, but I don’t think that’s a valid criticism either of C++ or of what probably happened here. Every computer ever made uses pointers at the machine language level, whether it’s indirect addressing or something like the value in a register modifying the target of a memory reference instruction. And really, every high-level language uses pointers in some sense, even if they’re not called that. For instance, even in an ancient language like FORTRAN, array indices are essentially pointers, and if an array index goes wonky due to a programming error, the program will try to read or write in an inappropriate location, possibly outside of its own allocated memory.
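To make that concrete, here is a contrived C++ fragment (nothing to do with CrowdStrike’s actual code) showing that an array index is really just pointer arithmetic, and that nothing in the language stops a wonky index from reaching outside the program’s own data:

#include <iostream>

int main() {
    int table[4] = {10, 20, 30, 40};
    long badIndex = 1000000;   // a "wonky" index, e.g. computed from bad input

    // table[badIndex] is exactly *(table + badIndex): the index is just an
    // offset applied to a pointer. This read is undefined behaviour; it may
    // return garbage, or the OS may stop the process with an access
    // violation / segfault if the address falls outside its memory.
    std::cout << table[badIndex] << "\n";
    return 0;
}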

Of course it will fail due to memory protection, and will be stopped by the OS on an invalid memory access trap. And therein lies the real unsafe aspect of the Crowdstrike code: the fact that at least part of it runs in kernel mode, or at least is self-evidently able to affect the behaviour of kernel mode components. This kind of code is effectively part of the OS and can do whatever it wants, and thus has to be thoroughly tested. AIUI either this code wasn’t, or something in the deployment infrastructure wasn’t. And trust me, the deployment infrastructure can sometimes be a problem, especially if it’s old and creaky.

Unrolled the thread at Thread By @taviso - This strange tweet got >25k retweets T.. for non-twitter users.

In programmer culture being anti-woke overlaps a lot with “I use a hard language so I’m better than everyone” coder-bro culture. Any language that makes programming easier is just letting less-hard-core coders in.

Galaxy Quest.

Although all compiled machine code does essentially use pointers, it is typically managed either by the compiler or by a dedicated memory management routine to ensure that the programmer isn’t tripping over their own genitalia by either not properly initializing a pointer, dereferencing the wrong memory address, or allowing a memory buffer overrun (unintentionally using up all available allocated memory or using memory dedicated to system processes). Of languages in common use today, only C/C++ and some of their derivatives permit dynamic allocation of pointers and use of pointer arithmetic without some strong typing or trapping, and most robust languages have some kind of garbage collection so the programmer doesn’t incidentally induce a buffer overrun in some untested edge case. People here and elsewhere have really critiqued Fortran, for instance, for being obsolete even for computationally intensive programming, but you essentially can’t get a buffer overrun in Fortran without really trying, and most Fortran programmers never have any need for pointer functionality even though it was added in Fortran 90.
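To illustrate that “trapping” distinction in C++’s own terms, here is a small sketch contrasting a checked access, which catches a bad index the way a bounds-checked language would, with the unchecked access C and C++ allow by default (the vector and the index are made up for the example):

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3};

    // Checked access: .at() validates the index and throws, which is the
    // kind of trapping most managed languages perform on every access.
    try {
        std::cout << v.at(10) << "\n";
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << "\n";
    }

    // Unchecked access: operator[] is plain pointer arithmetic with no
    // bounds check. Uncommenting this line is undefined behaviour and may
    // print garbage or crash the program.
    // std::cout << v[10] << "\n";

    return 0;
}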

Why use pointers at all? In theory, they can allow for much faster and more efficient use of dynamic memory. However, while that may have been important in the ‘Eighties and early ‘Nineties, today dynamic memory is rarely the bottleneck, and the memory management of most modern languages adds almost negligible overhead to memory operations. The biggest practical reason is an application that runs in (near) real time and has to be absolutely deterministic; something like critical flight software or control of a nuclear reactor, for instance, where latencies may cause real physical problems. But these are very ad hoc systems, often using proprietary hardware that trades performance for high reliability and long uptimes, and as a consequence the software is extensively dynamically tested using a representative hardware testbed (“hardware in the loop” or HITL) to vet out both any logic problems in the algorithmic implementation and physical latencies in the hardware. There is rarely a reason that pointers have to be used in most commercial, non-safety critical applications, which is why most languages in use today have either completely eschewed pointers or allow them only within a protected and statically defined environment to prevent accidental overflows or other problems.

Rust itself isn’t any more “woke” than any computer language, but the Rust community is highly vocal about progressive politics, which has butt-hurt the significant portion of the programming industry that finds the inclusion of politics in programming objectionable even though most of them are happy to go on at great length about their typically sophomoric anarcho-capitalo-libertarian philosophies and weird obsessions with Ayn Rand and late stage Robert A. Heinlein. Rust itself purports to address a lot of the problems with performance and non-determinacy of automatic memory management at the cost of having to learn a new paradigm of programming, and if that is what being “woke” is, I’m happy to wake up and stop worrying about error trapping every time a pointer is dereferenced.

The “My language is harder and produces applications more likely to blow up if you don’t hold your mouth just so” attitude runs deep within the C++ programming community, and they enjoy nothing more than sneering at anyone who uses a language with weak typing, imperative execution, intrinsic documentation and formatting, or anything else that makes a language more accessible to newbies or infrequent users. As I do most of my programming these days in Python, I get a lot of shit of this kind even though I’ve written many dozens of significant scientific and engineering applications with thousands of lines of highly robust Python code incorporating error trapping and high-performance libraries, and which, by the way, can be read and understood as is with minimal documentation (just docstrings at this point). But then, Python doesn’t make a point of trying to intentionally create the most obfuscated code possible just as a point of obtuse pride. I’d personally be happy to never see a line of C++ (or Java) again, and if I needed to do any real bare metal or highly deterministic programming, I’d probably either go back to Fortran 90 or dive into Rust rather than wade around in the warm gasoline pool of C++.

Stranger

Discussion from a retired Microsoft engineer – https://www.youtube.com/watch?v=wAzEJxOo1ts

[Moderating]
While I’m sure that the wokeness or lack thereof of Rust is a fascinating topic, it’s a topic for a different thread.

Discussion of how C/C++ handles pointers is on topic for this thread, since it seems to be a key part of how this failure occurred. Discussion of how C differs from other programming languages in this regard is probably also relevant, because that’s likely part of why the programmers screwed up here, because they might not have been familiar with C’s habit of letting you shoot yourself in the face. But save the stuff about woke programming for another thread, in another forum (probably IMHO would be the right forum for it).

The Microsoft engineer is using the dump posted on Twitter that shows a dereference of 0x9c, and he guesses that it’s a null pointer dereference, which is a reasonable guess. But in the link that griffin1977 posted above, Tavis Ormandy notes two reasons to doubt that conclusion.

First, the faulting instruction is

mov r9d, [r8]

rather than

mov r9d, [r8+0x9c]

This is suggestive but not conclusive. But second, in another dump seen by someone else, the dereference address is 0xffff9c8e0000008a rather than 0x9c, which cannot be the result of a null pointer dereference. It seems that the code is pulling a pointer from a table and dereferencing it, but it somehow pulled an invalid pointer. It’s probably uninitialized memory, which would explain why there are different fault addresses in different dumps.
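To picture the failure mode Ormandy is describing, here is a deliberately simplified C++ sketch. It is not the actual Falcon code, and the garbage value is just borrowed from the dump above: a table of pointers where one slot never got a real value, so a null check is present but useless. Built as a 64-bit program, it dies with an access violation at a wild address, which is essentially the crash signature people were seeing.

#include <cstdint>
#include <iostream>

struct Handler { int value; };

int main() {
    Handler good{42};
    Handler* table[8] = {};    // a table of pointers meant to be filled in later
    table[0] = &good;

    // Simulate a slot that never got initialized: in real life it would hold
    // whatever bytes happened to be in that memory already, e.g. a wild
    // kernel-space-looking address like the one in the second dump.
    table[5] = reinterpret_cast<Handler*>(0xffff9c8e0000008aULL);

    Handler* h = table[5];
    if (h != nullptr) {              // the null check is there...
        std::cout << h->value;       // ...but a garbage non-null pointer sails
                                     // straight past it and faults here
    }
    return 0;
}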

I do feel like focusing on the invalid pointer dereference is really burying the lede. Like saying someone died of “lack of oxygen to the brain” without mentioning that they bled out after crashing their plane which was shot down by the Red Baron. It’s the proximate cause of failure, but the situation went wrong long before then.

When people think of a null pointer dereference, the situation they usually think of is “programmer gets handed a pointer that might be null, and doesn’t check for null before using it”. From what I’m reading, this was not the problem here at all. The program was handed pointers that were supposed to be guaranteed to be valid, and they were not (but they were not null so they were not stopped from being used by null checks).

Presumably these invalid pointers originated in the sys file that was part of the update. I don’t think there is a really effective defence against that on the code side.

No, but that is why mission critical code and all config files should be rigorously dynamically tested before release, and the release should be verified by (at least) a matching checksum before it is distributed. Clearly, there was a breakdown in this process with this CrowdStrike Falcon update.
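As a minimal sketch of the “verify before you ship” step, assuming a toy hash in place of the cryptographic hash or signature a real release pipeline would use (the filename and expected value are hypothetical):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Toy 64-bit FNV-1a hash standing in for a real cryptographic hash/signature.
uint64_t fnv1a(const std::vector<unsigned char>& data) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char b : data) {
        h ^= b;
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main() {
    // Hypothetical artifact name and the hash recorded when the build was
    // tested and signed off.
    const char* path = "update.sys";
    const uint64_t expected = 0x1234567890abcdefULL;

    std::ifstream in(path, std::ios::binary);
    if (!in) { std::cerr << "cannot open " << path << "\n"; return 1; }
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());

    if (bytes.empty() || fnv1a(bytes) != expected) {
        std::cerr << "empty file or checksum mismatch - do not distribute\n";
        return 1;
    }
    std::cout << "checksum matches the signed-off build\n";
    return 0;
}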

Stranger