Crowdstrike failure - How? Why?

This can be the big issue with security updates. Apply them now without much testing, or delay to test and leave yourself vulnerable to a flaw which is actively being exploited. I’ve not heard anything to suggest that this Crowdstrike update was specifically to address an active problem.

I’ve seen some people, here and elsewhere, being smug with various degrees of “I delay patches as long as I can” or “I’m still running Windows NT”, and it’s that kind of thinking which allows break-ins to Exchange servers that are vulnerable to a known and patched exploit. Security is an actively managed process, not a set-it-and-forget-it script or a box checked off a list 3 years ago.

Saying that applying updates without testing is irresponsible ignores the counter-argument that going unpatched (for some updates) is also irresponsible. Of course the answer is often in the middle. Sometimes testing might literally be a matter of a few minutes: patch one system, confirm it comes back and seems to be working, move on to the rest.

And you could flip that on its head and say that the obsession with immediate automatic security patching ignores the counter-argument that untested updates could brick machines worldwide. That’s actually a real thing that happened! I don’t believe any of the affected companies are saying to themselves “at least we were up-to-date on low-priority security patches”.

It would be minimal cost and effort to set up an automated test lab that auto-updates Crowdstrike patches followed by a reboot every day. If all the lab machines come up, then the production machines can get the push automatically. If not, then it sends out alarms for manual intervention. This is just basic IT management at scale.

This is why security patches are implemented ASAP

So logically, they tested the update, but somewhere between testing and putting it in the “out” tray to send to all the customers, it got changed into an all-zeros file.

Yes, elementary computer security says that each program has its own playground (memory, resources) and cannot interfere with other programs. Your Word program cannot decide to read keystrokes in your browser or start a connection to another computer and download something (usually) unless it was designed to, your browser cannot decide to start formatting disks or delete random files. (In each case, accessing another program requires it to ask the operating system, which typically asks you if this is ok. When they do display this unsafe behaviour, it is a bug that needs to be patched.)
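To make that a bit more concrete, here is a rough sketch (nothing CrowdStrike-specific, and the target PID is just an illustration) of what happens when an ordinary program asks Windows for access to another process’s memory: the operating system is the gatekeeper, and without the right privileges the request is simply refused.

// Illustration only: an ordinary user-mode program asking the OS
// for read access to some other process's memory. Unless it runs
// with sufficient privileges, Windows refuses the request.
#include <windows.h>
#include <cstdio>

int main() {
    DWORD somePid = 4;  // the System process; purely illustrative
    HANDLE h = OpenProcess(PROCESS_VM_READ, FALSE, somePid);
    if (h == NULL) {
        // Typically ERROR_ACCESS_DENIED (5) for an unprivileged caller.
        std::printf("OS said no: error %lu\n", GetLastError());
        return 1;
    }
    // Only reachable if the OS actually granted access.
    CloseHandle(h);
    return 0;
}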

Occasionally you will run across “you must run this program in administrator mode” - the program can only do certain things if it is deliberately run with those privileges.

The problem with AV like this is that it is the cop that determines what the other, less privileged programs can and cannot do, and it will jump in and stop things if it detects anomalies. If it messes up, anything can happen. In this case, it was the simplest mess: the program crashed on startup.

From what I read, this was the second most popular security program, so that’s why it had such a wide effect. And unfortunately, when the computer won’t boot, it takes a bit more effort (manual intervention) to get it up and running without the AV to the point where the problem can be fixed.

Which is the part I do not understand. There are very few programs out there that are all nulls. It would be like shipping a new auto from the factory with all the doors missing. The most cursory of final glances at the code before releasing should have caught this.

Thank you. I like the simple here.

That’s why I was suggesting something went wrong with the software that actually deploys the update to the masses. That, on their storage, they have a proper driver that works fine, but when it got packaged up and sent, something went wrong and it wrote out a file of all nulls.

You should still be able to catch that, by deploying the update to a small number of testing computers and then rebooting so that the kernel driver gets loaded. But they may have never actually seen the all-nulls file.

(Sure, this seems to be a race condition issue, where sometimes the file gets loaded before boot and sometimes it doesn’t. So it would be possible the computer would reboot just fine. But then the software should still be broken, so they’d detect that.)

So I can only think they didn’t test the update deployment. Or that someone did something that bypassed the normal checks before the update was deployed.

Though the null pointer dereference is also a big deal. Ideally, they would check the file, get a null pointer, and then gracefully back out rather than crashing. But if you never expect the file to be all zeros, I could see that being missed.
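For what it’s worth, here’s the sort of “check the file and back out gracefully” shape I mean. Everything below is made up (the struct, the file name, the validation rules); the point is only that a driver handed a garbage rules file should keep its previous rules rather than charge ahead.

// Hypothetical sketch - none of these names or formats are CrowdStrike's.
// Idea: validate the rules file up front; on any failure, keep the old rules.
#include <cstdio>
#include <cstdint>
#include <vector>

struct PipeRules { std::vector<uint8_t> bytes; };

static PipeRules lastKnownGood;   // whatever was loaded successfully last time

// Read the whole file; reject it if it is missing, empty, or all zeros.
bool readAndValidate(const char* path, PipeRules* out) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    std::vector<uint8_t> buf;
    uint8_t chunk[4096];
    size_t n;
    bool sawNonZero = false;
    while ((n = std::fread(chunk, 1, sizeof(chunk), f)) > 0) {
        for (size_t i = 0; i < n; ++i) if (chunk[i] != 0) sawNonZero = true;
        buf.insert(buf.end(), chunk, chunk + n);
    }
    std::fclose(f);
    if (buf.empty() || !sawNonZero) return false;   // the "file of nulls" case
    out->bytes = std::move(buf);
    return true;
}

const PipeRules* loadRules(const char* path) {
    PipeRules fresh;
    if (!readAndValidate(path, &fresh)) {
        std::fprintf(stderr, "rejecting bad rules file %s, keeping old rules\n", path);
        return &lastKnownGood;                      // back out gracefully, don't crash
    }
    lastKnownGood = std::move(fresh);
    return &lastKnownGood;
}

int main() { loadRules("channel-291.bin"); }        // file name is made up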

Though there are a bunch of other circumstances that would result in getting a null there, e.g. if the file got corrupted or the drive failed.

Also, presumably they didn’t get the null directly from the file, as that would mean storing a memory pointer directly in a file, which would never work, since the memory layout will change between runs (even in driver programming, AFAIK). I’d guess they are decoding the value in some way to yield a pointer, and if that decoding fails it gives a null pointer. Not checking for null in those circumstances is just crazy.
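Pure guesswork on my part, but the kind of decode-then-forget-to-check pattern I mean would look something like this: the file stores offsets rather than raw addresses, an offset gets translated into a pointer at load time, the translation returns null when the offset is out of range, and then nobody checks the result. All the names and the format here are invented.

// Speculative illustration of "decode a stored value into a pointer".
// A sane on-disk format stores offsets, not addresses; the decode step
// turns offset -> pointer and is exactly where a null can appear.
#include <cstdint>
#include <cstddef>
#include <cstdio>

struct RuleEntry { uint32_t next_offset; };

struct RuleTable {
    const uint8_t* base;   // start of the loaded file in memory
    size_t size;           // total bytes loaded
};

// Translate a stored offset into a usable pointer. Returns nullptr if the
// offset doesn't land inside the loaded data - e.g. if the file is all zeros
// and zero happens to be an invalid offset in this (made-up) format.
const RuleEntry* decodeEntry(const RuleTable& t, uint32_t offset) {
    if (offset == 0 || offset + sizeof(RuleEntry) > t.size) {
        return nullptr;
    }
    return reinterpret_cast<const RuleEntry*>(t.base + offset);
}

int main() {
    uint8_t blob[64] = {};                 // imagine this is the loaded file: all zeros
    RuleTable t{blob, sizeof(blob)};
    const RuleEntry* e = decodeEntry(t, 0);
    std::printf("entry is %s\n", e ? "valid" : "null - caller must check!");
}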

It’s a little surprising that they didn’t package it up, get a hash of the package, then run that exact package through their basic smoke-testing suite which should have spotted this issue. Also, you’d think that a package with kernel-level privileges would have some kind of checking of the signature or hash of the package before installing, or you’d have one heck of an avenue to install a killer backdoor.
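For what it’s worth, the “check the package before you trust it” step doesn’t have to be elaborate. Here’s a sketch of verifying that the bytes you are about to load match the bytes that passed testing; the FNV-1a checksum is purely illustrative, since a real update channel would verify a proper cryptographic signature instead.

// Toy integrity check: compare the loaded bytes against the hash recorded
// when the package passed testing. FNV-1a here is illustrative only;
// a real update channel would verify a cryptographic signature instead.
#include <cstdint>
#include <cstdio>
#include <vector>

uint64_t fnv1a(const std::vector<uint8_t>& data) {
    uint64_t h = 0xcbf29ce484222325ULL;            // FNV offset basis
    for (uint8_t b : data) {
        h ^= b;
        h *= 0x100000001b3ULL;                     // FNV prime
    }
    return h;
}

bool safeToInstall(const std::vector<uint8_t>& pkg, uint64_t expectedHash) {
    if (pkg.empty()) return false;                 // an empty file is never valid
    bool allZero = true;
    for (uint8_t b : pkg) if (b != 0) { allZero = false; break; }
    if (allZero) return false;                     // neither is a file of nulls
    return fnv1a(pkg) == expectedHash;             // must be the exact tested bytes
}

int main() {
    std::vector<uint8_t> pkg(1024, 0);             // pretend this came off the wire
    std::printf("install? %s\n", safeToInstall(pkg, 0x1234) ? "yes" : "no");
}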

Either they repackaged after doing their tests, or they don’t do any testing whatsoever. Either mistake would constitute a hideous failure of the release process.

I don’t know enough about kernel-level programming in Windows to speak definitively to this, but on other systems it is possible to map a file into a specific area of the address space, at which point its contents can be used as raw pointers - as long as those pointer values are calculated very carefully, of course. Given that the security software needs to run with very little overhead, I could see making similar shortcuts to avoid the pointer swizzling that would take more of that overhead.
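I don’t know what CrowdStrike actually does, but on Linux the trick looks roughly like this: ask mmap to place the file at the base address it was built for, and if the mapping actually lands there, any absolute pointers baked into the file are usable as-is with no swizzling pass. The base address and file name below are made up, and the whole thing assumes a 64-bit system.

// Linux user-space sketch of "map the file where it expects to live".
// If the file was built assuming base address 0x200000000 and the mapping
// lands exactly there, absolute pointers stored inside it work directly.
// If it doesn't land there, every stored pointer is garbage.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    void* wantedBase = reinterpret_cast<void*>(0x200000000ULL);  // made-up base
    int fd = open("rules.bin", O_RDONLY);                        // hypothetical rules file
    if (fd < 0) { perror("open"); return 1; }

    off_t len = lseek(fd, 0, SEEK_END);
    // Pass the wanted address as a hint; the kernel may or may not honor it.
    void* p = mmap(wantedBase, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED || p != wantedBase) {
        // Didn't land at the expected base: stored pointers would be invalid,
        // so a careful loader must fall back to offset-based (swizzled) access.
        std::fprintf(stderr, "could not map at expected base\n");
        close(fd);
        return 1;
    }
    // From here, structures in the file holding absolute addresses that point
    // back into [wantedBase, wantedBase + len) can be followed directly.
    munmap(p, len);
    close(fd);
    return 0;
}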

More details here:

Specifically this:

Channel File 291: This specific file contains rules for evaluating named pipes. The faulty update added logic that inadvertently caused system crashes.

So it seems the code is expecting to find some rules encoded in that file, but instead it is finding all zeros, which makes no sense, and so it returns null. Then no one is checking if the result is null, which is crazy.

Like, I wouldn’t be able to check in code like that; I’d get a snotty email from GitHub saying the checks hadn’t passed. And I don’t write driver software.

But in that case zero wouldn’t be an invalid null pointer; it would just be the start of the mapped file.

You would think that there would be some kind of inherent error trapping for any kind of memory access call that would crash the program if it were improperly defined, or if the data being dereferenced was not in the correct format. But there is a lot of really shitty C++ out in the world written by people who just assume that reference errors and memory leaks are Somebody Else’s Problem.

Stranger

It says it was a logic error, so it is returning a null pointer, but that might just be from an error, not from file corruption.

According to the post-mortem linked by @griffin1977 it was a logic error, not file corruption. Of course, it is possible I misread it or it is wrong.

The post-mortem did say that it was an update to detect a new way malware was using named pipes. There was no indication whether it was a time critical update to stop an ongoing attack, or just the general churn of keeping up with what the bad guys are doing.

It also said these were not kernel-level drivers. However, if this part of Falcon Sensor was inserting itself into every named pipe, and it was broken, then it would interfere with different processes communicating with each other. That might be enough to crash the system, because all the kernel might know is that everything in user space is going haywire.

My bet is that there is a function like this:

PipeRules* readRules(const char* filename);

That returns null if it fails, but no one ever bothers to check for null, and no one ever has, because it’s never failed before (or never failed for everyone; just the odd failure over the years when a disk fails or whatever). But then this update caused everyone to take that code path.
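In other words (entirely hypothetical, just riffing on that signature, with stand-in bodies so it runs), the difference between what I suspect shipped and what should have shipped is one if statement:

// Riffing on the hypothetical signature above - not actual CrowdStrike code.
#include <cstdio>

struct PipeRules { int dummy; };

// Stand-in for the real parser: pretend the rules file was all zeros,
// so parsing fails and we get a null back.
PipeRules* readRules(const char* /*filename*/) { return nullptr; }

void useRules(const PipeRules* rules) {
    std::printf("first field: %d\n", rules->dummy);   // crashes if rules is null
}

// What I suspect shipped: works for years, right up until readRules fails.
void applyRulesUnsafely(const char* filename) {
    useRules(readRules(filename));
}

// What should have shipped: one extra if statement.
void applyRulesSafely(const char* filename) {
    PipeRules* rules = readRules(filename);
    if (rules == nullptr) {
        std::fprintf(stderr, "bad rules file, keeping the old ones\n");
        return;
    }
    useRules(rules);
}

int main() {
    applyRulesSafely("channel-291.bin");      // logs and carries on (name is made up)
    // applyRulesUnsafely("channel-291.bin"); // would dereference null and crash
}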

It was both, as I understand it. There was a logic error that was revealed when an invalid file was sent out. The code was not checking if a pointer was valid, but they got away with it because the pointer was never null; then the file was screwed up and suddenly the pointer was always null.

Unless the rules are generated by a tool, I’d bet the people making the rules file hit error conditions often. And if they are generated by a tool, then the people writing that tool will hit the error.
I’d be shocked if this condition hadn’t come up in some development team.

(I am actually a kernel and driver developer, but play in Linux these days. The last time I did any Windows style stuff was in my cell phone days, with the embedded version of Windows)

It’s odd. I suspect there may be some degree of euphemism happening (see also: “The update… [resulted] in system instability.”)

The part I don’t understand from the Technical Post-Mortem page (which copies an entry from CrowdStrike’s blog) is

Preliminary analysis indicates that the logic flaw in Channel File 291 was not related to null bytes or the structure of the file itself.

So the only other thing I can think of would be the file’s name. That would be an odd thing to trigger a logic error, but it could happen, especially if part of the name is generated from the file’s contents somehow.

I also noticed that. I read it as “not null bytes or invalid structure,” meaning it wasn’t all zeros and the layout was correct (as in, it was meant to be 1024 x 32-byte entries, or whatever, and it was), but the data itself was corrupted even though it wasn’t all zeros. As in, they didn’t ship a file that was blatantly incorrect by being all zeros or the wrong size.

Right, but - the guy in the video linked above in post #3 by Stranger_On_A_Train shows what he claims is a hexdump of the file, and it shows all zeroes, and he claims that the entire file is all zeroes.

So either he was just wrong, on something that could be easily checked by pretty much anyone with that file on their system, or what CrowdStrike’s blog is saying is somehow evasive yet technically correct.

Or - and this just occurred to me - perhaps CrowdStrike prevents other processes from opening and reading its own data files? Perhaps it changes all references to those files to the Windows equivalent of /dev/zero ? It sounds oddly like the sort of thing a malware infection security software suite might do to protect itself from being spied upon.

Now I’m really getting paranoid.