Crowdstrike failure - How? Why?

They could very well be running on a NUC or a stick PC. You can run a basic version of Windows on those.
Best answer would be a teensy server box with a remote management interface. They exist for just this purpose. You just don’t see them in your usual corner computer store. You can get industrial quality ones for not much. What Windows brings to the party is its enterprise remote management systems. There is no way you would want to be managing hundreds of display screens by remotely logging into each box. Sure, there are tools to do that with Unix systems, but you don’t have the level of support and integration MS can provide. You can be sure it has nothing to do with the hardware solution and everything to do with the software.

You would hope. But short of getting into fibre repeaters for HDMI or DisplayPort, you are length limited on any video cable, and the whole prospect of managing more cable runs, and placing the driving computer somewhere safe nearby, likely makes the installation less reliable and more of a pain. You take an informed guess at the most likely failures and design for them. Borked software was probably a lot further down the list than all the problems that managing two bits of kit and the interlinking cables would bring.

Even micro desktops are available with tech like Intel’s AMT.

This allows out-of-band management, i.e. control of the keyboard, mouse, and power outside of the OS. It’s available on most corporate and enterprise computers and typically adds $15 to the cost.

A couple of very good videos that provide a top to bottom analysis of the Crowdstrike mess. The second is more technical. Overall I can’t fault the commentary given.

(In the first video Dave does go into a commentary about Falcon maybe using p-code from the config file to effectively run externally provided code. Calling it p-code is an odd choice. P-code was the intermediate code emitted by the original Pascal compiler, and is one of the early virtual machine systems. It assumes the listener has some pretty arcane knowledge. A lot has happened since then, and we have lots of virtual machines, with different instruction sets, and in many cases much safer operation. The most ubiquitous is the Java Virtual Machine, which doesn’t just run Java, but runs most of what makes Android Android and not just another Linux clone. It is an interesting idea, but I’m not convinced. Nobody knows what is in the config files, and I would minimally assume they are encrypted and include serious integrity checking.)

Yes, exactly.

And more than that, poorly secured kiosks or signs would be excellent recruits into a botnet that did evil things completely invisible on the public display screen. After all, the processing effort required to display the legit contents is a tiny fraction of what even a modest processor brings to the party.

How many such kiosks and/or display signs/screens does one airline or car manufacturer or major retailer like Target own? If Target screws up and leaves theirs susceptible to botting, then they’re likely all equally bottable. Once a bad actor finds that, Target’s entire fleet of umpteen thousand machines are bots within just a few minutes.

Proper network design would provide at a minimum VLAN isolation and a firewall between IoT devices like this and the rest of the corporate network like the POS terminals, but I don’t have a lot of faith in large orgs anymore.

I have some signs that are running Crowdstrike. Fortunately they are Linux based, so were not affected by this incident.

The signs need to pull information from another system for live display of camera feeds. It doesn’t matter if the signs are on their own VLAN, or otherwise isolated (those things are all good); the signs still need access to the secure systems (read only) in order to display the live feeds.

At a minimum, if someone was able to access the signs remotely, they would be able to exfiltrate camera feeds that are only intended for local consumption (though I’m not sure what use that would be). Much more damaging would be if there was a way to escalate the read-only access to the security feeds to administrator access. Yeah, hopefully the security feed computer is secure, but defense in depth, and if you don’t trust Crowdstrike (obviously not), why would you trust the camera vendor to build a hack-proof system?

So yeah, signs run Crowdstrike. In my organization, if they’d been Windows based they would have run Microsoft Defender for Endpoint. Regardless of how isolated and firewalled the signs are, they’re either completely off the network (useless for our purposes) or on the network, and do you really want something on the network that goes unmonitored? The promise of Crowdstrike and similar is to alert you if the sign starts doing something it’s not supposed to do. “Hey, is that sign supposed to be port scanning the internal network?”

Oh yeah, and calling the signs IoT devices is a bit dismissive. They’re actually Linux systems running on $120 Intel-based boxes with 8GB RAM and 256GB of storage. Certainly not useful as a crypto miner, but vastly more capable than a lightbulb or thermostat (at computing, not at messing with people).

It’s not a judgement call: an IoT device is designed for limited functionality rather than for a general-purpose user. It could be an e-ink shelf tag, an in-store advertising display, or a store directory kiosk with a touch screen; anything from a dedicated light switch to an Intel box running Windows. Microsoft even sells an IoT edition of Windows 11. There should be a stateful firewall that only allows connectivity from the VLAN to specified ports and IPs for the traffic required for the job.
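To make the allowlist idea concrete, here is a minimal C sketch of the default-deny logic such a firewall rule expresses. The addresses, ports, and structure are entirely made up for illustration; a real deployment would write this in the firewall’s own rule language (and track connection state as well), not in application code.

```c
/* Hypothetical sketch of the allowlist idea behind such a firewall rule:
 * traffic from the signage VLAN is permitted only if the destination IP
 * and port appear in an explicit list; everything else is dropped.
 * Addresses and ports are invented placeholders. */
#include <stdio.h>
#include <string.h>

struct rule { const char *dst_ip; int dst_port; };

/* The only things a sign is allowed to talk to. */
static const struct rule allow[] = {
    { "10.20.0.5", 443 },  /* content/management server (placeholder) */
    { "10.20.0.9", 8554 }, /* read-only camera feed (placeholder)     */
};

static int permitted(const char *dst_ip, int dst_port)
{
    for (size_t i = 0; i < sizeof allow / sizeof allow[0]; i++)
        if (strcmp(dst_ip, allow[i].dst_ip) == 0 && dst_port == allow[i].dst_port)
            return 1;
    return 0; /* default deny */
}

int main(void)
{
    printf("sign -> 10.20.0.9:8554 : %s\n", permitted("10.20.0.9", 8554) ? "ALLOW" : "DENY");
    printf("sign -> 10.30.0.2:445  : %s\n", permitted("10.30.0.2", 445) ? "ALLOW" : "DENY");
    return 0;
}
```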

Yes, I’d classify my signs, even if based on a general purpose computer, as IoT. It’s just that hidden in the background is a fully functional x86_64 computer with all the power of a laptop from 2018, not some embedded SoC that would force an attacker to sort out a RISC-V rootkit in 4MB of RAM. Of course the install is stripped down, with only the necessary pieces to run the sign available; unfortunately, those necessary pieces are a graphical environment and a web browser.

So yeah, do the things to make it hard for someone to get access, and also run Crowdstrike or some other software to notify you if things do go wrong.

Crowdstrike has posted a preliminary report on what went wrong. They spend a long time explaining what all the different parts of the Falcon sensor are called, but then they get to what went wrong.

One of the definition files, which describes the types of checks to run, was corrupted. That file is not executed, but it is read by another program, and the corruption caused an out-of-bounds memory read. Windows was not able to recover from the error, which caused a BSOD. This problem should have been caught in testing, but a bug in the testing let it pass.
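To make that concrete, here is a small user-space C sketch of the general failure mode: a parser trusts a count field read from a definition file and walks its table past the end. The file layout, names, and numbers below are invented for illustration, not CrowdStrike’s actual format; the point is only that a corrupted data file can crash the code that reads it, and when that code is a kernel-mode driver, the crash takes the whole machine down.

```c
/* Hypothetical sketch of the failure mode: a parser trusts a count field
 * read from a "definition file" and indexes past the end of its table.
 * The format, names, and numbers are invented for illustration only. */
#include <stdio.h>
#include <stdint.h>

#define MAX_PATTERNS 20

struct definition_header {
    uint32_t magic;
    uint32_t pattern_count; /* comes straight from the file */
};

static const char *pattern_table[MAX_PATTERNS];

static void run_checks(const struct definition_header *hdr)
{
    /* BUG: pattern_count is trusted without checking it against
     * MAX_PATTERNS. A corrupted file with a huge count drives reads
     * (and pointer dereferences) past the end of pattern_table.
     * In user space that usually means a segfault; in a kernel-mode
     * driver it means a system crash. The fix is a bounds check, e.g.
     * reject the file when hdr->pattern_count > MAX_PATTERNS. */
    for (uint32_t i = 0; i < hdr->pattern_count; i++)
        printf("checking pattern: %s\n", pattern_table[i]);
}

int main(void)
{
    for (int i = 0; i < MAX_PATTERNS; i++)
        pattern_table[i] = "benign-check";

    /* Simulate a corrupted definition file: a nonsense count. */
    struct definition_header hdr = { 0xC0FFEE, 100000 };
    run_checks(&hdr); /* reads beyond the table and will likely crash */
    return 0;
}
```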

If I read correctly, Crowdstrike normally runs their own software, but for some reason didn’t run the updated definition files before pushing them out. Possibly because they had previously updated them without any problem. Also, they released the updates to all customers at once, instead of a phased roll-out.

But they’re sending out $10 Uber Eats coupons to make up for it.

The commentary also skimmed past a snippet about a bug in the test system.
Which would not be a huge surprise. This gets us back to the Swiss Cheese analogy of failures.

You would expect that they run their own system on their own machines. Perhaps they are a Linux shop internally. But the old adage about eating your own dog food should apply on all fronts.

Yeah, it’s easier and cheaper to send data over a distance to a client machine over ethernet than it is to send video over a distance to be displayed on one screen only (broadcasting is a different thing). It doesn’t help that the industry pretty much normalised all of this by creating an abundance of very small, power-efficient client machines that are specifically designed to be fixed to the back of a display.

There is what I believe to be a fairly well balanced and grounded analysis of the problem here:

Short version: yes, it’s because Windows works in a way that makes it easier to mess up than some other OSes, but Windows works like that because everyone wants it to work like that.

There are solutions for that sort of thing. Back when I was working in IT supporting (amongst other things) public libraries, we used a solution called DeepFreeze where a client machine would be locked down in such a way that it didn’t even matter what happened during daily use because rebooting it would just revert it back to the ‘frozen’ image.

At the precise moment they are deployed, on day zero, such systems may appear to be complete and perfect and invulnerable - all known methods of interference could be shut out; nobody can change anything on them, right now.

…except such things still require routine maintenance - for example to patch a newly-discovered bug or vulnerability in the OS; on day 1, some enterprising individual discovers or creates a new way of slipping past all existing/known measures and controls; a brand new way into the system (this isn’t rare at all, on any OS).
So you now have a machine that, whilst being impervious to all previously-known attacks, is additionally impervious to being patched against this brand new attack - now, nobody can change anything on it, except the bad guys, because you locked it down in a way that nothing (that you knew about on day 0) could change it.

It might seem like the answer to that is ‘well, just make it so nobody except me can change it, but I can change it any way I like’ - which is pretty much how administrator roles and privileges work already, but then all the bad guys have to do is figure out a way to pretend to be you.

There is always a compromise between security and usability. You can make a computer more secure by disconnecting it from all networks and locking it inside a vault, but then it’s not much use for anything on a day to day basis.

The ecosystems for the dominant OSes are remarkably different. MS realised early on that backward compatibility was a very significant strength in the market. But that expectation became something of the tail wagging the dog.
Linux variants take the directly opposite approach. Android is a different beast again.
Apple realised that they could split the difference, and provide limited backward compatibility and sunset it with a hard cutoff. The manner in which they transitioned between different instruction sets was pretty impressive. This ability to just declare a shift in the OS API causes some squeaking (like the device driver change some years ago) but lets them move forward in a manner that MS find much harder.

The manner in which Apple provide a user mode environment for security add-ins like Crowdstrike is interesting. I suspect it is based on the microkernel structure of Darwin (aka Mach), which is something dear to my heart.

Enthusiastically agreed. I’ve spent thirty years in various roles in technology companies, and if I had a krugerrand for every time I heard some business-side stakeholder say something like, “why are you making this so complicated, it seems simple, you should just do X,” I would be an honorary South African banker.

(Example: A while back, it took me three weeks of patient, repetitive explanation to get an executive steering committee to understand that their personal experience with antivirus software on their workstations was entirely disconnected from the difficulty and complexity of integrating an AV tool into the background stack of a cloud-based enterprise app. Eventually they got it, but it took a long time.)

For the benefit of the less-technical reader who thinks this kind of thing should be easy, let’s dig into an illustrative example — those overhead screens at the airport showing flight status and departure times.

The monitors have one job: to render a table of flight information. That’s their only job. They don’t do anything else. You don’t want them to do anything else. You would think you could make a machine that does that one thing, and disallows any other kind of operation. That would result in a perfectly secure, highly-limited piece of technology, which would make defensive software like Crowdstrike irrelevant. Right?

So, let’s think through it. (This will be deliberately expressed in a simplified way which might irritate technically savvy readers who will notice where I’m compressing concepts and jumping over steps. I know this. Please don’t yell at me.)

First, consider the architecture. Is this a dumb monitor, which is displaying only what is being broadcast to it by a computer elsewhere? Or is it an actual limited-purpose computing device with a small onboard program, receiving and parsing packets of schedule data and transforming them into the on-screen grid? There are advantages and disadvantages either way. If it’s a dumb monitor, then you just push the vulnerability upstream to the central computer (and that, by its nature, pretty much has to be a proper computer, not just a limited-purpose device). Also, if it’s a dumb monitor, then it has to receive a display signal somehow. If it’s wireless, then that becomes a vulnerability; somebody could interrupt and hijack the signal and make the monitor display something else. This wouldn’t be good for much more than just mischief, but you still don’t want that. The alternative is a hard-wired connection for signal transmission, which requires laying and maintaining literally kilometers of cable from all the screens to a distribution station, which quickly becomes cost-prohibitive.

So let’s say you don’t like the purely-a-dumb-monitor solution, and you go with devices that have limited onboard functionality. To begin with, they cannot be strictly isolated. They need to receive packets of schedule data, as flight status changes minute to minute. They have to be willing to receive data over some sort of open connection, or they don’t serve their intended function to show flight status being updated. This constitutes a potential vulnerability, yes? Okay, but the computer is just receiving a structured data file, and it’s transforming the structured data into the visual table of departure information, and that’s all it’s doing, so this should be simple, right? Not necessarily. Any time you have data incoming like this, you need to consider overflow attacks, where somebody figures out how to flood the intake channel in such a way that stuff begins spilling outside the normal boundaries, and overwrites things in memory that weren’t supposed to be overwritten. In other words, you may put your simple display program on the device so that’s all it can do, but after an overflow attack, some or all of your simple program gets replaced with something else.
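For readers who want to see what that looks like in code, here is a deliberately simplified C sketch (the record layout and field names are invented): an incoming “status” string is copied into a fixed-size buffer, and the unbounded version of the copy is exactly the kind of hole an overflow attack exploits.

```c
/* Deliberately simplified sketch of an overflow-prone intake path.
 * The record layout and field sizes are invented for illustration. */
#include <stdio.h>
#include <string.h>

struct flight_row {
    char flight[8];
    char status[16];
};

/* Unsafe: strcpy copies until it hits a NUL byte, so a "status" longer
 * than 16 bytes spills over whatever sits next to the buffer. Carefully
 * crafted overflow data is how an attacker overwrites things in memory
 * that were never supposed to change. */
static void ingest_status_unsafe(struct flight_row *row, const char *incoming)
{
    strcpy(row->status, incoming);
}

/* Defensive: the copy is bounded by the destination size, and anything
 * longer is truncated rather than spilling over. */
static void ingest_status_safe(struct flight_row *row, const char *incoming)
{
    snprintf(row->status, sizeof row->status, "%s", incoming);
}

int main(void)
{
    struct flight_row row = { "QF123", "" };

    ingest_status_safe(&row, "ON TIME");
    printf("%-8s %s\n", row.flight, row.status);

    /* Calling ingest_status_unsafe() with a hostile, oversized string
     * would overflow row.status; the safe version just truncates it. */
    ingest_status_safe(&row, "DELAYED - AN IMPLAUSIBLY LONG STATUS MESSAGE");
    printf("%-8s %s\n", row.flight, row.status);
    return 0;
}
```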

Okay, so let’s lock down that display program so it can’t be affected or altered by an overflow attack. There’s a thing called protected memory, where you authorize your various processes to take action only in certain areas and prevent them from acting outside those areas. This is pretty standard nowadays, and there are several ways to approach it. They’re not foolproof (e.g. if you use some form of key-based permissioning, you have to guard against key breach), and some sophisticated attacks have been devised which get around the defense, but this does block the more elementary intrusion attempts.
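As a rough illustration of the protected-memory idea, here is a minimal POSIX C sketch (Linux/macOS) that marks a page read-only once it has been set up, so any later attempt to scribble on it faults instead of silently succeeding. This is just one of the mechanisms the paragraph alludes to, and the “display configuration” contents are placeholders.

```c
/* Minimal POSIX sketch of protected memory: the "display configuration"
 * page is made read-only after set-up, so later writes fault instead of
 * silently corrupting it. Linux/macOS; contents are placeholders. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* One page of anonymous memory for the display settings. */
    char *cfg = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cfg == MAP_FAILED) { perror("mmap"); return 1; }
    strcpy(cfg, "columns=FLIGHT,GATE,STATUS");

    /* Lock it down: from here on the page is read-only. */
    if (mprotect(cfg, (size_t)page, PROT_READ) != 0) { perror("mprotect"); return 1; }

    printf("config: %s\n", cfg); /* reading still works */

    /* cfg[0] = 'X';  <- uncommenting this write now raises SIGSEGV,
     * which is the point: a stray or malicious write cannot quietly
     * alter the protected region. */

    munmap(cfg, (size_t)page);
    return 0;
}
```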

That sounds complicated, though. What if we totally lock down the display program by essentially hard-wiring it? Consider an old Space Invaders stand-up console, where the game’s code is permanently written on a handful of physical chips. There’s nothing rewriteable about this, so there’s nothing to attack. Yes?

Yes, that’s true, but now you’ve locked yourself out from making any changes at all to the display program. Once encoded, that’s it, it’s static and unchanging, unless physically replaced. So if something happens in the future that requires you to change the display program somehow, you can’t do it. Let’s say your airport started small, and your program switches back and forth between two pages of data; but now you’ve grown, and you want to cycle through an arbitrary number of pages based on the number of flights. Or, let’s say there’s a regulatory change which requires airlines to state not just whether a flight is on-time or delayed, but also how much it’s delayed; you need a new column in your display table. You can imagine all kinds of scenarios which require you to be able to update your program in response to evolving requirements. But if you’ve locked the machine down with unchangeable read-only memory, you can’t, unless you’re willing to physically rip out all the devices and replace them with new models.

Okay, so go back to the protected-memory idea. You’ve got a simple device which is capable of running this small info-display program, and you keep that small program in some sort of protected memory, isolated from the part of the computer that ingests and parses data packets and passes the structured information to the display renderer, so an overflow attack on the intake channel shouldn’t be able to affect the program. Right? Yes, great. But — you still need to be able to update that protected program, whenever the display requirements change. So there still has to be a way of accessing or unlocking the protected memory and deploying a new program. That, obviously, constitutes a different kind of vulnerability.

Okay, so, let’s put in a parallel watchdog program that monitors the display program, and prevents it from being changed unless the change request is confirmed to be legitimate. How would this work? Maybe it has its own copy of the display program for comparison; anything that isn’t part of that reference copy is disallowed. Okay, but any time you want to update the actual display program, you first have to update this reference copy used by the watchdog program. That means the watchdog program itself has to be accessible and updateable, which just moves the potential vulnerability somewhere else.
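A toy version of that reference-copy watchdog might look like the C sketch below: it simply compares the deployed display program against a trusted copy, byte for byte, and shouts if they differ. The file paths are placeholders, and a real implementation would more likely verify a cryptographic hash against a signed manifest; the point is that the reference itself now becomes something you have to protect and update.

```c
/* Toy watchdog sketch: compare the deployed display program against a
 * trusted reference copy, byte for byte, and report any difference.
 * The paths are placeholders for illustration. */
#include <stdio.h>

static int files_identical(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    if (!fa || !fb) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return -1; /* couldn't read one of the files */
    }

    int ca, cb, same = 1;
    do {
        ca = fgetc(fa);
        cb = fgetc(fb);
        if (ca != cb) { same = 0; break; }
    } while (ca != EOF);

    fclose(fa);
    fclose(fb);
    return same;
}

int main(void)
{
    int ok = files_identical("/opt/sign/display_program",           /* placeholder path */
                             "/opt/sign/reference/display_program"  /* placeholder path */);
    if (ok == 1)
        puts("watchdog: display program matches the reference copy");
    else if (ok == 0)
        puts("watchdog: MISMATCH - the display program has been altered");
    else
        puts("watchdog: could not read one of the files");
    return 0;
}
```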

All right, well, what about a watchdog program that watches for any changes, and blocks anything that doesn’t have some form of authorization? Say, a combination of a security key and an originating internet address, which tells the watchdog program, this is a valid change; anything else is seen as a potential attack, and is refused. Right? Where do we find this kind of watchdog?

Well, hello, Crowdstrike.

That’s why this stuff is so difficult. As Mangetout suggests, the more you lock down a system and prevent it from doing anything, the less useful it becomes. Any time you want a system to be able to do anything useful, by definition that capability becomes vulnerable to exploit. It’s inherent and fundamental, and there’s no way around it. And so these programs and devices need some sort of defense if they’re going to be deployed in the world. It’s inescapable.

Absolutely. Apple in particular made it something of a feature that you have to periodically just line up and buy a new one of the thing you already have; they found (or perhaps created) a customer base that loves doing that. It’s a lot easier to make sweeping changes to your security model when you can decide to just leave the past in the dust, but that lack of continuity is part of the reason why Apple isn’t as popular a choice as Windows when it comes to deploying widgets to run information screens - people want a solution they can install now and still be able to add or replace widgets-on-screens ten years in the future.

There are intermediate options. For instance, you could put a program in hardwired or protected memory that does nothing but display a page represented by a simplified subset of HTML code. That could be a very simple program, simple enough to be (relatively) easy to (mostly) secure, but it still gives you enough flexibility to change the format of the table of information you’re displaying. And since what you’re sending to the device is just a text file (also simple), it’s also easy to secure that.
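As a sketch of how small such a renderer could be, here is a C toy (the tag names and sample page are invented) that accepts only a fixed allowlist of tags and refuses the whole page if anything else appears. That strictness is what makes the approach comparatively easy to audit.

```c
/* Toy renderer for a deliberately tiny markup subset: anything outside a
 * short allowlist of tags causes the whole page to be refused. Tag names
 * and the sample page are invented; the point is that a small, strict
 * grammar is far easier to reason about than a full HTML engine. */
#include <stdio.h>
#include <string.h>

static const char *allowed[] = { "page", "/page", "row", "/row", "cell", "/cell" };

static int is_allowed(const char *tag)
{
    for (size_t i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (strcmp(tag, allowed[i]) == 0)
            return 1;
    return 0;
}

/* Print the page as plain text, or return -1 on anything unexpected. */
static int render(const char *src)
{
    char tag[16];
    for (const char *p = src; *p; p++) {
        if (*p != '<') { putchar(*p); continue; }

        size_t n = 0;
        for (p++; *p && *p != '>' && n < sizeof tag - 1; p++)
            tag[n++] = *p;
        tag[n] = '\0';
        if (*p != '>' || !is_allowed(tag))
            return -1;              /* malformed or unknown tag: refuse */

        if (strcmp(tag, "/cell") == 0)
            putchar('\t');          /* cells become columns */
        else if (strcmp(tag, "/row") == 0)
            putchar('\n');          /* rows become lines */
    }
    return 0;
}

int main(void)
{
    const char *page =
        "<page>"
        "<row><cell>QF123</cell><cell>GATE 7</cell><cell>ON TIME</cell></row>"
        "<row><cell>BA456</cell><cell>GATE 12</cell><cell>DELAYED</cell></row>"
        "</page>";

    if (render(page) != 0)
        puts("rejected: page uses markup outside the allowed subset");
    return 0;
}
```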

You do still have the system which sends out the text files, but that’s presumably the same system which communicates with the airlines and the control tower, so they can tell which planes to go to which gates. You have to secure that central system, but you already had to secure it, with or without the user-facing displays.

Yet another option: Take two or more of those different options, with completely different architecture, and deploy all of them. All of them will still be potentially vulnerable to attack, but they won’t all be vulnerable to the same attacks, and so if one goes down, you probably have others that are still working while you’re fixing that one.

Though of course, this still has its issues. It’s OK-ish if one of your systems just stops working entirely, and your customers just go down to the next bank of display screens to see their flight info. But what if an attack has the effect of making the screens display something that looks right but isn’t? Even if people notice that different displays are showing different information, they won’t know which one is right.

Oh, sure, of course, it’s possible to come up with a workable and reasonably secure solution. But the point is, you have to think through the implications and design it carefully to get to that solution, versus the knee-jerk “just make something quick and simple!” demand we get from business-side stakeholders who don’t understand technology analysis.

Maybe I’ve just been doing this too long, but in my experience the hard part of projects like this is not solving the technical problem, it’s explaining to non-technical people why this solution should be preferred, and why shortcuts are inadvisable.

Having backup systems that work on a completely different architecture costs more to implement and support. It’s a sensible thing to do in highly critical settings like the control tower itself, but the public-facing information screens are probably not an example of the kind of environment where a for-profit business would find it tolerable to maintain two parallel architectures.

Sooner or later, but probably sooner, some bean counter is going to ask “Why are we paying for two different solutions that do the same thing? Why are we employing two different-but-basically-similar sets of skilled devs? Why don’t we consolidate those and save money?”

In that scenario, arguments about failsafes probably don’t work unless something like Crowdstrike has already happened in the recent past (and not always even then - I worked for a company that had a site that had burned to the ground within the memory of most people working there, and still there were questions about why we needed to spend all this money on sprinkler systems and waste square feet of floor that could have been used for storage space, just to keep a fire door clear).

I’ve been in this business for 30+ years. I can’t tell you how many times our customers have told me during the analysis phase that they need high availability. Once the architecture is drawn out and priced, most of them backtrack on that requirement.