Crowdstrike failure - How? Why?

Why don’t you just get a normal setup, and make sure it never goes wrong.

Yeah, one of the big, long-standing takeaways for lay observers asking about these sorts of tech failures has to be — there’s the right way to do a thing, and then there’s the way the thing actually gets done when management finds out how much the right way costs.

A friend once said that with a sufficiently powerful CPU we could do everything in Excel.

The reality is that we need to deal with patching, hardware maintenance, power issues, etc. and there is a cost to this. Accepting an outage of 60 minutes is often the alternative.

‘Accepting’ the risks was a thing I had fun with in my last IT role; typical options for risk management are:

  • Avoid (do something in advance so the risk simply can’t happen - like, dispose of the risky system)
  • Mitigate (do something to make it less likely to happen, or less severe if it does happen)
  • Transfer (give it to a third party to manage)
  • Accept (do nothing, just understand that it will be a bad day if it does happen)

But most of what I kept getting told by senior management was: “We accept this risk, so there’s no need for you to waste time and money mitigating it. Just make sure it doesn’t happen”

Heh. I recently took over my company’s risk-assessment function, and when I revised the board’s presentation template, I made sure to include a definition section whose first entry said that accepting a risk meant no special mitigations were in place and no further action was planned. They didn’t like that. Too bad.

“Thoughts and prayers!”

If most sales types and C-suite people didn’t have magical thinking, they’d have no thinking at all.

And Crowdstrike enters the next phase: potential litigation.

Delta Airlines is reported to have hired David Boies for a potential lawsuit. Delta’s losses are estimated at around $500 million.

Globally, the cost of the Crowdstrike matter is estimated at around $5 billion.

As of today, Crowdstrike’s market cap is $68 billion. Given that they have shed well over 10% of their value since the disaster, they have lost a similar or greater amount themselves. Which is somewhat ironic.
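Back-of-the-envelope, assuming the 68 billion figure is the market cap after the drop and treating 10% as a floor for the decline, the arithmetic can be sketched like this (the function name is just for illustration):

```python
def value_lost(current_cap: float, fraction_shed: float) -> float:
    """Value lost, given the cap AFTER the drop and the fraction shed.

    If the cap fell by `fraction_shed`, the pre-drop cap was
    current_cap / (1 - fraction_shed); the loss is the difference.
    """
    pre_drop_cap = current_cap / (1.0 - fraction_shed)
    return pre_drop_cap - current_cap


# Figures in billions of dollars; 10% is the lower bound quoted above.
loss = value_lost(68.0, 0.10)
print(round(loss, 2))  # 7.56
```

So even at exactly 10%, the loss in market value comes out above the estimated 5 billion global cost.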

Maybe the best answer is for them to negotiate a settlement paid in shares. No matter what, getting past the looming legal storm early would probably boost their share value more than fighting ever would. But it would take a CEO with balls of tungsten carbide.

And I know that Crowdstrike no doubt has iron-clad disclaimers of liability built into every licensing agreement, as the article mentions.

But when the magnitude of the loss is so high, and the cause of the loss is such an obvious error, possibly amounting to gross negligence (the “no testing needed, we’ll just fix it if it breaks something” attitude, mentioned earlier in the thread, if proven), I could see the lawsuits getting traction.

Plus, if it’s not just Delta, but a stream of lawsuits from others, keeping Crowdstrike’s negligence in the news for a few years, that’s not good for their market reputation.

From the linked article:

Boies has represented Theranos founder Elizabeth Holmes, Al Gore in the 2000 presidential election, and the US government in an antitrust case against Microsoft in 1998.

Not a stellar track record, IMO.

Microsoft has found someone else to blame for Delta’s problems: Delta! It says Delta’s outdated software made the Crowdstrike problem even harder to fix.

Delta denies it, and says that Crowdstrike was nowhere to be found when they needed help.

MS says they offered help, but Delta declined it.

And the litigation continues to mount: Delta is now the target of a class action by passengers.

That is true. In my data structures class a few of us tried to make our programs fancy, and then when the professor ran his data through, there would be a zero somewhere that crashed our menu system or error catching (ironic) or something. But in this case, all nulls? No way that passes even alpha testing.

As a person that showed up to DEN at 6:00am on that morning for a flight to Rome, I can tell you that some airlines were able to adapt (OMG! Actually going to give Air Canada props on this one. :astonished:) Others had the attitude of there’s not a thing I can do about it. I wonder if that will enter into the liability somehow.

I did just learn recently that the parser is actually Regex based!

For those who don’t know, Regex is short for Regular Expressions, and is a syntax used primarily for searching inside strings of text. It will also commonly let you do “Find and Replace.” One common mistake among people who have just learned it is trying to use it for everything.

For example, /a.[0-9]/ would match the letter a, followed by one other character, and then a numeral. So it would match “ab2” in the string “bab2cat”, “a32” in the string “x0oa32” but would not match anything in “a5od2”.
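If anyone wants to try that pattern themselves, here is a minimal sketch using Python's standard re module. The helper name first_match is made up for illustration, and Python is just my choice of language here.

```python
import re

# Same pattern as above: an "a", then any one character, then a digit.
PATTERN = re.compile(r"a.[0-9]")


def first_match(text):
    """Return the first substring matching the pattern, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


print(first_match("bab2cat"))  # ab2
print(first_match("x0oa32"))   # a32
print(first_match("a5od2"))    # None
```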

I’m not all that knowledgeable in this sort of programming, but using Regex as your parser for important files does sound alarm bells in my head. For one thing, Regexes don’t return errors.

I think Porter was the only Canadian airline hit by the Crowdstrike FUBAR.

I’m not sure what you mean by that. A regex match will certainly tell you if the string doesn’t match the pattern. That’s the whole point of a regex match. Regex has limitations in the type of pattern it can match, but if the pattern you’re trying to match is of an appropriate type, using regex is perfectly fine.
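A rough sketch of that point in Python: a failed match does come back as an explicit "no match" (None here), but it is up to the calling code to check for it and turn it into an error. The record format and function names below are invented for illustration and say nothing about how Crowdstrike's parser actually works.

```python
import re

# Hypothetical record format, made up for this example: "name=<word>;count=<digits>"
RECORD = re.compile(r"name=(\w+);count=(\d+)")


def parse_record(line: str):
    """Parse one record, raising ValueError if the whole line does not match."""
    m = RECORD.fullmatch(line)
    if m is None:
        # The regex engine only reports "no match"; the caller decides it is an error.
        raise ValueError(f"malformed record: {line!r}")
    return m.group(1), int(m.group(2))


def try_parse(line: str):
    """Return the parsed record, or None when the input is rejected."""
    try:
        return parse_record(line)
    except ValueError:
        return None


print(try_parse("name=foo;count=3"))  # ('foo', 3)
print(try_parse("\x00" * 16))         # None
```

The trouble starts when code skips the None check and uses the match result anyway, which is a bug in the caller rather than a limitation of regex.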

I suspect this was a packaging/distribution issue. A lot of software will go through exhaustive testing of the business logic because that’s what is most likely to cause catastrophic failure. But people will cut corners on testing the packaging and distribution because this rarely changes, it adds cost to the test cycle, and it’s uncommon for bugs to show up there.
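Purely as an illustration of how cheap a packaging-stage check can be, here is a sketch that rejects a payload that is empty or nothing but null bytes, like the "all nulls" file mentioned earlier in the thread. The function name and the rules are my own assumptions, not anything known about Crowdstrike's release pipeline.

```python
def looks_sane(payload: bytes) -> bool:
    """Hypothetical pre-distribution check: reject empty or all-null payloads."""
    if not payload:
        return False
    # At least one byte must be non-zero for the file to carry any content.
    return any(b != 0 for b in payload)


print(looks_sane(b"\x00" * 1024))      # False
print(looks_sane(b"\x00\x01rule"))     # True
```

A real pipeline would want far more than this (checksums, a test load on a real machine), but even a check this small would catch an all-null file before it shipped.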