Exactly this.
We’ve been building shelters for how many hundreds of thousands of years? And building codes are what, maybe a few hundred years old? We had to first develop the relevant experience by trial and error.
The internet is still young. Cloudflare (and AWS, etc.) are pioneers, and they are the ones doing the experiments and learning the hard lessons that may one day become an ISO standard.
Case in point: Yesterday’s outage was caused by a botched configuration update, but the reason that update was able to bring down their whole network is that Cloudflare needs a rapidly syncing, globally deployable configuration system in order to fight rapidly evolving botnets. It’s a double-edged sword. Lose the ability to quickly update all their interconnected systems, and botnets can take over regional or national networks faster than Cloudflare can keep up. But move too quickly and tread too carelessly, and a botched update can have an even worse impact than a botnet, as happened yesterday. The trick is to invent a balanced system, with even more layers of testing and checking than they currently have, to find the best of both worlds. It is not an easy job at all.
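To make that balancing act concrete, here’s a minimal sketch (in Python, with entirely hypothetical names; Cloudflare’s actual pipeline is not public in this form) of what a gated global config rollout might look like: validate, canary to a small slice of the fleet, check health, then widen or abort.

```python
import time

# Toy sketch of a staged ("canary") global config rollout. Nothing here
# reflects Cloudflare's real systems; it only illustrates the
# speed-vs-safety tradeoff described above.

STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per stage
ERROR_BUDGET = 0.002               # abort if the error rate exceeds this

def validate(config: dict) -> bool:
    """Static checks before anything ships: schema, size, sanity limits."""
    return bool(config) and len(str(config)) < 1_000_000

def deploy_to_fraction(config: dict, fraction: float) -> None:
    """Push the config to `fraction` of edge nodes (stubbed out here)."""
    print(f"deploying to {fraction:.0%} of fleet")

def observed_error_rate() -> float:
    """Read health metrics back from the nodes just updated (stubbed)."""
    return 0.0001

def rollback() -> None:
    print("rolling back to last known-good config")

def rollout(config: dict) -> bool:
    if not validate(config):
        print("config rejected before deploy")
        return False
    for fraction in STAGES:
        deploy_to_fraction(config, fraction)
        time.sleep(1)  # soak time: seconds here, minutes-to-hours in reality
        if observed_error_rate() > ERROR_BUDGET:
            rollback()
            return False
    return True

if __name__ == "__main__":
    rollout({"rule": "block-new-botnet-signature"})
```

Every extra stage and every extra minute of soak time in that loop is time a fast-moving botnet gets for free, which is presumably why a bot-fighting config path ends up skipping much of it. That is the double edge.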
It makes the news because Cloudflare is so good at what they do (especially blocking bots) that half the Western internet depends on them. Before Cloudflare — and I remember this distinctly from firsthand experience — blocking bots and scrapers was a very, very difficult problem that sysadmins normally had to tackle on their own with a mishmash of filters and firewall rules, usually staying barely one step ahead. There were other companies at the time offering similar protective services, but Cloudflare came on the scene and ate them all overnight because it was so tremendously better and cheaper. Over the next decade or so, we went from “many small failures all over the world, all the time, that you never heard about because each hit only one website” to “it’s now trivially easy for a website to stay up 99%+ of the time, and bots can be blocked with a single click in Cloudflare — at the cost of an occasional half-day global outage affecting millions of sites, every few years”.
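For flavor, the typical pre-Cloudflare “mishmash” amounted to hand-rolled filters along these lines (a toy Python sketch with made-up substrings and limits, not any real product): a user-agent blocklist plus naive per-IP rate limiting.

```python
import time
from collections import defaultdict, deque

# Toy example of hand-rolled bot filtering: a user-agent blocklist plus
# naive per-IP rate limiting. The substrings and limits are made up for
# illustration; real setups were piles of ad-hoc rules like this.

BLOCKED_UA_SUBSTRINGS = ["curl", "python-requests", "scrapy"]
MAX_REQUESTS = 100   # allowed per IP...
WINDOW_SECONDS = 60  # ...per minute

recent: dict[str, deque] = defaultdict(deque)

def should_block(ip: str, user_agent: str) -> bool:
    # Blocklist check: trivially evaded by spoofing the user agent.
    ua = user_agent.lower()
    if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
        return True
    # Sliding-window rate limit: trivially evaded by rotating IPs.
    now = time.monotonic()
    window = recent[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS

print(should_block("203.0.113.7", "python-requests/2.31"))  # True
```

Rotate the user agent and spread the traffic across a botnet’s IPs, and both checks go blind; hence “barely one step ahead”.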
Is that a worthwhile tradeoff? Debatable, but I definitely don’t think we’re ready to make that call at a regulatory level quite yet. The filtering systems, machine learning, networks themselves, etc. are all still constantly evolving, perhaps even faster now than before, and there’s no way our regulatory framework can keep up with the careful balancing acts required. Certainly not with the current government.
These regulations are typically more effective (as in: sufficient wisdom and teeth, but minimal blast radius) when written as after-the-fact lessons rather than premature speculation. Otherwise you end up with things like the “cookie laws” that, while well-intentioned, are too easily loopholed and corrupted/co-opted into cookie banner spam.
Meanwhile, there are systems like Enhanced 911 that apply to cell phone carriers. That might be an interesting case study… cell networks are arguably a more critical piece of infrastructure than Cloudflare (for the time being, but maybe not for much longer?), and yet the government was able to coerce the carriers & handset manufacturers into adopting a system for life-saving purposes.
Outside of specific and hyper-targeted regulations like that, I think it’s (much) too soon to try to regulate the infrastructure into compliance. Nobody really knows all the best practices yet; they’re constantly being invented day-to-day, and very often by Cloudflare itself. In the meantime, slow standards do eventually arise, such as ISO 27001 for cybersec stuff — which isn’t perfect and has many flaws, but it does try to strike a balance between “better than nothing” and “slows down innovation and experimentation too much”.
* For anyone working (or just remotely interested) in networking/devops/etc., Cloudflare’s postmortems are some of the best technical analyses of failure events that I’ve ever had the pleasure of reading. They’re often written by their CEO, who trained as a lawyer, became a developer, and now leads the company. These postmortems are thoroughly readable for anyone with basic networking experience, and they are incredibly informative, transparent, and honest. They do a lot to engender trust in Cloudflare the company and to help propel internet best practices forward for other companies. Even more than AWS, Cloudflare is often at the forefront of global hyperscale networking, bot detection, and attack mitigation — that is their main niche, vs AWS’s rental-computers business — and the rest of us mostly just follow in their footsteps and learn from their successes and mistakes. The news media, especially non-tech journalists and especially TV journalists, very rarely bother to actually understand what happened. But Cloudflare’s first-party after-the-fact reports are superb and well worth the read.
I know that sounds like an ad (and I am one of their customers), but Cloudflare really is one of those incredible unicorns that does a very difficult job very well, very often. The market share they now have was hard-won through raw technical merit (and pricing). The level of thoughtfulness in a typical Cloudflare postmortem rivals, and often far exceeds, anything you’d read in a law, regulation, or standard. It is exactly that sort of raw industry experience that’s needed to one day write the codes in the first place.