Building code for internet infrastructure?

When I heard the news about Cloudflare going down on the heels of the outage of the AWS us-east-1 region, I wondered whether there should be a code for internet infrastructure similar to those for the construction trades. Since it’s extremely unlikely that I am the first person to have thought of this, does anyone know whether this is considered a good idea and/or whether any laws have been proposed for it?

I used to work for a company that had to comply with PCI-DSS, and I know about HIPAA IT compliance requirements. I also know that Amazon and Cloudflare incur big reputational hits, and possibly financial ones as well, so there is an incentive not to cut corners. However, there are companies that may not have any particular incentive to defend their systems against accidental outages or cyberattacks (looking at you, Equifax).

Additionally, I have been put in charge of the infrastructure-as-code software at work. I am a developer, but I don’t have a devops or IT background and am not sure whether I am following all best practices with respect to robustness and security. Nothing we do is mission-critical, all our data is public, and some of our cybersecurity is managed by a third party (e.g. I can’t create security groups by myself), so it’s probably fine that I’m in charge of this stuff. But for all I know, our nation’s critical infrastructure may be in the hands of someone who just graduated. Would it make sense to have a national information infrastructure code?

Most codes are established by professional organizations and are then adopted by the authority having jurisdiction as the standard. So I think it’s probably unlikely that this is something that would come about through legislation. That being said, isn’t IEEE the organization that does this type of thing already?

A good chunk of your nation’s critical infrastructure is in the hands of people in another nation with tonnes of experience. The internet is global and much of its development and maintenance is managed by multi-national companies with offices all over the world.

It occurs to me that a “building code” would already be obsolete again by the time each annual review was finished.

Things move fast. Code evolves insanely fast, especially now that much cloud infrastructure is defined in code, in TypeScript or Python for example, instead of the bane of my life, Markdown.

So how do you enforce a code against my old Markdown file, which differs only slightly from my new TypeScript release?
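For anyone who hasn’t worked with it, “infrastructure defined in code” looks roughly like the sketch below, assuming the AWS CDK for Python; the stack and bucket names are invented purely for illustration.

```python
# A minimal sketch of infrastructure-as-code, assuming the AWS CDK for Python
# (pip install aws-cdk-lib). Stack and bucket names are made up for illustration.
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct


class StaticSiteStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The bucket is declared like any other object; deploying the stack
        # creates or updates the real resource to match this definition.
        s3.Bucket(
            self,
            "SiteAssets",
            versioned=True,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )


app = App()
StaticSiteStack(app, "static-site")
app.synth()
```

The blueprint and the building are effectively the same artifact, and it changes with every release, which is exactly why an annually reviewed code would struggle to pin it down.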

The web has always been the wild west, even though some parts got somewhat civilised.

… that said, we do have ISO standards, 8601 being my absolute favorite. It deals with the hardest data type, “DateTime”: dates and times vary so wildly across regions and cultures that they are absolutely the hardest variable type we have to deal with.
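A quick illustration of what that standard buys you in practice; a minimal Python sketch using only the standard library:

```python
# Why ISO 8601 is such a relief: one unambiguous, sortable, timezone-aware
# representation, straight from the standard library.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
stamp = now.isoformat(timespec="seconds")  # e.g. "2025-11-19T13:30:00+00:00"
parsed = datetime.fromisoformat(stamp)     # round-trips without guessing the format
assert parsed == now.replace(microsecond=0)
```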

Another (somewhat related) misalignment with the “building code” analogy is that, in the real physical world, there’s almost always a site inspection to evaluate compliance prior to occupancy. There are a variety of audit standards in technology, but they’re conducted after implementation, validating that an existing production system meets the stated requirements of the standard. A “building code” metaphor suggests that the technology would need to be examined prior to launch, i.e. before it’s exposed to potential attack.

That’s obviously not at all how things work today, and it seems implausible in the extreme that it would be practical or acceptable to anyone in the technology space. It would also create a huge bottleneck, unless sufficient inspection resources were devoted to pre-launch screening. It also doesn’t guarantee a secure system; zero-days are discovered all the time in supposedly proven and widely adopted hardware and software.

I agree that the technology landscape is pretty terrible and very badly managed, but I’m struggling to see the “building code” analogy as offering a workable solution.

Absolutely. We devs do do internal code reviews, but basically that is based partly on professionalism and partly on the “buddy system”.

There are people trying to find bugs before release (quality control), but as much as there is theoretically an adversarial relationship between the dev & QA teams, these days they typically sit together and drink beer together after work. Not really an adversarial system.

Not that I am knocking QA people; they do important work, because developers can get “occupational blindness”: we focus on the objective of the task, whereas QA focuses on where we might have fucked up.

(ETA… I am personally responsible for losing a client roughly GB£60,000 by making an error in a config file that did not trigger an error until it got into production. We reversed it in 1 minute, but the deployment process, with all tests running, took 37 minutes. That £ figure was based on the sales for the same time on the same day the year before.)

Indeed. I’m also in tech, but on the compliance side — I’m the one who coordinates with the external auditors who issue the reports and certifications for our operations and platforms, and I’m constantly carrying out internal “soft audits” to ensure the formal inspections will go smoothly.

The effort by our technology teams is mostly in planning and architecture, not in day-to-day dev. The designers are aware of the inevitable audit, and they make sure all our waterfowl are aligned in parallel fashion ahead of time. It’s a matter of professionalism, as you say — we could, technically, launch an untested upgrade, or whatever. Nothing and no one is stopping us, except for our own awareness that this might be bad down the road.

This is where the “building code” metaphor breaks down for me. The inspector makes sure all the wiring is insulated and all the bolts are tightened before anyone walks into the structure, because you don’t want the walkways to tear free and plunge people to their death, or whatever. This is perhaps a failure of imagination on the part of those upholding the equivalent tech standards, because while an insecure database probably isn’t going to kill people so directly, the consequences in terms of life impact can be significant when tech goes sideways. Yet because it isn’t so visceral, the urgency of having a third party pre-validate tech simply isn’t there.

To be fair, the OP contemplates “building codes” not for all technology generally but for “internet infrastructure,” probably mindful of the Cloudflare hiccup, and AWS before that, and CrowdStrike before that, ad nauseam. It might be plausible to require a higher standard of quality for the core technologies whose potential failures would have outsized impacts like these; we aren’t talking about official pre-inspection for, say, the Horse Armor Patch.

But even in that more limited scope, I don’t know if the benefit outweighs the downside, in terms of slower reactivity and greatly increased overhead. I’m willing to hear the case being made, but I’m skeptical.

The saying in much of engineering is that regulations are written in blood. After each accident, new regulations are written. Civil engineering is a very mature area. The regulations are similarly mature.

The Internet isn’t. Each of the recent outages (CrowdStrike, AWS, Cloudflare) came about from very different causes, and mostly causes that would be extremely hard to write any sort of compliance code for.

There are some aspects that could have helped, mostly down to having isolated environments within which to test and validate changes, be they configuration, operational, or code changes, before letting them out into the wild. But it isn’t as if these companies are filled with idiots. Given their individual critical places in the operation of Internet services, they are all well aware that they need to take significant care. It is the nature of these systems that tiny flaws or mistakes can amplify into global meltdown. That is the magic of software.
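The sort of check I mean can be as unglamorous as refusing to promote a config that doesn’t validate in an isolated environment first. A minimal sketch, with the schema, limits, and file name entirely made up for illustration:

```python
# A minimal sketch of a pre-deployment configuration gate: validate the change
# in isolation and refuse to roll it out if anything looks wrong. The required
# keys, limits, and file name here are invented for illustration.
import json
import sys

REQUIRED_KEYS = {"service", "max_connections", "timeout_seconds"}


def validate_config(path: str) -> list[str]:
    errors = []
    try:
        with open(path) as f:
            cfg = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"could not load {path}: {exc}"]

    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if cfg.get("max_connections", 0) <= 0:
        errors.append("max_connections must be positive")
    if not 0 < cfg.get("timeout_seconds", 0) <= 60:
        errors.append("timeout_seconds must be in (0, 60]")
    return errors


if __name__ == "__main__":
    problems = validate_config(sys.argv[1])
    if problems:
        print("refusing to deploy:", *problems, sep="\n  ")
        sys.exit(1)
    print("config ok, safe to promote to the next environment")
```

Of course, Cloudflare and CrowdStrike already have far more elaborate gates than this; the hard part is that the failure that gets through is rarely the one the gate was written to catch.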

If the Internet were a built structure it would be easier. But it isn’t. It is a dynamically changing thing. We don’t build it to code and then walk away for the next century.

Eventually what helps avoid disastrous outages are human processes: appropriate checks and quality gates that seek to catch the looming problem before it gets away. That can help with the outages that are due to changes. There are lots of other ways the whole mess can fail, and these keep quite a few people awake at night worrying.

These major outages are notable because they’re both rare and impactful. I think both of those facets are relevant.

They’re impactful, taking down large portions of the internet, because what they provide is so difficult to do that very few companies try to do it themselves. DDoS protection at the scale and reliability that Cloudflare provides is hard, so for most sites, satisfying a “building code” to provide robust, reliable DDoS protection is going to mean outsourcing a good chunk of infra to Cloudflare anyway.

And Cloudflare (or AWS, or pick your large service provider) are on the cutting edge of tech. They’re really pretty good at what they do, compared to, say, 10 years ago. Who’s going to write the building code for them? The engineers involved in those services are the only ones qualified to even understand how to do it properly, which is why they get paid obscene amounts of money. No regulatory agency is going to be able to come in and tell them how to do it.

I’m having a hard time understanding what you’re trying to regulate. “Internet infrastructure” is so broad a concept that there’s no singular “it” to regulate. Maybe a better question is, what is the goal? Reducing/stopping downtime? Eliminating single points of failure? Stopping cyberattacks? Holding companies accountable for failures beyond market pressures?

If you said we need a building code equivalent for electrical infrastructure, since it’s very important to our day-to-day lives, I’d respond that we do have some. There are codes for wiring and equipment, but there aren’t really laws specifying consequences for failure to deliver. Maybe there could be, but aside from mandating some amount of uptime and leaving it to the utility to figure out how to achieve it, I’m not sure specifying the exact equipment and deployment of said equipment is the best way to go about that. Same for the internet.

The other way the building code analogy breaks down is that the building code is only really concerned with life safety. The building code isn’t concerned with your furniture, portable appliances, finishes, or anything like that, even though those are important too. Your furnace may die after two years or your cheap roofing could rot away in the sun, and while those can be catastrophic to your living in the house, the building code has no say over them.

Also, as Cervaise said, it’s only inspected during construction and before occupancy. There’s no regular reevaluation unless you do significant remodeling, which in the case of the internet would mean reapplying for a building permit every couple of months or even weeks, and the review itself can take weeks or months. I would also argue that most of the internet’s infrastructure falls into the category of things the “code” wouldn’t apply to in the first place. Hence why I’m struggling to figure out where the “there” is here.

It’s interesting that jjakucyk brought up the life safety aspect of this issue. The National Fire Alarm and Signaling Code (NFPA 72) has started to address the problem of cloud-based services for the monitoring of fire alarm systems in its 2025 edition. “Auxiliary Service Providers” may already be using cloud-based services to process alarm signals from building fire alarm systems. The ability of the system to transmit a fire alarm signal to a monitoring station might be entirely dependent on Cloudflare, AWS, or some other service provider. Supporters of this architecture point out that these service providers have huge investments in hardware, emergency power, staffing, security, etc., that can’t be matched by traditional Central Stations. But there are few regulations to ensure this, and how would it be audited by independent agencies and regulatory bodies (e.g., UL)? Can I start up an internet service in my spare bedroom and provide these cloud-based services? (Actually, “ZonexandScout’s Pretty Good Cloud Services and Storm Door Company” sounds attractive.) Fire alarm installers, customers, and code officials have already had some very bad experiences with system-wide failures. The NFPA Technical Committees are wrestling with the problem.

Which they most certainly do.

Exactly this.

We’ve been building shelters for how many hundreds of thousands of years? And building codes are what, maybe a few hundred years old? We had to first develop the relevant experience by trial and error.

The internet is still young. Cloudflare (and AWS, etc.) are pioneers, and they are the ones doing the experiments and learning the hard lessons that may one day become an ISO standard.

Case in point: Yesterday’s outage was caused by a botched configuration update, but the reason that update was able to bring down their whole network is that Cloudflare needed to have a rapidly-syncing, globally-deployable configuration system in order to fight rapidly evolving botnets. It’s a double-edged sword. Lose that ability (to quickly update all their interconnected systems) and that means botnets can quickly take over regional or national networks and Cloudflare wouldn’t be able to keep up. But move too quickly and tread too carelessly, and a botched update could have an even worse impact than a botnet, as happened yesterday. The trick is to invent a balanced system, with even more layers of testing and checking than they currently have, to try to find the best of both worlds. It is not an easy job at all.
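That balancing act is roughly what staged (“canary”) rollouts try to approximate: push the change to a small slice of the fleet, watch the health signals, then widen. A toy sketch, with every name, threshold, and health check invented for illustration:

```python
# A toy sketch of a staged ("canary") configuration rollout: deploy to a few
# machines, check that error rates stay sane, then widen the blast radius.
# Every function and threshold here is invented for illustration; real systems
# also need automatic rollback, timeouts, and far better health signals.
import random
import time

FLEET = [f"edge-{i:03d}" for i in range(200)]
STAGES = [0.01, 0.05, 0.25, 1.0]      # fraction of the fleet per stage
ERROR_BUDGET = 0.02                   # abort if the error rate exceeds 2%


def apply_config(host: str, version: str) -> None:
    pass                              # stand-in for the real config push


def error_rate(hosts: list[str]) -> float:
    return random.uniform(0.0, 0.01)  # stand-in for real telemetry


def rollout(version: str) -> bool:
    done = 0
    for fraction in STAGES:
        target = int(len(FLEET) * fraction)
        for host in FLEET[done:target]:
            apply_config(host, version)
        done = target
        time.sleep(1)                 # let metrics settle (toy value)
        if error_rate(FLEET[:done]) > ERROR_BUDGET:
            print(f"aborting at {fraction:.0%}: error budget exceeded")
            return False
        print(f"stage {fraction:.0%} healthy ({done} hosts)")
    return True


if __name__ == "__main__":
    rollout("config-v2")
```

The tension described above lives in that settle-and-check step: every minute spent watching a canary is a minute the botnets get to keep moving.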

It makes the news because Cloudflare is so good at what they do (especially blocking bots) that half the Western internet depends on them. Before Cloudflare — and I remember this distinctly from firsthand experience — blocking bots and scrapers was a very very difficult problem that sysadmins would normally have to tackle on their own with a mishmash of filters and firewall rules, usually barely one step ahead. There were other companies at the time trying to offer similar protective services, but then Cloudflare came on the scene and ate them all overnight because it was so tremendously better and cheaper. Over the next decade or so, we went from “many small failures all over the world, all the time, but you don’t hear about them because it was only one website” to “it’s now trivially easy for a website to stay up 99%+ of the time, and bots can be blocked with a single click in Cloudflare — at a cost of an occasional half-day global outage affecting millions of sites, every few years”.

Is that a worthwhile tradeoff? Debatable, but I definitely don’t think we’re ready to make that call at a regulatory level quite yet. The filtering systems, machine learning, networks themselves, etc. are all still constantly evolving, perhaps even faster now than before, and there’s no way our regulatory framework can keep up with the careful balancing acts required. Certainly not with the current government.

These regulations are typically more effective (as in having sufficient wisdom and teeth, but minimal blast radius) as after-the-fact lessons, not premature speculation. Otherwise you end up with things like the “cookie laws” that, while well-intentioned, are too easily loopholed and corrupted/co-opted into cookie banner spam.

Meanwhile, there are systems like Enhanced 911 that apply to cell phone carriers. That might be an interesting case study… cell networks are arguably a more critical piece of infrastructure than Cloudflare (for the time being, but maybe not for much longer?), and yet the government was able to coerce the carriers & handset manufacturers into adopting a system for life-saving purposes.

Outside of specific and hyper-targeted regulations like that, I think it’s (much) too soon to try to regulate the infrastructure into compliance. Nobody really knows all the best practices yet; they’re constantly being invented day-to-day, and very often by Cloudflare itself. In the meantime, slow standards do eventually arise, such as ISO 27001 for cybersec stuff — which isn’t perfect, and has many flaws, but it does try to be a good balancing act between “better than nothing” and “slows down innovation and experimentation too much”.


* For anyone working (or just remotely interested) in networking/devops/etc., Cloudflare’s postmortems are some of the best technical analyses of failure events that I’ve ever had the pleasure of reading. They’re often written by their CEO, who was trained as a lawyer, became a developer, and is now leading the company. These postmortems are thoroughly readable for anyone with basic networking experience, incredibly informative, transparent, and honest. They do a lot both to engender trust in Cloudflare the company and to help collectively propel internet best practices forward for other companies. Even more than AWS, Cloudflare is often at the forefront of global hyperscale networking, bot detection, and attack mitigation — that is their main niche, vs AWS’s rental-computers business — and the rest of us mostly just follow in their footsteps and learn from their successes and mistakes. News outlets, especially non-tech journalists and especially TV journalists, very rarely bother to understand what happened. But Cloudflare’s first-party after-the-fact reports are superb and well worth the read.

I know that sounds like an ad (and I am one of their customers), but Cloudflare really is one of those incredible unicorns that does a very difficult job very well, very often. The market share they now have was hard-won through raw technical merit (and pricing). The level of thoughtfulness they put into the typical postmortem rivals, and often far exceeds, anything you’d typically read in a law, regulation, or standard. It is that sort of raw industry experience that will be needed to one day help create the codes in the first place.

[Moderating]
“Wouldn’t it be a good idea?” is not a factual question. Moving to IMHO.

Aside from the feasibility, no one wants to pay for that level of quality or wait that long for updates.

An old coworker reminded me of a story from work. It might be apocryphal. We were building IP-based phone systems where each server was 99.9% reliable and the system as a whole, with redundancy, was 99.999% reliable. When we went to sell it into China, the customers said it was too expensive. We said, “It costs this much because it is 99.999% reliable!” and their reply was “How much for 60% reliable?”
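For what it’s worth, the arithmetic behind those nines fits on a napkin; a rough sketch that assumes the redundant copies fail independently, which is precisely the assumption that outages like Cloudflare’s violate:

```python
# Back-of-envelope availability math, ignoring common-mode failures
# (which is what outages like Cloudflare's and AWS's actually were).
single = 0.999                                    # one server: 99.9%


def redundant(availability: float, copies: int) -> float:
    """Availability of N independent copies where any one copy suffices."""
    return 1 - (1 - availability) ** copies


print(f"1 copy  : {single:.4%}")                  # 99.9000%
print(f"2 copies: {redundant(single, 2):.4%}")    # 99.9999%, if truly independent
```

Truly independent copies would give six nines; correlated failures are presumably why the system in the story only claimed five, and why each extra nine costs so much more than the last.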

Perhaps a less apocryphal story from the Google SRE book: