How much work is it to run an Internet platform?

Following on the discussion about Twitter, I wondered how much work it really is to keep a platform running. I’m not only thinking of social media platforms like Twitter, but also websites/intermediaries like Booking.com, Airbnb, and Amazon. For the latter kind it seems as if they merely (?) need to keep the software running once it is developed. So why do they need tens of thousands of coders and engineers?

Is the software that complicated? Do they still need to do much new development (I don’t see many major changes on websites like Booking.com or Amazon but maybe there is a lot of work at the backend)? Is there so much work in maintenance/updates? Or are they doing cutting edge research in AI to keep you coming back?

I’m not talking about support staff for human interaction, compliance, etcetera. The IT workforce alone seems huge for these platforms, particularly if you compare it to something like Mastodon, which - if I understand it correctly - runs on the efforts of a handful of people. Of course Mastodon is not as large and therefore does not need the upkeep of massive distributed server farms, but the code itself already seems to provide a lot of the Twitter functionality.

Yes. It consists of many, interacting parts. Things built on top of other things. Things interacting with other things. Things developed in-house, things bought and things available for free. Few pieces of which have been tested for every possible interaction and state, and most of which get more or less frequent updates for one reason or another.

So shit happens, and things get fixed as best one can without taking the whole system down to work on it for a few years.

Who knows. One thing is that if they wait until there’s an obvious need for new development, it might be too late. Another is that there are managers and coders aplenty who would be out of a job if they said “I guess we’ve completed all necessary new development”, and who will therefore argue strongly for the first point. A third is that a lot of what happens isn’t visible. Amazon for instance has a lot of people working on the best way to do searches that a) give you what you were searching for, b) give you what you actually needed and c) mix those with things you weren’t searching for but might realize you want when they pop up in your search results. As well as people figuring out how to best store all the information they are collecting on your shopping and browsing patterns.

Yes! For one thing no big platform is without bugs, and bugfixes sometimes create new issues. Stuff that isn’t made in house might get a security flaw fixed, forcing an update and that might break something else. People interact with your service through browsers that also keep changing. Etc. etc.

Mastodon has hundreds of instances, each of which requires the attention of one, two or a handful of people to maintain. And it provides none of the functionality that makes Twitter a (borderline) profitable company. It serves no ads, it doesn’t analyze user behavior, it makes no recommendations, it has a deliberately limited search system. It’s not a simple or uncomplicated system, but it lacks a lot of the interacting moving parts of something like Twitter, let alone Amazon, and as you add parts the possible problematic interactions and the need for maintenance grow exponentially.

A big challenge is keeping your web site current. Our school district platform, which is very limited in scope compared to, say, Amazon, needs constant updating, revising, tweaking, whatever, because incorrect information can be as bad as or worse than no information at all.

There are constant new features, improvements, redesigns, and fixes going on for any major site. Amazon in particular is constantly getting new content and features. I don’t use Booking but other sites like Hotels, Expedia, etc. are constantly being updated. If you use a site to do the same thing every time, you might not notice them.

I’m sure every significant site has a long list of new features, technical debt, or performance improvements waiting to be implemented.

Chrome is in version 107.

Firefox is in version 106.

Edge for Windows is in version 106.

Actually, the latest Edge is 106.0.1370.34. Each small iteration requires teams of people, from designers to spec writers to coders to testers to QA to manual writers to the people who train the help desks. Large version iterations require all of those, plus making sure the hundreds of smaller changes fit into the brand-new leaps and bounds. Millions of lines of code have to work on every device known in the world and allow integration with every major piece of software, whose makers are bombarding them with questions and suggestions and demands all the way through. And every time a new device is born, a new set of problems is introduced. Moreover, the future teams for versions 108 and 109 and 110 are separate from the people maintaining 106 and 107, the Windows teams are separate from the iOS teams and the other platform teams, and so forth and so on.

The best analogy is a body. The enormous growth from a baby to an adult is visible to all. Yet the adult changes further every minute of every day. The process can never stop.

I’m not sure whether I misunderstand you or whether we are talking at cross-purposes. My OP was intended to ask not about browsers working on hardware platforms, but about certain types of corporations/websites that are also called platforms.

I agree that rolling out a new browser is a lot of work, since it needs to work on all kinds of hardware, but websites (in theory, at least) only need to work on the major browsers, which do the work of matching them to the hardware.

Although browser development might not be a good parallel, these platforms do need to run on different kinds of hardware. Even if you own all your servers an expansion might require buying a new model of hardware, which might have unforeseen interactions with the OS and your services on top of it.

I would think security would involve a rather large team to keep the platform up and running, by implementing and maintaining a number of systems and processes. You can’t have a major platform constantly being hacked or sabotaged and keep your clients happy.

Constant development and maintenance requires a large infrastructure to prioritize, design, develop, test, implement, and monitor every little thing. If the platform does financial transactions, that’s probably a whole other large team. If it’s a platform that shares loads of content there will need to be teams responsible for validating, scrubbing, and keeping that content up to date - all involving lots of people and process.

Some of those are just as complicated as browsers. You don’t see the iterations, but some roll out new code every two-week sprint. The browsers are good comparisons that most people are familiar with.

It really, really depends. But I do think modern tech companies have become bloated from years of easy money and almost unlimited capital.

I have built and maintained websites. Static sites are not hard, and the biggest manpower sink will be editorial, keeping the content up to date. But the actual code behind such a site is trivial, and cloud hosting takes away most of the scaling challenges and need to manage your own server farm.

Dynamic sites vary greatly. Some are heavily integrated into cloud backends, have extensive APIs that need to be maintained, etc. Some of them are kludges that are the result of years of coding by mediocre developers, and need large teams just to keep them from falling down.

But there is a limit to how many people you need, and some web companies are really bad at finding it. So you get makework, or allow all kinds of experiments to keep people busy, or you build lots of marginal features no one cares about. Or your coders spend half their time playing foosball in the lounge.

The occasional recession tends to clear out all the deadwood and cause management to refocus priorities. Then things get healthier until the next tech bubble.

There’s a difference, of course, between building something and building something at Twitter scale. Twitter doesn’t host stuff in the cloud; it builds and maintains its own datacenters. The software has to work, and then it has to work reliably with millions of concurrent users.

To the OP, the core of Twitter, that is being able to tweet and read tweets and whatnot, plus the infrastructure required to do that at scale, is probably a small fraction of Twitter’s overall engineering efforts. In terms of lines of code, most of the software the company has written would be to support internal functions – content moderation, ad sales, FTC and legal compliance, accounting, etc. That’s how most big companies work. If you slash your content moderation team, you can also probably slash all of the engineering teams working on software to support content moderators.

Thing is, all of Twitter’s competitors are constantly innovating in those spaces. The Twitter ad sales engineers are effectively competing against the YouTube ad sales engineers. The longer Twitter lets those functions atrophy, the more its competitive position will erode.

You mentioned sites that have millions of users. The principles are the same, and scaled down only slightly. Browsers have the useful quality of publicly announcing their version numbers.

No huge, major website can afford to sit on their software. Like sharks, they swim or die. The exact number of people needed to keep them alive depends on the usual million individual circumstances but is always going to be much larger than outsiders realize.

All kinds of external changes require adjustments to a platform. I’m thinking of things like:

When they changed the date that parts of Canada and the USA went from daylight saving time to standard time.
All sorts of calendar and scheduling apps and data needed to be adjusted, up to and including payroll and security timekeeping. (At my work, when the clock changes, they have to do weird and interesting things with the overnight shift’s timesheets…) There’s a small sketch of that kind of clock arithmetic after these examples.

When Europe implemented its data storage regulations, which I believe require clients’ personal info to be stored on servers physically located in Europe. That might have required setting up a new physical facility, and then you have to transfer the appropriate data to it…
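
To make the daylight-saving example concrete: most of the grief comes from the gap between “eight hours later on the clock” and “eight hours of elapsed time”. Here is a minimal Python sketch of that gap on a fall-back night; the America/Toronto zone, the 2022 date and the eight-hour shift are just illustrative assumptions, not anyone’s actual payroll code:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("America/Toronto")  # clocks fell back at 02:00 on 2022-11-06

# An eight-hour overnight shift starting at 23:00 the night the clocks change.
shift_start = datetime(2022, 11, 5, 23, 0, tzinfo=tz)

# Wall-clock arithmetic: just add 8 hours to the local time.
wall_clock_end = shift_start + timedelta(hours=8)

# Elapsed-time arithmetic: add 8 real hours, then convert back to local time.
elapsed_end = (shift_start.astimezone(timezone.utc)
               + timedelta(hours=8)).astimezone(tz)

print(wall_clock_end)  # 2022-11-06 07:00:00-05:00
print(elapsed_end)     # 2022-11-06 06:00:00-05:00 (when the worker actually leaves)
```

Any change to the rules themselves, like the 2007 shift in North American DST dates, moves the nights on which those two answers disagree, which is why a timezone-database update ripples through every scheduling and payroll system built on top of it.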

Then there are changes of scope that sound simple until you start looking at the details:

What if you are a US company and you decide to sell in Canada? You suddenly have to be able to accept Canadian shipping addresses, which have alphanumeric postal codes in a specific format. And your user interface had better call them postal codes, not ZIP codes. Can your database handle it? And then there’s calculating shipping costs…
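
Even just “accept a Canadian postal code” hides work. Here is a rough sketch of the format check alone; the pattern follows the familiar A1A 1A1 shape with the letters that are never used, but it is only an illustration, not any real site’s validation, and it cannot tell you whether a code is actually assigned:

```python
import re

# Canadian postal codes look like "A1A 1A1". The letters D, F, I, O, Q and U
# never appear, and W and Z never appear in the first position.
POSTAL_CODE = re.compile(
    r"^[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z] ?\d[ABCEGHJ-NPRSTV-Z]\d$",
    re.IGNORECASE,
)

def looks_like_postal_code(value: str) -> bool:
    """Format check only; it cannot say whether the code really exists."""
    return bool(POSTAL_CODE.match(value.strip()))

print(looks_like_postal_code("K1A 0B1"))  # True  -- valid format
print(looks_like_postal_code("90210"))    # False -- that's a ZIP code
```

And that is only the input field; the database column, the address labels and the shipping-rate logic all need the same attention.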

Another example that is bedeviling Amazon right now:

The US ISBN Agency, a company called Bowker, has recently started handing out International Standard Book Numbers (ISBNs) that start with the digits 979. Previous ISBNs started with the digits 978.

An ISBN is a 13-digit number that uniquely identifies a specific edition (language, format, binding, etc) of a book. Every country has an agency that hands them out to book publishers.

There was an older format of ISBN that had only 10 digits. It is possible to automatically convert a 13-digit ISBN that starts with 978 to a 10-digit ISBN, and vice versa. But a 13-digit ISBN that starts with 979 cannot be converted to a 10-digit ISBN.
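
The 978 conversion is purely mechanical: drop the 978 prefix and the old check digit, then recompute the ISBN-10 check digit over the nine digits that remain. There is nothing equivalent for 979, because 10-digit ISBNs implicitly assume the 978 prefix. A rough sketch of the standard algorithm (obviously not Amazon’s internal code):

```python
def isbn13_to_isbn10(isbn13: str) -> str:
    """Convert a 978-prefixed ISBN-13 to its ISBN-10 form."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("not a 13-digit ISBN")
    if not digits.startswith("978"):
        # A 979 ISBN has no 10-digit equivalent; there is nothing to convert to.
        raise ValueError("only 978-prefixed ISBNs have an ISBN-10 form")
    core = digits[3:12]  # the nine digits between the prefix and the check digit
    total = sum(int(d) * w for d, w in zip(core, range(10, 1, -1)))
    check = (11 - total % 11) % 11  # ISBN-10 uses a mod-11 check digit
    return core + ("X" if check == 10 else str(check))

print(isbn13_to_isbn10("9780306406157"))  # -> 0306406152
# isbn13_to_isbn10("979...") raises ValueError for any 979-prefixed ISBN.
```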

It so happens that Amazon had been using 10-digit ISBNs internally to keep track of the books they sell. They had been auto-converting the 13-digit ISBNs they received to the older format for use in their internal ordering systems.

And then someone showed up wanting to sell a book with a 979 ISBN through Amazon’s system. The ISBN couldn’t be converted to Amazon’s internal format, and Amazon’s system couldn’t order it from other distributors. Suddenly customers were unable to order books through Amazon that they could order elsewhere. And publishers and authors were losing sales and they didn’t know why.

This is the sort of problem whose fix would require rebuilding major databases and internal systems. It could take years, partly to test it all, but also because existing systems have to keep running.

I believe that this has been partially fixed, but I still see reports of problems for external publishers distributing through Amazon. The problem doesn’t affect me yet, because I am Canadian and we aren’t handing out 979 ISBNs …yet.

Entropy requires no maintenance.

Any publicly facing internet platform is eternally a work in progress. The level of interdependence is mind-blowing. There are few solid standards, but many de facto standards and many evolving standards. The WWW is fast moving to a world of HTML5. What once worked now doesn’t, and stuff written today will stop working if you leave it alone. There are a huge number of platforms, development environments, libraries, APIs and what have you. Just taking an existing system and getting it working in the face of conflicting version requirements of all the parts can keep a room full of programmers employed full time. Just releasing updates into a live system is going to keep a lot of people going. Breaking the entire Internet with only the slightest glitch isn’t out of the question.

Software is a difficult beast; its scaling demands don’t look like those of most other systems.

A company like Amazon has little choice; it is so big that it is responsible for almost the entire shmozzle. Where Amazon had their massive stroke of commercial insight was in realising that they could sell that expertise and capability. So using Amazon Web Services, or the competing services, allows the minnows up to medium-sized fish in the sea to avoid a lot of the ongoing heartache and development time of getting a Web presence working. The scalability Amazon worked so hard on for their own offerings has enabled many others to thrive. But Amazon now needs to both develop and maintain the software and infrastructure not just for their store-front, but for all their AWS customers. Which is where they make their money.

Scaling for the big dogs is really hard. I saw an estimate that Twitter’s Internet bill is about $1.5 billion a year. Which is hard to get your head around. But gives an idea of the scale of communications. How you even construct a system to provide the services needed at those scales is a serious question all by itself. Twitter probably went through a number of iterations of internal architecture coping with total failures of scalability as the business grew, and this won’t have stopped.

Running a web-facing system that serves a small clientèle on a single box isn’t hard. Once the traffic saturates that one box, you suddenly get into a new world. And as your customer base expands, it gets worse. People’s expectations change, and what was once cutting edge is now mundane. Making stuff just work in a manner that even the most techno-illiterate can manage takes effort, and making it just work at scale is even harder.

Companies like Squarespace provide a really valuable intermediate capability. Great for small operations to get something working. But Squarespace still does the heavy lifting. Somebody needs to.

Thanks for all the replies, those are really insightful.

When you design a system, you make technical choices. Many of those are compromises between speed and robustness. A given database system will work correctly when you grow from 10,000 to 40,000 transactions a second, but at 43,017 transactions it will suddenly not have enough memory and will revert to a mode where things are swapped to disk, and suddenly the database throughput will plummet and the website will grind to a halt. In a sufficiently complex system, this may not be testable in advance.

This blog entry by Twitter (written before the current debacle – I suggest you read it soon) gives an idea of the kind of re-architecture and re-engineering that needs to be done just to ensure a high-capacity site adapts to ever-increasing demand. The diagrams are very high-level; I assure you each of the illustrated boxes has a few other diagrams of its own.

Amazon is in a class by itself because they handle physical stuff and a complex logistical chain. To us, it’s just a site where we click and a driver shows up the next day. But their systems need to provide for suppliers, site customers, warehouse employees, delivery (both staff and contractors), FBA merchants, help desk staff, credit card payment processors, fraud detection staff, data center employees. Oh, and also Prime Video and Prime Music stuff. Each of these stakeholders or participants has their own collection of web pages that provide different services (an FBA merchant in Canada needs a page to generate sales tax reports that take into account the tax rates and policies of each province; a support staff person needs a page to reset the passwords of a user in Germany who is covered by GDPR; a warehouse employee needs a page to enter that the box they just received from a supplier contains 15 beanie babies, and not 16 as declared). All of that can suddenly need to handle more requests if there’s a pandemic and everybody needs to buy toilet paper or an actress dies and everybody wants to see her movies. Oh, and please make sure the inventory shown on screen (across hundreds of data centers) accurately reflects what’s available in the hundreds of warehouses.

The original Amazon store that sold books and then other stuff eventually needed the kind of re-architecting described above, just to keep going and growing. They decided to create a flexible infrastructure for virtual processing and storage across data centers, which eventually became Amazon Web Services and was offered as cloud infrastructure to other companies. AWS is now a significant part of Amazon’s revenue and represents something like 74% of Amazon’s profit.