Deterministic message-passing: why isn’t this software engineering pattern used more often?

So I work at a company that deploys embedded systems. These systems are produced in millions of units. They fail sometimes, and it costs an inordinate amount of time to hunt down the faults - often several months, with several well-compensated developers working on the problem basically full time. (Even worse, when the faults get really serious, executives will schedule daily meetings on the problem that can drag on for months, and that can’t be cheap.)

So I read how SpaceX uses this pattern. It’s a pretty simple idea.

a. Divide your software into isolated modules. (they must be isolated in memory, OS isolation is optimal)
b. Each module cannot retain any state, except for modules whose sole role is to hold state.
c. Each module can only communicate with other modules through “messages”. (They can be implemented in efficient ways, but the message framework must guarantee messages can’t be corrupted by the sender.)
d. Each module must be completely deterministic - a given set of input messages has one and only one correct set of output messages. This notably means you have to set the FPU to IEEE mode, etc…
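To make rules (a)-(d) concrete, here’s a minimal sketch in Python of what one such module looks like. The dict-based message format and the names `flight_ctrl`, `"sensor"`, `"actuator"` are illustrative assumptions, not anything from the SpaceX write-up:

```python
def flight_ctrl(messages):
    """A stateless, deterministic module: the same list of input
    messages always yields the same list of output messages."""
    out = []
    for msg in messages:
        if msg["type"] == "sensor":
            # Pure computation only -- no globals, no clock, no RNG,
            # no hidden state carried over between calls.
            out.append({"type": "actuator", "value": msg["value"] * 2})
    return out
```

The key property is that the module is a pure function from input messages to output messages; everything else in the pattern falls out of that.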

Why do this? Because the message-passing framework has another purpose: any message can be copied as it is sent, or reinjected later. This means that if a system fails while it’s running, and you have it in debug mode saving all input messages (the hardware has to be specced fast enough to make this feasible), bugs can always be reproduced, thanks to the determinism property.
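The record/replay idea can be sketched in a few lines, assuming newline-delimited JSON as the log format (the real framework would use something faster; `echo` is just a trivial stand-in module):

```python
import json

def echo(messages):
    # Trivial stand-in module for demonstration.
    return list(messages)

def run_with_recording(module, messages, log_path):
    # Tap: copy every input message to a log as it is delivered.
    with open(log_path, "w") as log:
        for msg in messages:
            log.write(json.dumps(msg) + "\n")
    return module(messages)

def replay(module, log_path):
    # Reinject the logged messages; determinism guarantees the
    # original behavior (including the bug) is reproduced exactly.
    with open(log_path) as log:
        messages = [json.loads(line) for line in log]
    return module(messages)
```

Because the module is deterministic, `replay` gives you the failure on your desk instead of in the field.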

Almost all of the system can be unit tested as well - the only exception is the very low-level code whose role is to touch hardware registers, and even that should use the messaging framework…

A second major advantage: unit testing is a pain in the ass in low-level languages. If the system uses messages, you can build a test framework where the tests are written in a scripting language such as Ruby or Python. Furthermore, you can do better than plain unit tests - you can originally implement each algorithm in Python/Ruby/Matlab. (Because the message-passing framework is language agnostic, there is no issue with part of the codebase already being in a lower-level language.)
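Since a module is just messages-in/messages-out, the tests themselves collapse into plain data. A sketch, where `clamp_module` stands in for a module that would really live on the other side of the framework in C:

```python
def clamp_module(messages):
    # Stand-in for a C module; in practice the test harness would send
    # these messages across the framework instead of calling a function.
    return [{"type": "out", "value": max(0, min(100, m["value"]))}
            for m in messages if m["type"] == "in"]

# Each test case is pure data: input messages in, expected messages out.
CASES = [
    ([{"type": "in", "value": -5}],  [{"type": "out", "value": 0}]),
    ([{"type": "in", "value": 50}],  [{"type": "out", "value": 50}]),
    ([{"type": "in", "value": 999}], [{"type": "out", "value": 100}]),
]

def run_cases(module, cases):
    return all(module(inp) == expected for inp, expected in cases)
```

Nothing here cares what language the module under test is written in; the table of cases is the whole test.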

Then, once the algorithm is proven correct enough, reimplement it in a faster programming language and test it against a massive saved pile of input messages, ensuring it gives the same results. This is readily automatable as a regression test as well.
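The replay-regression step above might look like this, with `reference_filter` playing the Python prototype and `fast_filter` a stand-in for the eventual C/C++ rewrite (both hypothetical). Note that the determinism rule is what makes exact equality a sensible check:

```python
def reference_filter(messages):
    # The original prototype (the Python/Matlab version).
    return [{"avg": sum(m["samples"]) / len(m["samples"])} for m in messages]

def fast_filter(messages):
    # Stand-in for the reimplementation in a faster language.
    out = []
    for m in messages:
        total = 0
        for s in m["samples"]:
            total += s
        out.append({"avg": total / len(m["samples"])})
    return out

def regression(saved_inputs):
    # Replay the same saved input messages through both versions and
    # report any positions where the outputs disagree.
    ref = reference_filter(saved_inputs)
    new = fast_filter(saved_inputs)
    return [i for i, (a, b) in enumerate(zip(ref, new)) if a != b]
```

An empty mismatch list means the rewrite agrees with the prototype over the whole saved corpus.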

Memory management is separated from business logic. This would also prevent a great many of the memory corruption bugs of low-level programming, because to save any information, you have to pass a message to some other module whose job is to do it. That other module can be written carefully so it doesn’t leak memory…
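A toy sketch of such a state-holding module (the exception allowed by rule (b)): all of its state is mutated exclusively through messages, so the business-logic modules never touch storage directly. The message types `set`/`get`/`reply` are made up for illustration:

```python
def make_store():
    state = {}  # the only mutable state in the system lives here

    def store_module(messages):
        out = []
        for msg in messages:
            if msg["type"] == "set":
                state[msg["key"]] = msg["value"]
            elif msg["type"] == "get":
                out.append({"type": "reply", "key": msg["key"],
                            "value": state.get(msg["key"])})
        return out

    return store_module
```

Concentrating mutation in one small, heavily reviewed module is what confines the memory bugs.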

Maybe I need to work at a different company. Where I work, it’s all about right now, and rarely about any sort of method that might prevent the current software crisis.

This works okay if the number of messages is modest and if things can be made completely deterministic.

For things that require a large number of data changes per second, logging messages is going to be hell. It will impact the behavior of the system, possibly introducing problems or, worse, making the error temporarily disappear.

And a lot of software situations just cannot be made deterministic and repeatable. A lot.

It is extremely common that extra code to assist debugging just makes tracking down a small percentage of the bugs even harder.

I think OP’s prescriptions — well-defined modules and messaging protocols, minimal saved state — are good ideas for most software applications. However, I also think ftg is correct:

OP’s approach may cope well with simple bugs. But it’s complex bugs, especially those associated with asynchronous events, that are difficult to solve.

Back in the late Jurassic Era I was involved with a real-time programming project. Someone taught me an approach to real-time priority-driven software that avoids many races, and has other advantages. Ping me when you start a real-time programming thread in IMHO and I’ll outline the approach.