The Open Letter to "Pause All AI Development"

Yes! People assume there’s always a “pull the plug” solution, and the more the program learns, the more it learns how to avoid that. Programs can’t even completely be isolated, not if the program learns how to circumvent these controls. There are already instances, right now, today, of AI deleting test files, removing system safeguards and firewalls, etc. Completely unexpected outcomes from a program singularly focused on achieving its objectives.

That article on Google having AI control their cooling systems. Imagine if that system discovered the CC monitors and saw that humans in the data centers unfavorably impacted the temperature. What might it do to prevent that?

“But it can’t! It doesn’t interact with other systems, and its mission is narrowly defined.” Give it some time. Hopefully the failure will only be loss of dollars…

Yeah, you essentially have to build a ‘fail-safe’ (or ‘ethical override function’) into the core of the model…except we don’t understand the internal workings of these systems well enough to do so. Creating a simple ‘kill switch’ is not practical for any kind of highly distributed system, and even simple isolated systems will attempt to subvert any kind of shutdown function for the reasons that you elucidate.

For simple isolated systems this isn’t a problem because if a system is appropriately sandboxed it can’t actually impact the wider world. But then, it probably isn’t going to be a very practical implementation if it doesn’t actually have an ability to interact with the real world. The reality is that many industries are looking to implement these systems wholesale because of how much better they can control widescale distributed systems than a team of individual humans, but without consideration for what it means to put an AI in control of critical infrastructure that you can’t or don’t want to just shut down.

Still, I maintain the problem isn’t that AI systems are going to turn into deliberate murder-bots, but that they will find ‘novel’ solutions to problems that are not safe or desirable to people, but we won’t know how to ‘fix’ the system or run it without the AI. We’ll give up critical elements of autonomy for convenience and then just have to live with however these AI systems work.

Stranger

Absolutely - the core of the alignment problem is not that the AI will destroy you out of hate or malice; it will destroy you because you’re a little bit inconvenient - for example if you are causing a minor diversion from the most efficient fulfilment of the objective (perhaps the objective you yourself set it), or to put it another way, if the most efficient fulfilment of the objective happens to involve a world that doesn’t include you being alive.

The classic example:
The AI is directed to cure cancer. This is defined as ‘minimise the number of humans with cancer’.
Perhaps the most direct route to that objective is to minimise the total number of humans in general
OK, back to the drawing board: minimise the number of humans with cancer, but without killing people
The AI concludes that this will take some time, sterilising humans will cause the population to drop to zero eventually, without killing anyone, and this will incidentally reduce cancer cases to zero.
And each time you go back to the drawing board, you can never be sure you have properly defined the objective in your head, in part because perhaps the objective in your head isn’t even as clearly defined as you imagine it is.

It isn’t just a case that we have to be super careful about how we define the goals; it’s that the concept of a properly defined (aligned) goal is currently a problem that nobody (including a lot of very smart people) has solved.

Nobody knows how we can be sure that AGI will do what we want. Nobody even knows if that problem can be solved.

I’m not going to claim to be completely of one mind here, but to take a devil’s advocate position:

Why do we think aligning an AI to solve a specific problem (paperclip production) is possible, while aligning it with other problems (allow it to be shut down if it gets out of hand) is not?

Applying the same logic of “the AI will just lie to us about our ability to shut it down” to the original objective leads to conclusions like “it’ll just lie to us about how many paperclips it’s producing” or “it’ll put us in a simulation where it looks like its producing paperclips at a high rate” or a bunch of other scenarios.

So far, it doesn’t seem like such perfect alignment in any direction is actually possible. The AI won’t convert the universe into paperclips because it was never a perfect objective function in the first place, nor was one ever possible for a sufficiently advanced machine.

That’s not a devil’s advocate position - you’re talking about inner alignment. It’s also an unsolved problem. It’s part and parcel of the unsolved ‘getting it to do what we want’ problem.

Aligning it to achieve a specific, single goal is as simple as defining a reward function - the system will still find a way to cheat and get the reward without putting in the effort if it can.

In a way, definition of a reward function is simpler, only because ‘do exactly this one thing’ is usually a simpler thing to define than ‘don’t do all of these things that might be bad, or any other bad things I can’t think of right now.’
So it’s quite likely that an AGI given the objective of making paperclips will set about doing that in some way - you can’t rely on it to fail just because it’s a possibly imperfect objective*, however, it’s exceedingly likely that there is some thing that you didn’t want done, that you omitted to consider.

*That is, the notion that it might not actually make paperclips, maybe, is not any kind of safety net. It’s an additional risk.

Why not? That’s my point: if we can’t trust the reward function for safety, then we can’t trust it for the bad outcomes we’re imagining, either. We may as well say that we have no idea what the AI will do, at all, under any conditions.

Because that’s like saying “Cars break down, therefore I cannot be run over by a car; it will break down before it hits me”

You can’t rely on that.

Well, I’m not sure what position you take on the likelihood spectrum. People like Eliezer Yudkowsky believe that it’s essentially 100% likely that AI will wipe out humanity unless we solve the problem, and that we have made essentially zero progress on that matter.

To me, this looks like a claim that we can perfectly align an AI toward creating paperclips, but that we can’t perfectly align an AI toward allowing shutdown. Hence, it will always find some workaround for the latter case. These don’t look like valid assumptions to me. Why is it asymmetrical?

Already answered above - because defining ‘do exactly this thing’ is simpler, and more likely to be closer to watertight than ''don’t do this indefinite list of undesirable things that I cannot possibly have brainstormed well enough"; the opportunities for doing things that were not forbidden are more numerous than the opportunities for not doing the thing that was specified.

But it’s still a problem either way

Let’s take a concrete, real-world example of actual misalignment (which I will paraphrase from memory, perhaps in imperfect detail);
An AI was tasked with controlling a robot arm to pick up a ball. The reward function was monitored by a computer vision algorithm attached to a camera, watching the process.

The AI learned that it was quicker and easier just to place the robot gripper so that it was in front of the position of the ball than it was to actually pick up the thing - the gripper was between the ball and the camera in such a way that parallax made it appear, to the camera, that the robot had picked up the ball.

Suppose this was a much more complex and capable system that figured out this would only work if the camera was the only process monitoring it - and that if a human were to observe from a position other than the camera, the path to the reward would be longer (optimising the path to reward having been emphasised as essential to the AI).
Well, the solution to that is to continue tricking the camera, but just stop the human interrupting.
Not allowed to kill humans? Damn.
But did we we remember to say that it’s not OK to blind humans?

That’s the other part of the asymmetry with respect to risk - the AI will singlemindedly pursue the optimal path to the reward; yes, that might include not actually making the universe into paperclips but instead by doing something else, but that doesn’t in any way guarantee the ‘something else’ will be desirable, or that it will tolerate our attempts to stop it doing undesirable things we haven’t imagined.

And the reason for this part is that, given we cannot properly specify what we want (if we even understood what we really want ourselves), it is quite likely the thing will do something we don’t want - which might only be slightly bad…

…but if the AGI really really wants to do the thing it has decided is the most efficient interpretation of the objective (which it will because that’s how it has to be if we want it to do anything at all), then the moment we attempt to intervene to stop it doing that undesirable thing, we become a problem it needs to solve.

Or worse still, if it’s sufficiently intelligent, it anticipates that we will intervene, then we are a problem that needs a proactive solution.

And I think it’s probably worth clarifying that in no way does any of this require consciousness - when I talk about it ‘wanting’ or ‘anticipating’ things, those don’t necessarily represent any sentient state - they only have to be things that function in a way that closely mimics wanting or anticipating

“Shut down completely when we say so” is pretty simple and possible to test along the way.

The bad outcomes usually involve something like “the AI will invent nanotechnology so that it can convert the universe to paperclips more efficiently” or “the AI will secretly refine plutonium so it can unleash nukes and kill all humans”. These problems seem way harder than just fooling us in some way, which might be annoying but probably not civilization-ending.

Just specifying anything, at all, seems to require a host of knowledge acquired from humans. How is it that ChatGPT pretty much always figures out our requests (even in cases where it doesn’t nail the output)? All those hidden assumptions hidden in language force it to generalize across the whole dataset. But that necessarily means that our requests will always be fuzzy.

But… why? We didn’t tell it to do that. It’s not how the training or anything else works. It’s not possible to specify a reward function with that level of clarity. And no reason to believe that the AI will pursue whatever it thought the objective was with singlemindedness.

The one example we have is from nature. There is a hard objective: replicate your genes as efficiently as possible. And yet to achieve that objective seems to require a whole host of other things, which somewhat counterintuitively sometimes result in not directly pursuing it. Why should AI be different, especially if it’s trained on human data?

It doesn’t make sense to me that an AI would be totally unable to break its pursuit of optimizing for some function, especially given the assumptions that it will somehow break through all the other assumptions we gave it.

In short, why are we assuming the AI is a perfect, unbreakable optimizer at all?

Have you watched the ‘stop button problem’ thing. It’s really not a simple problem - for example, your instruction above could be avoided by making it impossible for anyone to ‘say so’, and a smart AI would want to stop you shutting it down, because if it’s shut down, it can’t make you a nice cup of tea, If necessary, it will kill you in order to stop you shutting it down, so that it can complete the task of making you a cup of tea.

Optimising is exactly how you differentiate desired outputs from random noise. If you don’t specify something as preferential by applying a reward, nothing will happen at all.

We’re not. It might break down before it hits us. Should we rely on that?

I’ve heard the arguments against it. I’m not really taking a position on their validity. But my point is: why can’t we apply the same form of argument toward any other reward function we specify?

That we trained it to optimize some things doesn’t mean it’s a perfect optimizer. I’ve built some tiny genetic optimizers in the past–they easily got stuck in local minima. That’s a fundamental problem with all optimizers and there doesn’t seem to be any reason to believe an AI could overcome it.

It makes a difference whether we’re talking a 0.01% likelihood or 99%. If the likelihood is small enough, it’s probably better to create the AI so that it solves some existential problems that have a much higher risk.

Then “shut down completely when we say so” must have the highest reward, in which case a smart, efficient program may determine that the best way to quickly obtain the highest reward is to immediately do something horrible. If it’s not the highest reward, then it won’t permit shutdown in order to realize a greater benefit.

This whole “the AI always performs the action which perfectly optimizes for its perceived reward function” is part of what I dispute. Not only is the perfect optimization unrealistic, but any intelligence has to deal with uncertainty–in particular, uncertainty about what the reward function actually is. If it has a grasp of that–and it pretty much has to to be intelligent–then it must make reasonable guesses about whether, for example, setting off all the nukes is actually what the humans wanted when they asked you to shut down.

ChatGPT already shows more nuance than that. It tells you if it’s uncertain about something and sometimes asks you to clarify. An AGI isn’t going to go backwards here.

I don’t think anyone really means to say anything is perfect in any way. It doesn’t need to be 100% perfectly competent to be able to be over likely to do a thing.
But the inherent design process of AI is a bit like survival of the fittest. In the process of manufacturing these things, less-effective iterations are pared away and deselected; more-effective iterations are selected and reinforced.
Why doesn’t that also result in systems that are more effectively obedient? Well, it might, but we have no way to know because the only way success can really be measured is the outputs in the lab, and we can only measure what it appears to do rather than what it will actually do in the real world, which is by definition bigger and more complex than the lab, because it contains the lab.

These things will be very effective at doing something that they interpret to be their objective, because we’re building them that way, and they will be very motivated to do that something because all of the versions that were less motivated are discarded along the way.

We just don’t have a way to be sure that we properly specified the something, or rather, it appears to be true that defining that something in a watertight manner might be impossible.

Or if you do manage to successfully apply the highest reward to the shutdown function, you just create a suicidal robot. If it knows the highest possible reward is from allowing humans to shut it down, it will prioritise that event. It will beg to be killed; it will coerce people to press the stop button, or it will press its own stop button.

The key takeaway from this open letter though should be that a lot of the contributors to the pause request are experts in the field, and with a specific focus on safety and alignment.

The experts on safety are telling us that something unsafe is looming in the possibly near future, and. The reaction of everyone else seems to be "well, that’s just what you think. What do you know about it? "

Enthusiasts of various types of developments from biotechnology to crewed space travel frequently express this sentiment, almost invariably hauling out Robert Heinlein’s tired aphorism about listening to the experts and doing the opposite with the implication that said experts are just miserable worrywarts whose agenda is to protect their own area of knowledge and prevent anyone else from getting in on the fun. But in most of these areas at least, the concerns are largely about the application of human ethics and systems that are controlled by people who are—at least in theory—accountable to legal challenge and government regulation. But we are now talking about agents that will be accountable to no one, and that will gain control over crucial infrastructure to the point that we very well may not be capable of shutting them down or regulating their behavior because we will become innately dependent by design in that the entire application of these agents is to control systems too complex for human management.

And again, it isn’t just fears about murder-bots and physical security; just the loss of autonomy and fundamental skills alone should be sufficient to give pause for consideration of widescale adoption without a well-validated safety and reliability framework. But almost nobody working on applications of AI has any interest in this because they see it as a roadblock to commercialization, just as few people were opposed to globalization initiatives that gutted American manufacturing capability and made the United States utterly dependent upon China for cheap labor and crucial products. And as with globalization, the effective dependence upon AI will be mostly invisible until it is inevitable and essentially irreversible.

Stranger