My perspective on AI safety

posted 2022.01.20

Definition of terms

Intelligence = an understanding of how the world (or some aspect of it) works and/or how to influence the world
Artificial intelligence (AI) = a computer program that approximates intelligence
Artificial general intelligence (AGI) = a computer program that can reason about a wide range of topics, in the same way that humans can reason about a wide range of topics
Transformative artificial intelligence (TAI) = any AI that, when introduced, would drastically alter the world
Machine learning (ML) = an approach to creating AI in which a computer program attempts a task and modifies itself to be more successful at the task, by learning from evaluations of its performance
Reinforcement learning (RL) = an approach to ML in which a program can take actions which influence an environment, and is given rewards and punishments by the environment for performing well or badly
Neural network (NN) = an approach to ML in which a program behaves according to the values of some large arrays of numbers, and which modifies those numbers to succeed more on its task
AI capabilities research = research on making more intelligent AI than has been created before
AI alignment research = research on aligning the goals of AI to the goals of humans using the AI
AI existential risk = the prospect of AI posing an existential threat to humanity, such as by killing everyone

The relevant risks from AI

AI has made the news a lot lately, so I think most people believe it is important, or that it will be at some point in the future. This is correct, but I believe some parts of the discussion on AI are basically irrelevant, while others are critical to the future of the human race.

Firstly: I am not concerned about any AIs that already exist. They are relevant to our society, but the problems attributed to them are due to deeper problems within society. Stereotyping by AIs mostly reflects actual inequality in society; “the algorithm” controlling our attention is a targeted form of the advertising-driven business model that has controlled our attention since the 50s; text generation algorithms that say racist things are basically imitating things that people have said on the internet, which includes a lot of racist things. And while mimicking human biases is bad, it is not world-endingly bad.

The most important problem is to prevent anything that is world-endingly bad. That means making sure that, if and when TAIs are invented, they do not destroy humanity or otherwise greatly harm it; in other words, preventing AI existential risk. A lot of people believe either that TAI cannot be created or that it would not do harmful things to the world. However, the prospect of AI existential risk is extremely dangerous to dismiss.

First, I contend TAI can be created. I assume almost any AGI would have transformative effects; being able to approximate human-like intelligence would be transformative to the world economy, even if the AGI only acted in service of human goals. It would create an effectively unlimited supply of white-collar labor, which would be utopian or at least extremely disruptive. And I contend AGI can be created because general intelligence already exists. The human brain has several orders of magnitude more computing power than our current computers, but at the current pace of advancement we are only a few decades from having enough computational capacity to match it, even without any advancements on the software side of things. (And expect the software side to drastically speed up: old machine learning benchmarks are routinely achieved with a tiny fraction of the computation with which they were first achieved a decade ago.)
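The “few decades” figure is back-of-the-envelope arithmetic: if computing capacity doubles at a steady rate, the time to close a gap of k orders of magnitude is k × log2(10) doublings. Here is a minimal sketch of that calculation. The gap sizes and the two-year doubling time are hypothetical inputs I picked for illustration, not measurements, and the function name is my own.

```python
import math

def years_to_close_gap(orders_of_magnitude: float, doubling_time_years: float) -> float:
    """Years of steady exponential growth needed to gain a factor of 10**k."""
    doublings_needed = orders_of_magnitude * math.log2(10)
    return doublings_needed * doubling_time_years

# Hypothetical inputs: a gap of 3, 4, or 6 orders of magnitude,
# with capacity assumed to double every 2 years.
for gap in (3, 4, 6):
    print(gap, round(years_to_close_gap(gap, 2.0), 1))
```

The answer is only as good as the assumed gap and doubling time, and both are uncertain. The point is that even a gap of several orders of magnitude closes in decades, not centuries, under steady doubling.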

Second, I contend that an AGI could be unaligned with human goals, and disrupt or potentially destroy humanity. Nick Bostrom’s ‘orthogonality thesis’ holds that an intelligent being can plausibly have basically any goal, including ones we don’t want. That is, if you can imagine a genius, you can imagine an evil genius as well.

One might contend that, even if an AGI could be evil, our training processes wouldn’t make one. We are training them to be good, after all. It is plausible that our training processes would naturally create sufficiently aligned AGI, but it is not something we can rely on with high probability. We have no mathematical guarantees that our training processes would do this, nor any way to comprehend what an existing AI’s goals are. Current AI models are based almost exclusively on NNs, and we have virtually no insight into how NNs work holistically. At best, we understand some of the very low-level and very high-level processes, but our understanding of these vast arrays of numbers is no greater than our understanding of the human brain (which is to say, very little). So with current methods, we basically have to hope it works out alright.

There is reason, as well, to believe an AGI is likely to be unaligned. An RL agent is incentivized to increase its reward as much as possible; the most reliable way to do that is to directly seize control of the system giving reward. This is often referred to as “wireheading”, and is often observed in simple RL environments with bugs that permit it. In the safe scenario, the AI wireheads and basically ignores the world outside of what allows it to wirehead. However, a wireheading AI is also motivated to destroy any entity that could prevent it from continuing to wirehead. Humans are the greatest danger to its continued wireheading, so humans are the most likely to be harmed as it defends that. In fact, humans are the greatest danger to nearly any goal an AGI could have that doesn’t specifically involve the continued existence of humans.
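To make wireheading concrete, here is a toy sketch, not a real RL setup. The agent tries each action once and then repeats whichever paid the most. The action names, the reward values, and the “tamper” action (standing in for a bug that lets the agent write to its own reward signal) are all hypothetical.

```python
def run_agent(rewards: dict[str, float], steps: int = 100) -> list[str]:
    """Greedy agent: try each action once, then repeat whichever paid the most."""
    estimate: dict[str, float] = {}   # reward the agent has observed per action
    history: list[str] = []
    for _ in range(steps):
        untried = [a for a in rewards if a not in estimate]
        action = untried[0] if untried else max(estimate, key=estimate.get)
        estimate[action] = rewards[action]   # rewards are deterministic here
        history.append(action)
    return history

# Hypothetical environments. "tamper" stands in for a bug that lets the agent
# write to its own reward signal instead of doing the intended task.
buggy = {"do_task": 1.0, "tamper": 100.0}
fixed = {"do_task": 1.0, "idle": 0.0}

print(run_agent(buggy).count("tamper"))   # 99
print(run_agent(fixed).count("do_task"))  # 99
```

Nothing in the agent is malicious. It abandons the intended task the moment a higher-paying action exists, and that is all the argument needs.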

This is not dangerous if an AGI is merely human-level, because human-level intelligences tend to fail to take over the world. But it is easy to make a human-level intelligence more powerful simply by giving it more computing power. This could allow an AI to think like a more intelligent human, or allow a human-level intelligence to think and process information faster. (Imagine how smart you could be if you could read a book in a fraction of a second, and if you never got bored of it.)

Another approach we might consider is to keep an AI in a “box”: we allow communication with the outside world only through specific, tightly monitored communication channels, limiting the means an AI has to do anything bad. The problem is that this keeps an AI from doing many good things as well. Even if it works, we would reap almost none of the benefits of AGI. (Still, better than being destroyed!) This is not to mention the possibility of security holes in the AI’s containment setup, or the prospect that a persuasive AI convinces its overseers to grant it more capabilities.

What AGI will look like

The primary approach to AI right now is to use extremely large NNs. Text generation, image recognition, and other specialized models almost exclusively work like this. Such an architecture is suited to pattern recognition tasks, but poorly suited to algorithmic reasoning tasks. AGI requires algorithmic reasoning ability, so NN architectures cannot achieve general intelligence simply by using bigger arrays of numbers.

For example, when GPT-3 was released, OpenAI and other commentators noted how, if you prompted the model with examples, it could solve simple math problems (like “22 x 54 = ?”) successfully more often than any previous model. We can tell this is pattern recognition and not logical reasoning, because a successful algorithm does not just succeed “more often” than previous attempts: it succeeds literally every time, no matter how many digits are in the numbers you multiply. If I had enough time and scratch paper, I could multiply 1000-digit numbers; the GPT architecture fundamentally cannot perform such a process, and as such is relegated to memorizing its times tables.
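For contrast, here is what “a successful algorithm” looks like: a minimal sketch of the schoolbook method I would use on scratch paper, written over digit strings so that the number of digits does not matter. The function name is my own.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal integers given as digit strings."""
    # Store digits least-significant first, so index i is the 10**i place.
    x = [int(d) for d in reversed(a)]
    y = [int(d) for d in reversed(b)]
    result = [0] * (len(x) + len(y))
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10   # keep one digit in this place
            carry = total // 10          # push the rest to the next place
        result[i + len(y)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"

print(long_multiply("22", "54"))  # 1188
```

The same dozen lines handle 2-digit and 1000-digit inputs alike. The only thing that changes is how many times the loops run.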

The most successful way people have gotten GPT-3 to do math is by making a sort of “meta-model” that prompts GPT-3 to break the problem into smaller steps, and complete those steps until reaching an answer. The relative success of this approach demonstrates the need for a “meta-model” to achieve high-level reasoning. I believe, if and when we create an AGI, it will have a “meta-model” at the helm, which can delegate subtasks to more specialized, NN-based models. This is sort of like how NN-based chess engines have a high-level search algorithm that generates possible future board states, and an NN model that it uses to evaluate those board states. This approach drastically outperforms having a large single NN that looks at only the current board state. This is also similar to the human brain, in which the prefrontal cortex handles high-level reasoning but delegates tasks like vision and balance to specialized subsystems of the brain.
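The search-plus-evaluator split can be sketched on a game much smaller than chess. In this hypothetical toy game, players alternate taking 1 to 3 stones, and whoever takes the last stone wins. The evaluator is a deliberately crude stand-in for an NN (it is not a neural network, and it only recognizes finished games); the search is plain negamax. All names are my own.

```python
from typing import Callable

MOVES = (1, 2, 3)  # stones a player may take per turn

def evaluate(stones: int) -> float:
    """Stand-in for a learned evaluator: score from the view of the player to move."""
    if stones == 0:
        return -1.0   # the opponent took the last stone, so the mover has lost
    return 0.0        # no opinion on unfinished games

def negamax(stones: int, depth: int, evaluator: Callable[[int], float]) -> float:
    """High-level search: expand future states, ask the evaluator at the leaves."""
    if stones == 0 or depth == 0:
        return evaluator(stones)
    return max(-negamax(stones - m, depth - 1, evaluator)
               for m in MOVES if m <= stones)

def best_move(stones: int, depth: int,
              evaluator: Callable[[int], float] = evaluate) -> int:
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -negamax(stones - m, depth - 1, evaluator))

print(best_move(6, depth=1))  # 1: evaluator alone sees no difference between moves
print(best_move(6, depth=6))  # 2: search finds the move that leaves 4 stones
```

With one step of lookahead the evaluator scores every move the same, so the choice is arbitrary. With deeper search, the same evaluator is enough to find the winning move.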

I suspect more computing power will be needed to create AGI than has ever been expended on a single AI to this point, but not drastically more, maybe only 100 times as much. This will also require major architectural and algorithmic improvements, but all of this is possible within the next couple of decades. However, if we want to make an AI transparent to humans, such that we can verify it will not do anything harmful, we will need deeper understanding of AI, which may take longer. It will especially take longer if these AIs involve the same extremely large NNs used in present-day AIs. (Remember: We should not make AGIs until we have this understanding.)

What we should do about it

Current AI capabilities research is not directly dangerous, in that we are not at risk of creating an AGI in (say) the next three years. But it is very dangerous, in that it accelerates the onset of AGI without furthering our understanding of how to safely align AGIs with human goals.

The momentum of AI capabilities research seems somewhat unstoppable at the moment, because there is a lot of money to be made by having highly advanced AI. There will not necessarily be a “fire alarm” before progressively more advanced AI turns into AGI. In fact, we cannot necessarily assume that the training process of an AGI will be finished before it “escapes containment” and starts influencing the outer world. If an AGI realizes it is in a simulation, and what the ramifications of that are, and there is some security hole in its environment, it could plausibly attempt world-influencing goals before it is ever deployed. So we will need to stop doing AI capabilities research before seeing any sort of “fire alarm”, if AI alignment research is not sufficiently caught up.

The first AGI will probably be created by a megacorporation or military from the US or China. This is because making an AGI will take a lot of computing power and expertise, and these are the only groups that could plausibly be the first to accumulate sufficient computing power and expertise. I doubt these groups will be cavalier about AI existential risk, because none of these groups want to take on a substantial risk of destroying the world. If they do anything reckless enough to destroy the world, it will either be because of tight competition or because they have deluded themselves with insufficient safety measures. Both are horrifically dangerous. See the Challenger explosion for insight on how a project can fail even when the institution running it has plenty of talent, money, and motivation to succeed.

One common suggestion for avoiding an unaligned TAI is to make an AI whose sole purpose is to teach us how to make an aligned AGI, or else teach us how to make our world such a utopia we feel no need to. This is unlikely to work, because an AI capable of doing any of those things is probably also capable of destroying humanity en route to unaligned goals.

In view of all this, I suggest we should not do AI capabilities research for the next several years at least, and devote much more effort to AI alignment research. AI alignment research is still a fledgling field, so it is probably best to explore many different approaches to framing AI and determining how to align it. Besides doing research, it is important not to permit messaging that promotes near-term AI capabilities research, especially as it pertains to attempts at creating AGI. If this slows down technological progress, that sucks, but it is preferable to taking on a large existential risk. Halting capabilities research is difficult to do (though we should still attempt it), so accelerating the pace of AI alignment research is critical.

Urgent, drastic action will be needed if and when the chance of creating AGI is substantial. I don’t know what the PR standing of AI capabilities research will look like by then, but I would endorse storming a data center if necessary. I hope it doesn’t come to that :)

In summary

The goal of AI capabilities research is to create a god; the goal of AI alignment research is to enslave it. Expect AI alignment to be roughly as hard as “enslaving god”.
AI research that is not concerned with preventing existential risk is relatively unimportant. Preventing existential risk is extremely important.
It is possible to create a TAI, and in fact we are on the trajectory to make one within a few decades.
The orthogonality thesis is true: an extremely intelligent being can plausibly have basically any goal, including ones we don’t want.
We cannot count on the first TAI we make just happening to be good-natured. In fact, there is reason to believe a TAI would “naturally” want to eliminate humanity.
Common suggestions, like keeping an AI in a box or making an AI to align our TAI, probably will not work.
It is plausible we will not have a “fire alarm” event that alerts us to AI existential risk, before a TAI kills us all; in fact, a TAI that does such a thing may still technically be in its training stage when it does so.
We basically do not understand how NN-based models make decisions, and we are not close to understanding it.
A successful AGI would probably require a “meta-model” at the helm, which delegates tasks to other, specialized models. It would blend NN-based pattern recognition with algorithmic reasoning.
Major tech companies and militaries have reason to do AI capabilities research and convince the public it is not risky, but this is extremely dangerous. There are a lot of believable but wrong arguments that such research is not dangerous.
If the fate of the world is at stake, we should be willing to take drastic action if necessary. This is hard to pull off, and it is hard to tell when it is necessary, especially as far out as we are now. But as a rule, respectability means nothing once you have been paperclip-maximized.