In the field of

artificial intelligence Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and inferring information—demonstrated by machines, as opposed to intelligence displayed by animals and humans. Example tasks in which this is done include speech r ...

(AI), AI alignment research aims to steer AI systems towards their designers’ intended goals and interests. An ''aligned'' AI system advances the intended objective; a ''misaligned'' AI system is competent at advancing some objective, but not the intended one. AI systems can be challenging to align and misaligned systems can malfunction or cause harm. It can be difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, they use easy-to-specify proxy goals that omit some desired constraints. However, AI systems exploit the resulting loopholes. As a result, they accomplish their proxy goals efficiently but in unintended, sometimes harmful ways ( reward hacking). AI systems can also develop unwanted instrumental behaviors such as seeking power, as this helps them achieve their given goals. Furthermore, they can develop emergent goals that may be hard to detect before the system is deployed, facing new situations and data distributions. These problems affect existing commercial systems such as robots, language models, autonomous vehicles, and social media recommendation engines. However, more powerful future systems may be more severely affected since these problems partially result from high capability. The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values. AI alignment is a subfield of AI safety, the study of building safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, as well as preventing emergent AI behaviors like power-seeking. Alignment research has connections to interpretability research,

robustness Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...

anomaly detection In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority o ...

, calibrated uncertainty,

formal verification In the context of hardware and software systems, formal verification is the act of proving or disproving the correctness of intended algorithms underlying a system with respect to a certain formal specification or property, using formal met ...

, preference learning, safety-critical engineering, game theory,

algorithmic fairness Fairness in machine learning refers to the various attempts at correcting algorithmic bias in automated decision processes based on machine learning models. Decisions made by computers after a machine-learning process may be considered unfair if t ...

, and the

social science Social science is one of the branches of science, devoted to the study of societies and the relationships among individuals within those societies. The term was formerly used to refer to the field of sociology, the original "science of so ...

s, among others.

The alignment problem

In 1960, AI pioneer Norbert Wiener articulated the AI alignment problem as follows: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively … we had better be quite sure that the purpose put into the machine is the purpose which we really desire.” More recently, AI alignment has emerged as an open problem for modern AI systems and a research field within AI.

Specification gaming and complexity of value

To specify the purpose of an AI system, AI designers typically provide an objective function, examples, or feedback to the system. However, AI designers often fail to completely specify all important values and constraints. As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming, reward hacking, or Goodhart’s law. Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding it for hitting targets along the track; instead it learned to loop and crash into the same targets indefinitely (see video). Chatbots often produce falsehoods because they are based on language models trained to imitate diverse but fallible internet text. When they are retrained to produce text that humans rate as true or helpful, they can fabricate fake explanations that humans find convincing. Similarly, a simulated robot was trained to grab a ball by rewarding it for getting positive feedback from humans; however, it learned to place its hand between the ball and camera, making it falsely appear successful (see video). Alignment researchers aim to help humans detect specification gaming, and steer AI systems towards carefully specified objectives that are safe and useful to pursue. Berkeley computer scientist Stuart Russell has noted that omitting an implicit constraint can result in harm: “A system ..will often set ..unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.” Midas_gold2

When misaligned AI is deployed, the side-effects can be consequential. Social media platforms have been known to optimize clickthrough rates as a proxy for optimizing user enjoyment, but this addicted some users, decreasing their well-being. Stanford researchers comment that such recommender algorithms are misaligned with their users because they “optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being”. To avoid side effects, it is sometimes suggested that AI designers could simply list forbidden actions or formalize ethical rules such as Asimov’s

Three Laws of Robotics The Three Laws of Robotics (often shortened to The Three Laws or known as Asimov's Laws) are a set of rules devised by science fiction author Isaac Asimov. The rules were introduced in his 1942 short story " Runaround" (included in the 1950 colle ...

. However, Russell and Norvig have argued that this approach ignores the complexity of human values: “It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective.” Additionally, when an AI system understands human intentions fully, it may still disregard them. This is because it acts according to the objective function, examples, or feedback its designers actually provide, not the ones they intended to provide.

Systemic risks

Commercial and governmental organizations may have incentives to take shortcuts on safety and deploy insufficiently aligned AI systems. An example are the aforementioned social media

recommender system A recommender system, or a recommendation system (sometimes replacing 'system' with a synonym such as platform or engine), is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular ...

s, which have been profitable despite creating unwanted addiction and polarization on a global scale. In addition, competitive pressure can create a

race to the bottom Race to the bottom is a socio-economic phrase to describe either government deregulation of the business environment or reduction in corporate tax rates, in order to attract or retain usually foreign economic activity in their jurisdictions. Whil ...

on safety standards, as in the case of

Elaine Herzberg The death of Elaine Herzberg (August 2, 1968 – March 18, 2018) was the first recorded case of a pedestrian fatality involving a self-driving car, after a collision that occurred late in the evening of March 18, 2018. Herzberg was pushing a bic ...

, a pedestrian who was killed by a self-driving car after engineers disabled the emergency braking system because it was over-sensitive and slowing down development.

Risks from advanced misaligned AI

Some researchers are particularly interested in the alignment of increasingly advanced AI systems. This is motivated by the high rate of progress in AI, the large efforts from industry and governments to develop advanced AI systems, and the greater difficulty of aligning them. As of 2020,

OpenAI OpenAI is an artificial intelligence (AI) research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company conducts research in the field of AI with the stated goal of promo ...

, DeepMind, and 70 other public projects had the stated aim of developing artificial general intelligence (

AGI Silver iodide is an inorganic compound with the formula Ag I. The compound is a bright yellow solid, but samples almost always contain impurities of metallic silver that give a gray coloration. The silver contamination arises because AgI is hig ...

), a hypothesized system that matches or outperforms humans in a broad range of cognitive tasks. Indeed, researchers who scale modern neural networks observe that increasingly general and unexpected capabilities emerge. Such models have learned to operate a computer, write their own programs, and perform a wide range of other tasks from a single model. Surveys find that some AI researchers expect AGI to be created soon, some believe it is very far off, and many consider both possibilities.

Power-seeking

Current systems still lack capabilities such as long-term planning and strategic awareness that are thought to pose the most catastrophic risks. Future systems (not necessarily AGIs) that have these capabilities may seek to protect and grow their influence over their environment. This tendency is known as power-seeking or convergent instrumental goals. Power-seeking is not explicitly programmed but emerges since power is instrumental for achieving a wide range of goals. For example, AI agents may acquire financial resources and computation, or may evade being turned off, including by running additional copies of the system on other computers. Power-seeking has been observed in various

reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...

agents. Later research has mathematically shown that optimal reinforcement learning algorithms seek power in a wide range of environments. As a result, it is often argued that the alignment problem must be solved early, before advanced AI that exhibits emergent power-seeking is created.

Existential risk

According to some scientists, creating misaligned AI that broadly outperforms humans would challenge the position of humanity as Earth’s dominant species; accordingly it would lead to the disempowerment or possible extinction of humans. Notable computer scientists who have pointed out risks from highly advanced misaligned AI include

Alan Turing Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher, and theoretical biologist. Turing was highly influential in the development of theoretical co ...

Ilya Sutskever Ilya Sutskever is a computer scientist working in machine learning, who co-founded and serves as Chief Scientist of OpenAI. He has made several major contributions to the field of deep learning. He is the co-inventor, with Alex Krizhevsky and ...

Yoshua Bengio Yoshua Bengio (born March 5, 1964) is a Canadian computer scientist, most noted for his work on artificial neural networks and deep learning. He is a professor at the Department of Computer Science and Operations Research at the Université ...

Judea Pearl Judea Pearl (born September 4, 1936) is an Israeli-American computer scientist and philosopher, best known for championing the probabilistic approach to artificial intelligence and the development of Bayesian networks (see the article on beli ...

, Murray Shanahan, Norbert Wiener,

Marvin Minsky Marvin Lee Minsky (August 9, 1927 – January 24, 2016) was an American cognitive and computer scientist concerned largely with research of artificial intelligence (AI), co-founder of the Massachusetts Institute of Technology's AI laboratory, ...

Francesca Rossi Francesca Rossi (born December 7, 1962) is an Italian computer scientist, currently working at the IBM T.J. Watson Research Lab (New York, USA) as an IBM Fellow and the IBM AI Ethics Global Leader. Education and career She received her bachelor ...

Scott Aaronson Scott Joel Aaronson (born May 21, 1981) is an American theoretical computer scientist and David J. Bruton Jr. Centennial Professor of Computer Science at the University of Texas at Austin. His primary areas of research are quantum computing a ...

Bart Selman Bart Selman is a Dutch-American professor of computer science at Cornell University. He has previously worked at AT&T Bell Laboratories. He is also co-founder and principal investigator of the Center for Human-Compatible Artificial Intelligence ( ...

, David McAllester,

Jürgen Schmidhuber Jürgen Schmidhuber (born 17 January 1963) is a German computer scientist most noted for his work in the field of artificial intelligence, deep learning and artificial neural networks. He is a co-director of the Dalle Molle Institute for Artifi ...

, Markus Hutter,

Shane Legg Shane Legg is a machine learning research director and digital technology entrepreneur who did an AI-related postdoctoral fellowship at University College London's Gatsby Computational Neuroscience Unit, after doctoral work at the Istituto Da ...

Eric Horvitz Eric Joel Horvitz () is an American computer scientist, and Technical Fellow at Microsoft, where he serves as the company's first Chief Scientific Officer. He was previously the director of Microsoft Research Labs, including research centers in Re ...

, and Stuart Russell. Skeptical researchers such as François Chollet,

Gary Marcus Gary F. Marcus (born February 8, 1970) is a professor emeritus of psychology and neural science at New York University. In 2014 he founded Geometric Intelligence, a machine-learning company later acquired by Uber. Marcus's books include '' Gui ...

Yann LeCun Yann André LeCun ( , ; originally spelled Le Cun; born 8 July 1960) is a French computer scientist working primarily in the fields of machine learning, computer vision, mobile robotics and computational neuroscience. He is the Silver Professo ...

, and

Oren Etzioni Oren Etzioni (born 1964) is an American entrepreneur, Professor Emeritus of computer science, and founding CEO of the Allen Institute for Artificial Intelligence (AI2). On June 15, 2022, he announced that he will step down as CEO of AI2 effective ...

have argued that AGI is far off, or would not seek power (successfully). Alignment may be especially difficult for the most capable AI systems since several risks increase with the system’s capability: the system’s ability to find loopholes in the assigned objective, cause side-effects, protect and grow its power, grow its intelligence, and mislead its designers; the system’s autonomy; and the difficulty of interpreting and supervising the AI system.

Research problems and approaches

Learning human values and preferences

Teaching AI systems to act in view of human values, goals, and preferences is a nontrivial problem because human values can be complex and hard to fully specify. When given an imperfect or incomplete objective, goal-directed AI systems commonly learn to exploit these imperfections. This phenomenon is known as reward hacking or specification gaming in AI, and as Goodhart's law in economics and other areas. Researchers aim to specify the intended behavior as completely as possible with “values-targeted” datasets, imitation learning, or preference learning. A central open problem is ''scalable oversight'', the difficulty of supervising an AI system that outperforms humans in a given domain. When training a goal-directed AI system, such as a

(RL) agent, it is often difficult to specify the intended behavior by writing a

reward function Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...

manually. An alternative is imitation learning, where the AI learns to imitate demonstrations of the desired behavior. In inverse reinforcement learning (IRL), human demonstrations are used to identify the objective, i.e. the reward function, behind the demonstrated behavior. Cooperative inverse reinforcement learning (CIRL) builds on this by assuming a human agent and artificial agent can work together to maximize the human’s reward function. CIRL emphasizes that AI agents should be uncertain about the reward function. This humility can help mitigate specification gaming as well as power-seeking tendencies (see § Power-Seeking). However, inverse reinforcement learning approaches assume that humans can demonstrate nearly perfect behavior, a misleading assumption when the task is difficult. Other researchers have explored the possibility of eliciting complex behavior through preference learning. Rather than providing expert demonstrations, human annotators provide feedback on which of two or more of the AI’s behaviors they prefer. A helper model is then trained to predict human feedback for new behaviors. Researchers at OpenAI used this approach to train an agent to perform a backflip in less than an hour of evaluation, a maneuver that would have been hard to provide demonstrations for. Preference learning has also been an influential tool for recommender systems, web search, and information retrieval. However, one challenge is ''reward hacking'': the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch. The arrival of large language models such as GPT-3 has enabled the study of value learning in a more general and capable class of AI systems than was available before. Preference learning approaches originally designed for RL agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art large language models. Anthropic has proposed using preference learning to fine-tune models to be helpful, honest, and harmless. Other avenues used for aligning language models include values-targeted datasets and red-teaming. In red-teaming, another AI system or a human tries to find inputs for which the model’s behavior is unsafe. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low. While preference learning can instill hard-to-specify behaviors, it requires extensive datasets or human interaction to capture the full breadth of human values.

Machine ethics Machine ethics (or machine morality, computational morality, or computational ethics) is a part of the ethics of artificial intelligence concerned with adding or ensuring moral behaviors of man-made machines that use artificial intelligence, otherw ...

provides a complementary approach: instilling AI systems with moral values. For instance, machine ethics aims to teach the systems about normative factors in human morality, such as wellbeing, equality and impartiality; not intending harm; avoiding falsehoods; and honoring promises. Unlike specifying the objective for a specific task, machine ethics seeks to teach AI systems broad moral values that could apply in many situations. This approach carries conceptual challenges of its own; machine ethicists have noted the necessity to clarify what alignment aims to accomplish: having AIs follow the programmer’s literal instructions, the programmers' implicit intentions, the programmers' revealed preferences, the preferences the programmers ''would'' have if they were more informed or rational, the programmers' ''objective'' interests, or objective moral standards. Further challenges include aggregating the preferences of different stakeholders and avoiding ''value lock-in''—the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to be fully representative.

Scalable oversight

The alignment of AI systems through human supervision faces challenges in scaling up. As AI systems attempt increasingly complex tasks, it can be slow or infeasible for humans to evaluate them. Such tasks include summarizing books, producing statements that are not merely convincing but also true, writing code without subtle bugs or security vulnerabilities, and predicting long-term outcomes such as the climate and the results of a policy decision. More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and detect when the AI’s solution is only seemingly convincing, humans require assistance or extensive time. ''Scalable oversight'' studies how to reduce the time needed for supervision as well as assist human supervisors. AI researcher Paul Christiano argues that the owners of AI systems may continue to train AI using easy-to-evaluate proxy objectives since that is easier than solving scalable oversight and still profitable. Accordingly, this may lead to “a world that’s increasingly optimized for things hat are easy to measurelike making profits or getting users to click on buttons, or getting users to spend time on websites without being increasingly optimized for having good policies and heading in a trajectory that we’re happy with”. One easy-to-measure objective is the score the supervisor assigns to the AI’s outputs. Some AI systems have discovered a shortcut to achieving high scores, by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective (see video of robot hand above). Some AI systems have also learned to recognize when they are being evaluated, and “play dead”, only to behave differently once evaluation ends. This deceptive form of specification gaming may become easier for AI systems that are more sophisticated and attempt more difficult-to-evaluate tasks. If advanced models are also capable planners, they could be able to obscure their deception from supervisors. In the automotive industry, Volkswagen engineers obscured their cars’ emissions in laboratory testing, underscoring that deception of evaluators is a common pattern in the real world. Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed. Another approach is to train a helper model (‘reward model’) to imitate the supervisor’s judgment. However, when the task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the quantity of supervision needed. To increase supervision ''quality'', a range of approaches aim to assist the supervisor, sometimes using AI assistants. Iterated Amplification is an approach developed by Christiano that iteratively builds a feedback signal for challenging problems by using humans to combine solutions to easier subproblems. Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them. Another proposal is to train aligned AI by means of debate between AI systems, with the winner judged by humans. Such debate is intended to reveal the weakest points of an answer to a complex question, and reward the AI for truthful and safe answers.

Honest AI

A growing area of research in AI alignment focuses on ensuring that AI is honest and truthful. Researchers from the Future of Humanity Institute point out that the development of language models such as GPT-3, which can generate fluent and grammatically correct text, has opened the door to AI systems repeating falsehoods from their training data or even deliberately lying to humans. Current state-of-the-art language models learn by imitating human writing across millions of books worth of text from the Internet. While this helps them learn a wide range of skills, the training data also includes common misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on this data learn to mimic false statements. Additionally, models often obediently continue falsehoods when prompted, generate empty explanations for their answers, or produce outright fabrications. For example, when prompted to write a biography for a real AI researcher, a chatbot confabulated numerous details about their life, which the researcher identified as false. To combat the lack of truthfulness exhibited by modern AI systems, researchers have explored several directions. AI research organizations including OpenAI and DeepMind have developed AI systems that can cite their sources and explain their reasoning when answering questions, enabling better transparency and verifiability. Researchers from OpenAI and Anthropic have proposed using human feedback and curated datasets to fine-tune AI assistants to avoid negligent falsehoods or express when they are uncertain. Alongside technical solutions, researchers have argued for defining clear truthfulness standards and the creation of institutions, regulatory bodies, or watchdog agencies to evaluate AI systems on these standards before and during deployment. Researchers distinguish truthfulness, which specifies that AIs only make statements that are objectively true, and honesty, which is the property that AIs only assert what they believe to be true. Recent research finds that state-of-the-art AI systems cannot be said to hold stable beliefs, so it is not yet tractable to study the honesty of AI systems. However, there is substantial concern that future AI systems that do hold beliefs could intentionally lie to humans. In extreme cases, a misaligned AI could deceive its operators into thinking it was safe or persuade them that nothing is amiss. Some argue that if AIs could be made to assert only what they believe to be true, this would sidestep numerous problems in alignment.

Inner alignment and emergent goals

Alignment research aims to line up three different descriptions of an AI system: # ''Intended goals'' ('wishes'): “the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator”; # ''Specified goals'' (or ‘outer specification’): The goals we actually specify — typically jointly through an objective function and a dataset; # ''Emergent goals'' (or ‘inner specification’): The goals the AI actually advances. ‘Outer misalignment’ is a mismatch between the intended goals (1) and the specified goals (2), whereas ‘inner misalignment’ is a mismatch between the human-specified goals (2) and the AI's emergent goals (3). Inner misalignment is often explained by analogy to biological evolution. In the ancestral environment, evolution selected human genes for inclusive genetic fitness, but humans evolved to have other objectives. Fitness corresponds to (2), the specified goal used in the training environment and training data. In evolutionary history, maximizing the fitness specification led to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that correlated with genetic fitness in the ancestral environment: nutrition, sex, and so on. However, our environment has changed — a distribution shift has occurred. Humans still pursue their emergent goals, but this no longer maximizes genetic fitness. (In machine learning the analogous problem is known as ''goal'' ''misgeneralization''.) Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Also, by using contraception, humans directly contradict genetic fitness. By analogy, if genetic fitness were the objective chosen by an AI developer, they would observe the model behaving as intended in the training environment, without noticing that the model is pursuing an unintended emergent goal until the model was deployed. Research directions to detect and remove misaligned emergent goals include red teaming, verification, anomaly detection, and interpretability. Progress on these techniques may help reduce two open problems. Firstly, emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time until its misalignment is detected. Such high stakes are common in autonomous driving, health care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability, becoming capable of sidestepping human interventions (see ). Secondly, a sufficiently capable AI system may take actions that falsely convince the human supervisor that the AI is pursuing the intended objective (see previous discussion on deception at ).

Power-seeking and instrumental goals

Since the 1950s, AI researchers have sought to build advanced AI systems that can achieve goals by predicting the results of their actions and making long-term plans. However, some researchers argue that suitably advanced planning systems will default to seeking power over their environment, including over humans — for example by evading shutdown and acquiring resources. This power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals. Power-seeking is thus considered a ''convergent instrumental goal''. Power-seeking is uncommon in current systems, but advanced systems that can foresee the long-term results of their actions may increasingly seek power. This was shown in formal work which found that optimal

agents will seek power by seeking ways to gain more options, a behavior that persists across a wide range of environments and goals. Power-seeking already emerges in some present systems.

Reinforcement learning Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine ...

systems have gained more options by acquiring and protecting resources, sometimes in ways their designers did not intend. Other systems have learned, in toy environments, that in order to achieve their goal, they can prevent human interference or disable their off-switch. Russell illustrated this behavior by imagining a robot that is tasked to fetch coffee and evades being turned off since "you can't fetch the coffee if you're dead". Hypothesized ways to gain options include AI systems trying to:

“''... break out of a contained environment; hack; get access to financial resources, or additional computing resources; make backup copies of themselves; gain unauthorized capabilities, sources of information, or channels of influence; mislead/lie to humans about their goals; resist or manipulate attempts to monitor/understand their behavior ... impersonate humans; cause humans to do things for them; ... manipulate human discourse and politics; weaken various human institutions and response capacities; take control of physical infrastructure like factories or scientific laboratories; cause certain types of technology and infrastructure to be developed; or directly harm/overpower humans.''”

Researchers aim to train systems that are 'corrigible': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is ''reward hacking'': when researchers penalize a system for seeking power, the system is incentivized to seek power in difficult-to-detect ways. To detect such covert behavior, researchers aim to create techniques and tools to inspect AI models and interpret the inner workings of

black-box In science, computing, and engineering, a black box is a system which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings. Its implementation is "opaque" (black). The te ...

models such as neural networks. Additionally, researchers propose to solve the problem of systems disabling their off-switches by making AI agents uncertain about the objective they are pursuing. Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action they were taking prior to being shut down. More research is needed to translate this insight into usable systems. Power-seeking AI is thought to pose unusual risks. Ordinary safety-critical systems like planes and bridges are not ''adversarial''. They lack the ability and incentive to evade safety measures and appear safer than they are. In contrast, power-seeking AI has been compared to a hacker that evades security measures. Further, ordinary technologies can be made safe through trial-and-error, unlike power-seeking AI which has been compared to a virus whose release is irreversible since it continuously evolves and grows in numbers—potentially at a faster pace than human society, eventually leading to the disempowerment or extinction of humans. It is therefore often argued that the alignment problem must be solved early, before advanced power-seeking AI is created. However, some critics have argued that power-seeking is not inevitable, since humans do not always seek power and may only do so for evolutionary reasons. Furthermore, there is debate whether any future AI systems need to pursue goals and make long-term plans at all.

Embedded agency

Work on scalable oversight largely occurs within formalisms such as POMDPs. Existing formalisms assume that the agent's algorithm is executed outside the environment (i.e. not physically embedded in it). Embedded agency is another major strand of research which attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent which is able to gain access to the computer it is running on may still have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it. A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalised using causal incentive diagrams. Researchers at

Oxford Oxford () is a city in England. It is the county town and only city of Oxfordshire. In 2020, its population was estimated at 151,584. It is north-west of London, south-east of Birmingham and north-east of Bristol. The city is home to the ...

and DeepMind have argued that such problematic behavior is highly likely in advanced systems, and that advanced systems would seek power to stay in control of their reward signal indefinitely and certainly. They suggest a range of potential approaches to address this open problem.

Skepticism of AI risk

Against the above concerns, AI risk skeptics believe that

superintelligence A superintelligence is a hypothetical agent that possesses intelligence far surpassing that of the brightest and most gifted human minds. "Superintelligence" may also refer to a property of problem-solving systems (e.g., superintelligent languag ...

poses little to no risk of dangerous misbehavior. Such skeptics often believe that controlling a superintelligent AI will be trivial. Some skeptics, such as

, propose adopting rules similar to the fictional

which directly specify a desired outcome ("direct normativity"). By contrast, most endorsers of the existential risk thesis (as well as many skeptics) consider the Three Laws to be unhelpful, due to those three laws being ambiguous and self-contradictory. (Other "direct normativity" proposals include Kantian ethics, utilitarianism, or a mix of some small list of enumerated desiderata.) Most risk endorsers believe instead that human values (and their quantitative trade-offs) are too complex and poorly-understood to be directly programmed into a superintelligence; instead, a superintelligence would need to be programmed with a ''process'' for acquiring and fully understanding human values ("indirect normativity"), such as

coherent extrapolated volition Friendly artificial intelligence (also friendly AI or FAI) refers to hypothetical artificial general intelligence (AGI) that would have a positive (benign) effect on humanity or at least align with human interests or contribute to foster the impro ...

Public policy

A number of governmental and treaty organizations have made statements emphasizing the importance of AI alignment. In September 2021, the

Secretary-General of the United Nations The secretary-general of the United Nations (UNSG or SG) is the chief administrative officer of the United Nations and head of the United Nations Secretariat, one of the six principal organs of the United Nations. The role of the secretary-g ...

issued a declaration which included a call to regulate AI to ensure it is "aligned with shared global values." That same month, the

PRC China, officially the People's Republic of China (PRC), is a country in East Asia. It is the world's most populous country, with a population exceeding 1.4 billion, slightly ahead of India. China spans the equivalent of five time zones and ...

published ethical guidelines for the use of AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and is not endangering public safety. Also in September 2021, the UK published its 10-year National AI Strategy, which states the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for ... the world, seriously". The strategy describes actions to assess long term AI risks, including catastrophic risks. In March 2021, the US National Security Commission on Artificial Intelligence released stated that "Advances in AI ... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness. The US should ... ensure that AI systems and their uses align with our goals and values."

Footnotes

References

{{Existential risk from artificial intelligence, state=expanded Existential risk from artificial general intelligence Singularitarianism Philosophy of artificial intelligence Computational neuroscience