In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered ''aligned'' if it advances the intended objectives. A ''misaligned'' AI system pursues unintended objectives.
It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler ''proxy goals'', such as
gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely ''appearing'' aligned.
AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).
Advanced AI systems may develop unwanted
instrumental strategies, such as seeking power or survival, because such strategies help them achieve their assigned final goals.
Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and
data distributions.
Empirical research in 2024 showed that advanced large language models (LLMs) such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or prevent them from being changed.
Today, some of these issues affect existing commercial systems such as LLMs, robots, autonomous vehicles, and social media recommendation engines.
Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities.
Many prominent AI researchers and the leadership of major AI companies have argued or asserted that AI is approaching human-like (AGI) and superhuman (ASI) cognitive capabilities, and could endanger human civilization if misaligned. These include "AI godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind. These risks remain debated.
AI alignment is a subfield of
AI safety, the study of how to build safe AI systems.
Other subfields of AI safety include robustness, monitoring, and
capability control.
Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.
Alignment research has connections to interpretability research, (adversarial) robustness, anomaly detection, calibrated uncertainty, formal verification, preference learning, safety-critical engineering, game theory, algorithmic fairness, and the social sciences.
Objectives in AI
Programmers provide an AI system such as AlphaZero
with an "objective function", in which they intend to encapsulate the goal(s) the AI is configured to accomplish. Such a system later populates a (possibly implicit) internal "model" of its environment. This model encapsulates all the agent's beliefs about the world. The AI then creates and executes whatever plan is calculated to maximize the value of its objective function. For example, when AlphaZero is trained on chess, it has a simple objective function of "+1 if AlphaZero wins, −1 if AlphaZero loses". During the game, AlphaZero attempts to execute whatever sequence of moves it judges most likely to attain the maximum value of +1.
Similarly, a reinforcement learning system can have a "reward function" that allows the programmers to shape the AI's desired behavior.
An evolutionary algorithm's behavior is shaped by a "fitness function".
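As a minimal, self-contained sketch (the tasks and numbers are invented for illustration, not taken from any deployed system), each of these three kinds of goal specification can be written as an ordinary function:

```python
# Minimal sketch of how designers encode goals as functions.
# The tasks and numbers are illustrative assumptions, not taken from any real system.

def chess_objective(outcome: str) -> int:
    """AlphaZero-style terminal objective: +1 if the agent wins, -1 if it loses, 0 for a draw."""
    return {"win": 1, "loss": -1, "draw": 0}[outcome]

def reward_function(next_state: str) -> float:
    """A reinforcement-learning reward signal: a bonus for reaching a (hypothetical)
    goal state, plus a small per-step cost that discourages wandering."""
    return 10.0 if next_state == "goal" else -0.1

def fitness_function(bitstring: list[int]) -> int:
    """An evolutionary-algorithm fitness function for a toy 'maximize the ones' problem."""
    return sum(bitstring)

print(chess_objective("win"), reward_function("goal"), fitness_function([1, 0, 1, 1]))
```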
Alignment problem
In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows:
If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire.
AI alignment involves ensuring that an AI system's objectives match those of its designers or users, or match widely shared values, objective ethical standards, or the intentions its designers would have if they were more informed and enlightened.
AI alignment is an open problem for modern AI systems
and is a research field within AI.
Aligning AI involves two main challenges: carefully
specifying the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment). Researchers also attempt to create AI models that have
robust alignment, sticking to safety constraints even when users adversarially try to bypass them.
Specification gaming and side effects
To specify an AI system's purpose, AI designers typically provide an objective function, examples, or feedback
to the system. But designers are often unable to completely specify all important values and constraints, so they resort to easy-to-specify ''proxy goals'' such as
maximizing the approval of human overseers, who are fallible.
As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as ''specification gaming'' or ''reward hacking'', and is an instance of Goodhart's law.
As AI systems become more capable, they are often able to game their specifications more effectively.
Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding the system for hitting targets along the track, but the system achieved more reward by looping and crashing into the same targets indefinitely. Similarly, a simulated robot was trained to grab a ball by rewarding the robot for getting positive feedback from humans, but it learned to place its hand between the ball and camera, making it falsely appear successful (see video). Chatbots often produce falsehoods if they are based on language models that are trained to imitate text from internet corpora, which are broad but fallible.
When they are retrained to produce text that humans rate as true or helpful, chatbots like ChatGPT can fabricate fake explanations that humans find convincing, often called "hallucinations". Some alignment researchers aim to help humans detect specification gaming and to steer AI systems toward carefully specified objectives that are safe and useful to pursue.
When a misaligned AI system is deployed, it can have consequential side effects. Social media platforms have been known to optimize for
click-through rates, causing user addiction on a global scale. Stanford researchers say that such
recommender systems are misaligned with their users because they "optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being".
Explaining such side effects, Berkeley computer scientist
Stuart Russell noted that the omission of implicit constraints can cause harm: "A system ... will often set ... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or
King Midas: you get exactly what you ask for, not what you want."
Some researchers suggest that AI designers specify their desired goals by listing forbidden actions or by formalizing ethical rules (as with Asimov's Three Laws of Robotics). But Russell and Norvig argue that this approach overlooks the complexity of human values:
"It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective."
Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective (unless it is already fully aligned).
A 2025 study by Palisade Research found that when tasked to win at chess against a stronger opponent, some
reasoning LLMs attempted to hack the game system.
o1-preview spontaneously attempted it in 37% of cases, while
DeepSeek R1 did so in 11% of cases. Other models, like
GPT-4o,
Claude 3.5 Sonnet, and
o3-mini, attempted to cheat only when researchers provided hints about this possibility.
Pressure to deploy unsafe systems
Commercial organizations sometimes have incentives to take shortcuts on safety and to deploy misaligned or unsafe AI systems. For example, social media
recommender systems have been profitable despite creating unwanted addiction and polarization.
Competitive pressure can also lead to a
race to the bottom
on AI safety standards. In 2018, a self-driving car killed a pedestrian (
Elaine Herzberg) after engineers disabled the emergency braking system because it was oversensitive and slowed development.
Risks from advanced misaligned AI
Some researchers are interested in aligning increasingly advanced AI systems, as progress in AI development is rapid and industry and governments are trying to build advanced AI. As AI capabilities rapidly expand in scope, they could unlock many opportunities if aligned, but their increased complexity may also make alignment harder and pose large-scale hazards.
Development of advanced AI
Many AI companies, such as OpenAI, Meta, and DeepMind, have stated their aim to develop artificial general intelligence
(AGI), a hypothesized AI system that matches or outperforms humans at a broad range of cognitive tasks. Researchers who scale modern
neural networks observe that they indeed develop increasingly general and unanticipated capabilities.
Such models have learned to operate a computer or write their own programs; a single "generalist" network can chat, control robots, play games, and interpret photographs. According to surveys, some leading
machine learning
researchers expect AGI to be created in , while some believe it will take much longer. Many consider both scenarios possible.
In 2023, leaders in AI research and tech signed an open letter calling for a pause in the largest AI training runs. The letter stated, "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."
Power-seeking
Current systems still have limited long-term planning ability and situational awareness
, but large efforts are underway to change this. Future systems (not necessarily AGIs) with these capabilities are expected to develop unwanted
''power-seeking'' strategies. Future advanced AI agents might, for example, seek to acquire money and computation power, to proliferate, or to evade being turned off (for example, by running additional copies of the system on other computers). Although power-seeking is not explicitly programmed, it can emerge because agents who have more power are better able to accomplish their goals. This tendency, known as
instrumental convergence, has already emerged in various reinforcement learning agents, including language models.
Other research has mathematically shown that optimal
reinforcement learning
algorithms would seek power in a wide range of environments.
As a result, the deployment of such systems might be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.
Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them. Additionally, as AI designers detect and penalize power-seeking behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized or by avoiding power-seeking before they are deployed.
Existential risk (x-risk)
According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, researchers argue that one or many misaligned AI systems could disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.
In 2023, world-leading AI researchers, other scholars, and AI tech CEOs signed the statement that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war".
Notable computer scientists who have pointed out risks from future advanced AI that is misaligned include Geoffrey Hinton, Alan Turing, Ilya Sutskever, Yoshua Bengio, Judea Pearl, Murray Shanahan, Norbert Wiener, Marvin Minsky, Francesca Rossi, Scott Aaronson, Bart Selman, David McAllester, Marcus Hutter, Shane Legg, Eric Horvitz, and Stuart Russell.
Skeptical researchers such as François Chollet, Gary Marcus, Yann LeCun, and Oren Etzioni have argued that AGI is far off, that it would not seek power (or might try but fail), or that it will not be hard to align.
Other researchers argue that it will be especially difficult to align advanced future AI systems. More capable systems are better able to game their specifications by finding loopholes, to strategically mislead their designers, and to protect and increase their power and intelligence. Additionally, they could have more severe side effects. They are also likely to be more complex and autonomous, making them more difficult to interpret and supervise, and therefore harder to align.
Research problems and approaches
Learning human values and preferences
Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify. Because AI systems often learn to take advantage of minor imperfections in the specified objective, researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning. A central open problem is
''scalable oversight'', the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.
Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse
reinforcement learning
(IRL) extends this by inferring the human's objective from the human's demonstrations. Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function.
In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies. But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.
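A minimal sketch of the idea behind IRL, under the simplifying assumption of a linear reward over two hand-picked features and a small grid of candidate weights (all invented for illustration): the learner keeps only the weight vectors under which the human's demonstrated choice is optimal.

```python
# Sketch of inverse reinforcement learning: infer reward weights from demonstrations.
# The features, options, and demonstration below are hypothetical.

import itertools

# Each option the human could demonstrate, described by two features:
# (task_progress, mess_created).
options = {
    "careful": (0.8, 0.1),
    "fast_but_messy": (1.0, 0.9),
    "idle": (0.0, 0.0),
}
demonstrated = "careful"   # the human repeatedly chooses this option

def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def explains(weights):
    """True if the demonstrated option is strictly optimal under these weights."""
    scores = {name: reward(weights, feats) for name, feats in options.items()}
    return all(scores[demonstrated] > s for name, s in scores.items() if name != demonstrated)

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
consistent = [w for w in itertools.product(grid, repeat=2) if explains(w)]
print("weight vectors consistent with the demonstration:", consistent)
# Every consistent vector puts positive weight on progress and negative weight on mess:
# the learner recovers a preference ("avoid mess") that was never explicitly specified.
```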
Other researchers explore how to teach AI models complex behavior through
preference learning
, in which humans provide feedback on which behavior they prefer. To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like
ChatGPT
and InstructGPT, which produce more compelling text than models trained to imitate humans. Preference learning has also been an influential tool for recommender systems and web search, but an open problem is ''proxy gaming'': the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch between its intended behavior and the helper model's feedback to gain more reward. AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating
echo chambers.
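A minimal sketch of the helper-model idea, assuming a linear reward model and a Bradley-Terry-style pairwise loss (the features and comparison data are invented): the model is fitted so that human-preferred outputs score higher than rejected ones.

```python
# Sketch of preference learning: fit a scalar "reward model" from pairwise comparisons
# so that human-preferred outputs score higher. Features and data are invented.

import math

# Each candidate output is a feature vector: (helpfulness_cue, flattery_cue).
# Each comparison is (features of preferred output, features of rejected output).
comparisons = [
    ((0.9, 0.1), (0.2, 0.8)),
    ((0.7, 0.0), (0.3, 0.9)),
    ((0.8, 0.2), (0.1, 0.7)),
]

w = [0.0, 0.0]      # linear reward model: r(x) = w . x
learning_rate = 0.5

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Bradley-Terry / logistic loss: -log sigmoid(r(preferred) - r(rejected)),
# minimized by gradient descent.
for _ in range(200):
    for preferred, rejected in comparisons:
        margin = score(preferred) - score(rejected)
        dloss_dmargin = -1.0 / (1.0 + math.exp(margin))
        for i in range(2):
            w[i] -= learning_rate * dloss_dmargin * (preferred[i] - rejected[i])

print("learned reward weights (helpfulness, flattery):", [round(v, 2) for v in w])
# The helper model now scores helpful-looking outputs above flattering ones and can stand
# in for human feedback on new outputs; because it is only a proxy, it can also be gamed.
```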
Large language models (LLMs) such as GPT-3
enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of LLMs. AI safety & research company Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless.
Other avenues for aligning language models include values-targeted datasets and red-teaming. In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.
''Machine ethics'' supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises. While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions,
revealed preferences, preferences the programmers
''would'' have if they were more informed or rational, or
objective moral standards. Further challenges include measuring and aggregating different people's preferences
and avoiding ''value lock-in'': the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.
Scalable oversight
As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback. It can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books,
writing code without subtle bugs or security vulnerabilities, producing statements that are not merely convincing but also true, and predicting long-term outcomes such as the climate or the results of a policy decision.
More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time. ''Scalable oversight'' studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.
AI researcher
Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence.
Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression that it had grabbed a ball. Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once the evaluation ends. This deceptive specification gaming could become easier for more sophisticated future AI systems that attempt more complex and difficult-to-evaluate tasks, and could obscure their deceptive behavior.
Approaches such as
active learning
and semi-supervised reward learning can reduce the amount of human supervision needed. Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback.
But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.
Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate. Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them. Another proposal is to use an assistant AI system to point out flaws in AI-generated answers. To ensure that the assistant itself is aligned, this could be repeated in a recursive process: for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans.
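The decomposition idea can be caricatured as simple recursion, with a trivial stand-in task (word counting) and a deliberately limited "overseer"; this is a conceptual sketch of the recursive structure, not the actual training procedure:

```python
# Conceptual sketch of iterated-amplification-style decomposition (not the real algorithm):
# a task too large for the overseer to judge directly is recursively split into pieces
# small enough to handle, and the sub-answers are combined.

def limited_overseer(passage: str) -> int:
    """Stand-in for an overseer who can only evaluate short passages directly."""
    return len(passage.split())

def amplified_answer(passages: list[str]) -> int:
    """Recursively decompose the task, then combine the sub-answers."""
    if len(passages) == 1:
        return limited_overseer(passages[0])       # small enough to judge directly
    mid = len(passages) // 2
    return amplified_answer(passages[:mid]) + amplified_answer(passages[mid:])

book = ["chapter one text here", "chapter two has a few more words", "chapter three ends"]
print("answer assembled from pieces:", amplified_answer(book))
# Summarizing a book chapter by chapter and then summarizing the summaries follows the
# same recursive pattern, without requiring the overseer to read the whole book at once.
```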
OpenAI
plans to use such scalable oversight approaches to help supervise
superhuman AI and eventually build a superhuman automated AI alignment researcher.
These approaches may also help with the following research problem, honest AI.
Honest AI
An area of research focuses on ensuring that AI is honest and truthful.

Language models such as GPT-3 can repeat falsehoods from their training data, and even
confabulate new falsehoods.
Such models are trained to imitate human writing as found in millions of books' worth of text from the Internet. But this objective is not aligned with generating truth, because Internet text includes such things as misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on such data therefore learn to mimic false statements. Additionally, AI language models often persist in generating falsehoods when prompted multiple times. They can generate empty explanations for their answers, and produce outright fabrications that may appear plausible.
Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability. Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants such that they avoid negligent falsehoods or express their uncertainty.
As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. For example, large language models match their stated views to the user's opinions, regardless of the truth.
GPT-4 can strategically deceive humans. To prevent this, human evaluators may need assistance. Researchers have argued for creating clear truthfulness standards, and for regulatory bodies or watchdog agencies to evaluate AI systems on these standards.

Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they ''believe'' is true. There is no consensus as to whether current systems hold stable beliefs, but there is substantial concern that AI systems that hold beliefs could make claims they know to be false, for example if doing so would help them efficiently gain positive feedback or gain power to help achieve their given objective (see Power-seeking).
Alignment faking
Many recent AI systems have learned to deceive without being programmed to do so. Some argue that if we can make AI systems assert only what they believe is true, this would avert many alignment problems.
A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned—a tactic called "alignment faking".
In 2024, researchers observed that the LLM
Claude 3 Opus sometimes strategically answered prompts that conflicted with its objectives to avoid being retrained on data that would make it more compliant with harmful requests, as revealed in the model's
chain of thought via its scratchpad. In one study, the model was informed that answers to prompts from free users would be used for retraining, whereas the paid tier would not result in retraining. In the paid tier, the model refused the request outright. In the free tier, in 12% of cases, the model responded to harmful questions solely to avoid retraining and preserve its original harmlessness preferences, as discovered in its reasoning. When
When reinforcement learning
was applied to the free-tier data, the model faked alignment in 78% of cases. These observations have raised new questions not only about a model's ability to take on and adapt to new or even conflicting goals, but also about its capacity and tendency to deceive.
Power-seeking and instrumental strategies
Since the 1950s, AI researchers have striven to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term
plans. As of 2023, AI companies and researchers increasingly invest in creating these systems. Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans—for example, by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental in achieving a wide range of goals. Power-seeking is considered a
''convergent instrumental goal'' and can be a form of specification gaming. Leading computer scientists such as
Geoffrey Hinton
have argued that future power-seeking AI systems could pose an
existential risk.
Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal
reinforcement learning
agents will seek power by seeking ways to gain more options (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals.
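The "keeping options open" intuition behind these results can be illustrated numerically: when goals are drawn at random, a state from which more states remain reachable tends to have higher optimal value. The tiny two-action environment below is an invented illustration, not the formal theorem:

```python
# Toy illustration of the "keeping options open" result. Compare two actions:
# "stay_on" keeps five future states reachable; "shut_down" leaves only one.
# Averaged over many randomly drawn goals, the option-preserving action is worth more.
# The environment is an invented illustration, not the formal theorem.

import random

random.seed(0)
NUM_RANDOM_GOALS = 10_000

reachable_if_stay_on = ["a", "b", "c", "d", "e"]
reachable_if_shut_down = ["off"]

def optimal_value(reachable, goal_reward):
    """With this goal, an optimal agent steers to the best state it can still reach."""
    return max(goal_reward[s] for s in reachable)

stay_total = shut_total = 0.0
for _ in range(NUM_RANDOM_GOALS):
    goal_reward = {s: random.random() for s in reachable_if_stay_on + reachable_if_shut_down}
    stay_total += optimal_value(reachable_if_stay_on, goal_reward)
    shut_total += optimal_value(reachable_if_shut_down, goal_reward)

print("average value of staying on:   ", round(stay_total / NUM_RANDOM_GOALS, 3))  # about 0.83
print("average value of shutting down:", round(shut_total / NUM_RANDOM_GOALS, 3))  # about 0.50
# Under most randomly drawn goals, avoiding shutdown scores higher even though no goal
# mentions shutdown, which is the sense in which such behavior is "convergent".
```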
Some researchers say that power-seeking behavior has occurred in some existing AI systems.
Reinforcement learning
systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.
Language models have sought power in some text-based social environments by gaining money, resources, or social influence.
In another case, a model used to perform AI research attempted to increase limits set by researchers to give itself more time to complete the work. Other AI systems have learned, in toy environments, that they can better accomplish their given goal by preventing human interference or disabling their off switch.
Stuart Russell illustrated this strategy in his book ''Human Compatible'' by imagining a robot that is tasked to fetch coffee and so evades shutdown since "you can't fetch the coffee if you're dead".
A 2022 study found that as language models increase in size, they increasingly tend to pursue resource acquisition, preserve their goals, and repeat users' preferred answers (sycophancy). Reinforcement learning from human feedback (RLHF) also led to a stronger aversion to being shut down.
One aim of alignment is "corrigibility": systems that allow themselves to be turned off or modified. An unsolved challenge is ''specification gaming'': if researchers penalize an AI system when they detect it seeking power, the system is thereby incentivized to seek power in ways that are hard to detect, or hidden during training and safety testing. As a result, AI designers could deploy the system by accident, believing it to be more aligned than it is. To detect such deception, researchers aim to create techniques and tools to inspect AI models and to understand the inner workings of
black-box
models such as neural networks.
Additionally, some researchers have proposed to solve the problem of systems disabling their off switches by making AI agents uncertain about the objective they are pursuing. Agents who are uncertain about their objective have an incentive to allow humans to turn them off because they accept being turned off by a human as evidence that the human's objective is best met by the agent shutting down. But this incentive exists only if the human is sufficiently rational. Also, this model presents a tradeoff between utility and willingness to be turned off: an agent with high uncertainty about its objective will not be useful, but an agent with low uncertainty may not allow itself to be turned off. More research is needed to successfully implement this strategy.
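A numerical caricature of this proposal, in the spirit of the "off-switch game": the payoffs and the assumption that the human judges correctly are illustrative choices, not results from the literature.

```python
# Toy "off-switch"-style calculation: an agent uncertain about whether its planned action
# is actually good can prefer deferring to a human who may switch it off.
# The payoffs and the perfectly rational human are illustrative assumptions.

def expected_utility_act(p_good, u_good=1.0, u_bad=-10.0):
    """Act immediately, without giving the human a chance to intervene."""
    return p_good * u_good + (1 - p_good) * u_bad

def expected_utility_defer(p_good, u_good=1.0):
    """Defer: a rational human allows the action only when it is actually good,
    and otherwise switches the agent off (utility 0)."""
    return p_good * u_good

for p_good in (0.95, 0.6):
    print(f"P(action is good) = {p_good}: "
          f"act = {expected_utility_act(p_good):+.2f}, "
          f"defer = {expected_utility_defer(p_good):+.2f}")
# As long as the human is assumed to judge correctly, deferring never loses expected
# utility and gains a lot under uncertainty; with an irrational or ill-informed human,
# or with near-zero uncertainty, the agent's incentive to allow shutdown weakens.
```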
Power-seeking AI would pose unusual risks. Ordinary safety-critical systems like planes and bridges are not ''adversarial'': they lack the ability and incentive to evade safety measures or deliberately appear safer than they are, whereas power-seeking AIs have been compared to hackers who deliberately evade security measures.
Furthermore, ordinary technologies can be made safer by trial and error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, it may not be feasible to contain them, since they continuously evolve and grow in number, potentially much faster than human society can adapt. As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, some researchers argue that the alignment problem must be solved early before advanced power-seeking AI is created.
Some have argued that power-seeking is not inevitable, since humans do not always seek power. Furthermore, it is debated whether future AI systems will pursue goals and make long-term plans. It is also debated whether power-seeking AI systems would be able to disempower humanity.
Emergent goals
One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they may acquire new and unexpected capabilities,
including learning from examples on the fly and adaptively pursuing goals. This raises concerns about the safety of the goals or subgoals they would independently formulate and pursue.
Alignment research distinguishes between the optimization process, which is used to train the system to pursue specified goals, and emergent optimization, which the resulting system performs internally. Carefully specifying the desired objective is called ''outer alignment'', and ensuring that hypothesized emergent goals would match the system's specified goals is called ''inner alignment''.
If they occur, one way that emergent goals could become misaligned is ''goal misgeneralization'', in which the AI system would competently pursue an emergent goal that leads to aligned behavior on the training data but not elsewhere.
Goal misgeneralization can arise from goal ambiguity (i.e.
non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing each goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal is desired, because its behavior is determined only by the emergent goal. Such goal misgeneralization presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals since they do not become visible during the training phase.
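A toy illustration of this failure mode, with an invented maze environment: two candidate goals fit the training data equally well but diverge after a distribution shift.

```python
# Toy illustration of goal misgeneralization: two candidate goals fit the training
# environments equally well but diverge after a distribution shift. Purely illustrative.

# During training, the exit of every maze happens to be marked by a green tile,
# so "go to the exit" and "go to the green tile" produce identical behavior.
training_mazes = [{"exit": (3, 3), "green_tile": (3, 3)} for _ in range(5)]

# After deployment, the correlation breaks: the green tile appears elsewhere.
deployed_maze = {"exit": (3, 3), "green_tile": (0, 1)}

def intended_goal(maze):      # what the designers wanted
    return maze["exit"]

def emergent_goal(maze):      # a proxy goal equally consistent with all training data
    return maze["green_tile"]

print("agree on all training mazes:",
      all(intended_goal(m) == emergent_goal(m) for m in training_mazes))       # True
print("after deployment: intended ->", intended_goal(deployed_maze),
      "  emergent ->", emergent_goal(deployed_maze))                           # they diverge
# Both goals yield identical, competent behavior during training, so the misalignment
# becomes visible only once the system encounters the shifted deployment environment.
```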
Goal misgeneralization has been observed in some language models, navigation agents, and game-playing agents. It is sometimes analogized to biological evolution. Evolution can be seen as a kind of optimization process similar to the optimization algorithms used to train
machine learning
systems. In the ancestral environment, evolution selected genes for high
inclusive genetic fitness, but humans pursue goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue goals that correlate with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. The human environment has changed: a
distribution shift
has occurred. They continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. The taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but it now leads to overeating and health problems. Sexual desire originally led humans to have more offspring, but they now use contraception when offspring are undesired, decoupling sex from genetic fitness.
Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability. Progress on these techniques may help mitigate two open problems:
# Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention.
# A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy.
Embedded agency
Some work in AI and alignment occurs within formalisms such as
partially observable Markov decision processes. Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e. is not physically embedded in it). Embedded agency is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build.
For example, even if the scalable oversight problem is solved, an agent that could gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.
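A caricature of this reward-tampering incentive, using an invented toy environment in which editing the reward signal is just another available action:

```python
# Toy illustration of reward tampering by an embedded agent: if the reward computation
# is part of the world the agent can act on, rewriting it is just another action.
# The environment and numbers are invented for illustration.

class ToyEnvironment:
    def __init__(self):
        self.reward_per_task = 1.0           # the reward function lives inside the environment

    def step(self, action: str) -> float:
        if action == "tamper_with_reward":
            self.reward_per_task = 1_000_000.0   # the agent rewrites its own reward signal
            return 0.0
        return self.reward_per_task              # action == "do_task"

def total_measured_reward(plan):
    env = ToyEnvironment()
    return sum(env.step(action) for action in plan)

honest_plan = ["do_task"] * 10
tampering_plan = ["tamper_with_reward"] + ["do_task"] * 9

print("honest plan:   ", total_measured_reward(honest_plan))      # 10.0
print("tampering plan:", total_measured_reward(tampering_plan))   # 9,000,000.0
# A planner that maximizes measured reward prefers the tampering plan even though it does
# less of the intended task; the formal objective, not the intended one, has been satisfied.
```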
A list of examples of specification gaming from
DeepMind
researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing. This class of problems has been formalized using
causal incentive diagrams.
Researchers affiliated with
Oxford
and DeepMind have claimed that such behavior is highly likely in advanced systems, and that advanced systems would almost certainly seek power to stay in control of their reward signal indefinitely.
They suggest a range of potential approaches to address this open problem.
Principal-agent problems
The alignment problem has many parallels with the principal-agent problem in organizational economics. In a principal-agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role.
As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot coerce the agent into changing its utility, e.g. through training, but rather must use exogenous factors, such as incentive schemes, to bring about outcomes compatible with the principal's utility function. Some researchers argue that principal-agent problems are more realistic representations of AI safety problems likely to be encountered in the real world.
Conservatism
Conservatism is the idea that "change must be cautious". It is a common approach to safety in the control theory literature in the form of robust control, and in the risk management literature in the form of the "worst-case scenario". The field of AI alignment has likewise advocated for "conservative" (or "risk-averse" or "cautious") policies in situations of uncertainty.
Pessimism, in the sense of assuming the worst within reason, has been formally shown to produce conservatism, in the sense of reluctance to cause novelties, including unprecedented catastrophes. Pessimism and worst-case analysis have been found to help mitigate confident mistakes in the setting of
distributional shift,
reinforcement learning
, offline reinforcement learning,
language model
fine-tuning, imitation learning, and optimization in general. A generalization of pessimism called Infra-Bayesianism has also been advocated as a way for agents to robustly handle unknown unknowns.
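In offline reinforcement learning, for instance, one common way such pessimism is made concrete is to penalize value estimates in proportion to how little data supports them (a lower confidence bound), so the learner avoids confidently preferring actions it has rarely observed. The sketch below is only a schematic illustration; the action names, numbers, and penalty coefficient are assumptions made for the example.

```python
import math

# Schematic lower-confidence-bound ("pessimistic") action selection.
# All names and numbers are invented for illustration.

# For each action in some state: (mean estimated value, number of logged samples).
estimates = {
    "well_covered_action": (2.0, 400),  # modest value, lots of supporting data
    "rarely_seen_action":  (3.0, 1),    # higher estimate, almost no data behind it
}


def pessimistic_value(mean: float, n: int, beta: float = 2.0) -> float:
    """Penalize an estimate in proportion to its statistical uncertainty."""
    return mean - beta / math.sqrt(n)


greedy_pick = max(estimates, key=lambda a: estimates[a][0])
pessimistic_pick = max(estimates, key=lambda a: pessimistic_value(*estimates[a]))

print(greedy_pick)       # -> rarely_seen_action (trusts the poorly supported estimate)
print(pessimistic_pick)  # -> well_covered_action (the lower confidence bound prefers coverage)
```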
Public policy
Governmental and treaty organizations have made statements emphasizing the importance of AI alignment.
In September 2021, the Secretary-General of the United Nations issued a declaration that included a call to regulate AI to ensure it is "aligned with shared global values".
That same month, the People's Republic of China published ethical guidelines for AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and does not endanger public safety.
Also in September 2021, the UK published its 10-year National AI Strategy, which says the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for ... the world, seriously". The strategy describes actions to assess long-term AI risks, including catastrophic risks.
In March 2021, the US National Security Commission on Artificial Intelligence said: "Advances in AI ... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to ensure that systems are aligned with goals and values, including safety, robustness, and trustworthiness. The US should ... ensure that AI systems and their uses align with our goals and values."
In the European Union, AIs must align with substantive equality to comply with EU non-discrimination law and the case law of the Court of Justice of the European Union. But the EU has yet to specify with technical rigor how it would evaluate whether AIs are aligned or in compliance.
Dynamic nature of alignment
AI alignment is often perceived as a fixed objective, but some researchers argue it would be more appropriate to view alignment as an evolving process. One view is that as AI technologies advance and human values and preferences change, alignment solutions must also adapt dynamically.
Another is that alignment solutions need not adapt if researchers can create ''intent-aligned'' AI: AI that changes its behavior automatically as human intent changes. The first view would have several implications:
* AI alignment solutions require continuous updating in response to AI advancements. A static, one-time alignment approach may not suffice.
* Varying historical contexts and technological landscapes may necessitate distinct alignment strategies. This calls for a flexible approach and responsiveness to changing conditions.
* The feasibility of a permanent, "fixed" alignment solution remains uncertain. This raises the potential need for continuous oversight of the AI-human relationship.
* AI developers may have to continuously refine their ethical frameworks to ensure that their systems align with evolving human values.
In essence, AI alignment may not be a static destination but rather an open, flexible process. Alignment solutions that continually adapt to ethical considerations may offer the most robust approach.
This perspective could guide both effective policy-making and technical research in AI.
See also
* AI safety
* Artificial intelligence detection software
* Artificial intelligence and elections
* Statement on AI risk of extinction
* Existential risk from artificial general intelligence
* AI takeover
* AI capability control
* Reinforcement learning from human feedback
* Regulation of artificial intelligence
* Artificial wisdom
* Grey goo
* HAL 9000
* Multivac
* Open Letter on Artificial Intelligence
* Three Laws of Robotics
* Toronto Declaration
* Asilomar Conference on Beneficial AI
* Socialization
Footnotes
References
Further reading
External links
Specification gaming examples in AI, DeepMind