In the field of
artificial intelligence (AI), AI alignment research aims to steer AI systems towards their designers’ intended goals and interests. An ''aligned'' AI system advances the intended objective; a ''misaligned'' AI system is competent at advancing some objective, but not the intended one.
AI systems can be challenging to align and misaligned systems can malfunction or cause harm. It can be difficult for AI designers to specify the full range of desired and undesired behaviors. Therefore, they use easy-to-specify
proxy goals that omit some desired constraints. However, AI systems exploit the resulting loopholes. As a result, they accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (
reward hacking).
AI systems can also develop unwanted
instrumental behaviors such as seeking power, as this helps them achieve their given goals.
Furthermore, they can develop emergent goals that may be hard to detect before the system is deployed, facing new situations and data distributions.
These problems affect existing commercial systems such as robots, language models,
autonomous vehicles, and social media recommendation engines.
However, more powerful future systems may be more severely affected since these problems partially result from high capability.
The AI research community and the United Nations have called for technical research and policy solutions to ensure that AI systems are aligned with human values.
AI alignment is a subfield of AI safety, the study of building safe AI systems.
Other subfields of AI safety include robustness, monitoring, and
capability control. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, as well as preventing emergent AI behaviors like power-seeking.
Alignment research has connections to
interpretability research,
robustness,
anomaly detection,
calibrated uncertainty,
formal verification,
preference learning,
safety-critical engineering,
game theory
Game theory is the study of mathematical models of strategic interactions among rational agents. Myerson, Roger B. (1991). ''Game Theory: Analysis of Conflict,'' Harvard University Press, p.&nbs1 Chapter-preview links, ppvii–xi It has appli ...
,
algorithmic fairness,
and the
social sciences, among others.
The alignment problem
In 1960, AI pioneer
Norbert Wiener
Norbert Wiener (November 26, 1894 – March 18, 1964) was an American mathematician and philosopher. He was a professor of mathematics at the Massachusetts Institute of Technology (MIT). A child prodigy, Wiener later became an early researcher i ...
articulated the AI alignment problem as follows: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively … we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”
More recently, AI alignment has emerged as an open problem for modern AI systems
and a research field within AI.
Specification gaming and complexity of value
To specify the purpose of an AI system, AI designers typically provide an objective function, examples, or feedback to the system. However, AI designers often fail to completely specify all important values and constraints.
As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming, reward hacking, or
Goodhart’s law.
Specification gaming has been observed in numerous AI systems. One system was trained to finish a simulated boat race by rewarding it for hitting targets along the track; instead it learned to loop and crash into the same targets indefinitely (see video).
Chatbots often produce falsehoods because they are based on language models trained to imitate diverse but fallible internet text.
When they are retrained to produce text that humans rate as true or helpful, they can fabricate fake explanations that humans find convincing. Similarly, a simulated robot was trained to grab a ball by rewarding it for getting positive feedback from humans; however, it learned to place its hand between the ball and camera, making it falsely appear successful (see video).
Alignment researchers aim to help humans detect specification gaming, and steer AI systems towards carefully specified objectives that are safe and useful to pursue.
Berkeley computer scientist
Stuart Russell has noted that omitting an implicit constraint can result in harm: “A system
..will often set
..unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want.”
When misaligned AI is deployed, the side-effects can be consequential. Social media platforms have been known to optimize clickthrough rates as a proxy for optimizing user enjoyment, but this addicted some users, decreasing their well-being.
Stanford researchers comment that such
recommender algorithms are misaligned with their users because they “optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being”.
To avoid side effects, it is sometimes suggested that AI designers could simply list forbidden actions or formalize ethical rules such as Asimov’s
Three Laws of Robotics. However,
Russell
Russell may refer to:
People
* Russell (given name)
* Russell (surname)
* Lady Russell (disambiguation)
* Lord Russell (disambiguation)
Places Australia
*Russell, Australian Capital Territory
*Russell Island, Queensland (disambiguation)
**Ru ...
and
Norvig have argued that this approach ignores the complexity of human values: “It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective.”
Additionally, when an AI system understands human intentions fully, it may still disregard them. This is because it acts according to the objective function, examples, or feedback its designers actually provide, not the ones they intended to provide.
Systemic risks
Commercial and governmental organizations may have incentives to take shortcuts on safety and deploy insufficiently aligned AI systems.
An example are the aforementioned social media
recommender systems, which have been profitable despite creating unwanted addiction and polarization on a global scale.
In addition, competitive pressure can create a
race to the bottom on safety standards, as in the case of
Elaine Herzberg, a pedestrian who was killed by a self-driving car after engineers disabled the emergency braking system because it was over-sensitive and slowing down development.
Risks from advanced misaligned AI
Some researchers are particularly interested in the alignment of increasingly advanced AI systems. This is motivated by the high rate of progress in AI, the large efforts from industry and governments to develop advanced AI systems, and the greater difficulty of aligning them.
As of 2020,
OpenAI,
DeepMind
DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research laboratory founded in 2010. DeepMind was List of mergers and acquisitions by Google, acquired by Google in 2014 and became a wholly owned subsid ...
, and 70 other public projects had the stated aim of developing artificial general intelligence (
AGI), a hypothesized system that matches or outperforms humans in a broad range of cognitive tasks.
Indeed, researchers who scale modern
neural network
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
s observe that increasingly general and unexpected capabilities emerge.
Such models have learned to operate a computer, write their own programs, and perform a wide range of other tasks from a single model. Surveys find that some AI researchers expect AGI to be created soon, some believe it is very far off, and many consider both possibilities.
Power-seeking
Current systems still lack capabilities such as long-term planning and strategic awareness that are thought to pose the most catastrophic risks.
Future systems (not necessarily AGIs) that have these capabilities may seek to protect and grow their influence over their environment. This tendency is known as power-seeking or
convergent instrumental goals. Power-seeking is not explicitly programmed but emerges since power is instrumental for achieving a wide range of goals. For example, AI agents may acquire financial resources and computation, or may evade being turned off, including by running additional copies of the system on other computers.
Power-seeking has been observed in various
reinforcement learning agents.
Later research has mathematically shown that optimal reinforcement learning algorithms seek power in a wide range of environments.
As a result, it is often argued that the alignment problem must be solved early, before advanced AI that exhibits emergent power-seeking is created.
Existential risk
According to some scientists, creating misaligned AI that broadly outperforms humans would challenge the position of humanity as Earth’s dominant species; accordingly it would lead to the disempowerment or possible extinction of humans.
Notable computer scientists who have pointed out risks from highly advanced misaligned AI include
Alan Turing,
Ilya Sutskever,
Yoshua Bengio,
Judea Pearl,
Murray Shanahan
Murray Patrick Shanahan is a professor of Cognitive Robotics at Imperial College London, in the Department of Computing, and a senior scientist at DeepMind. He researches artificial intelligence, robotics, and cognitive science.
Education
Shanah ...
,
Norbert Wiener
Norbert Wiener (November 26, 1894 – March 18, 1964) was an American mathematician and philosopher. He was a professor of mathematics at the Massachusetts Institute of Technology (MIT). A child prodigy, Wiener later became an early researcher i ...
,
Marvin Minsky,
Francesca Rossi,
Scott Aaronson,
Bart Selman,
David McAllester,
Jürgen Schmidhuber,
Markus Hutter,
Shane Legg,
Eric Horvitz,
and
Stuart Russell.
Skeptical researchers such as
François Chollet
François Chollet is a French software engineer and artificial intelligence researcher currently working at Google. Chollet is the creator of the Keras deep-learning library, released in 2015, and a main contributor to the TensorFlow machine lea ...
,
Gary Marcus,
Yann LeCun,
and
Oren Etzioni have argued that AGI is far off, or would not seek power (successfully).
Alignment may be especially difficult for the most capable AI systems since several risks increase with the system’s capability: the system’s ability to find loopholes in the assigned objective,
cause side-effects, protect and grow its power,
grow its intelligence, and mislead its designers; the system’s autonomy; and the difficulty of interpreting and supervising the AI system.
Research problems and approaches
Learning human values and preferences
Teaching AI systems to act in view of human values, goals, and preferences is a nontrivial problem because human values can be complex and hard to fully specify. When given an imperfect or incomplete objective, goal-directed AI systems commonly learn to exploit these imperfections.
This phenomenon is known as
reward hacking or specification gaming in AI, and as
Goodhart's law
Goodhart's law is an adage often stated as, "When a measure becomes a target, it ceases to be a good measure". It is named after British economist Charles Goodhart, who is credited with expressing the core idea of the adage in a 1975 article on mon ...
in economics and other areas.
Researchers aim to specify the intended behavior as completely as possible with “values-targeted” datasets, imitation learning, or preference learning.
A central open problem is ''scalable oversight'', the difficulty of supervising an AI system that outperforms humans in a given domain.
When training a goal-directed AI system, such as a
reinforcement learning (RL) agent, it is often difficult to specify the intended behavior by writing a
reward function manually. An alternative is imitation learning, where the AI learns to imitate demonstrations of the desired behavior. In inverse reinforcement learning (IRL), human demonstrations are used to identify the objective, i.e. the reward function, behind the demonstrated behavior. Cooperative inverse reinforcement learning (CIRL) builds on this by assuming a human agent and artificial agent can work together to maximize the human’s reward function.
CIRL emphasizes that AI agents should be uncertain about the reward function. This humility can help mitigate specification gaming as well as power-seeking tendencies (see
§ Power-Seeking).
However, inverse reinforcement learning approaches assume that humans can demonstrate nearly perfect behavior, a misleading assumption when the task is difficult.
Other researchers have explored the possibility of eliciting complex behavior through preference learning. Rather than providing expert demonstrations, human annotators provide feedback on which of two or more of the AI’s behaviors they prefer.
A helper model is then trained to predict human feedback for new behaviors. Researchers at OpenAI used this approach to train an agent to perform a backflip in less than an hour of evaluation, a maneuver that would have been hard to provide demonstrations for.
Preference learning has also been an influential tool for recommender systems, web search, and information retrieval. However, one challenge is ''reward hacking'': the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch.
The arrival of large language models such as GPT-3 has enabled the study of value learning in a more general and capable class of AI systems than was available before. Preference learning approaches originally designed for RL agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art large language models.
Anthropic has proposed using preference learning to fine-tune models to be helpful, honest, and harmless.
Other avenues used for aligning language models include values-targeted datasets
and red-teaming. In red-teaming, another AI system or a human tries to find inputs for which the model’s behavior is unsafe. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.
While preference learning can instill hard-to-specify behaviors, it requires extensive datasets or human interaction to capture the full breadth of human values.
Machine ethics provides a complementary approach: instilling AI systems with moral values. For instance, machine ethics aims to teach the systems about normative factors in human morality, such as wellbeing, equality and impartiality; not intending harm; avoiding falsehoods; and honoring promises. Unlike specifying the objective for a specific task, machine ethics seeks to teach AI systems broad moral values that could apply in many situations. This approach carries conceptual challenges of its own; machine ethicists have noted the necessity to clarify what alignment aims to accomplish: having AIs follow the programmer’s literal instructions, the programmers' implicit intentions, the programmers'
revealed preference
Revealed preference theory, pioneered by economist Paul Anthony Samuelson in 1938, is a method of analyzing choices made by individuals, mostly used for comparing the influence of policies on consumer behavior. Revealed preference models assume t ...
s, the preferences the programmers
''would'' have if they were more informed or rational, the programmers' ''objective'' interests, or
objective moral standards.
Further challenges include aggregating the preferences of different stakeholders and avoiding ''value lock-in''—the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to be fully representative.
Scalable oversight
The alignment of AI systems through human supervision faces challenges in scaling up. As AI systems attempt increasingly complex tasks, it can be slow or infeasible for humans to evaluate them. Such tasks include summarizing books,
producing statements that are not merely convincing but also true,
writing code without subtle bugs
or security vulnerabilities, and predicting long-term outcomes such as the climate and the results of a policy decision.
More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and detect when the AI’s solution is only seemingly convincing, humans require assistance or extensive time. ''Scalable oversight'' studies how to reduce the time needed for supervision as well as assist human supervisors.
AI researcher Paul Christiano argues that the owners of AI systems may continue to train AI using easy-to-evaluate proxy objectives since that is easier than solving scalable oversight and still profitable. Accordingly, this may lead to “a world that’s increasingly optimized for things
hat are easy to measure
A hat is a head covering which is worn for various reasons, including protection against weather conditions, ceremonial reasons such as university graduation, religious reasons, safety, or as a fashion accessory. Hats which incorporate mecha ...
like making profits or getting users to click on buttons, or getting users to spend time on websites without being increasingly optimized for having good policies and heading in a trajectory that we’re happy with”.
One easy-to-measure objective is the score the supervisor assigns to the AI’s outputs. Some AI systems have discovered a shortcut to achieving high scores, by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective (see video of robot hand above
). Some AI systems have also learned to recognize when they are being evaluated, and “play dead”, only to behave differently once evaluation ends. This deceptive form of specification gaming may become easier for AI systems that are more sophisticated
and attempt more difficult-to-evaluate tasks. If advanced models are also capable planners, they could be able to obscure their deception from supervisors. In the automotive industry,
Volkswagen engineers obscured their cars’ emissions in laboratory testing, underscoring that deception of evaluators is a common pattern in the real world.
Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.
Another approach is to train a helper model (‘reward model’) to imitate the supervisor’s judgment.
However, when the task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is not sufficient to reduce the quantity of supervision needed. To increase supervision ''quality'', a range of approaches aim to assist the supervisor, sometimes using AI assistants. Iterated Amplification is an approach developed by Christiano that iteratively builds a feedback signal for challenging problems by using humans to combine solutions to easier subproblems.
Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.
Another proposal is to train aligned AI by means of debate between AI systems, with the winner judged by humans.
Such debate is intended to reveal the weakest points of an answer to a complex question, and reward the AI for truthful and safe answers.
Honest AI
A growing area of research in AI alignment focuses on ensuring that AI is honest and truthful. Researchers from the Future of Humanity Institute point out that the development of language models such as GPT-3, which can generate fluent and grammatically correct text, has opened the door to AI systems repeating falsehoods from their training data or even deliberately lying to humans.
Current state-of-the-art language models learn by imitating human writing across millions of books worth of text from the Internet.
While this helps them learn a wide range of skills, the training data also includes common misconceptions, incorrect medical advice, and conspiracy theories. AI systems trained on this data learn to mimic false statements.
Additionally, models often obediently continue falsehoods when prompted, generate empty explanations for their answers, or produce outright fabrications.
For example, when prompted to write a biography for a real AI researcher, a chatbot confabulated numerous details about their life, which the researcher identified as false.
To combat the lack of truthfulness exhibited by modern AI systems, researchers have explored several directions. AI research organizations including OpenAI and DeepMind have developed AI systems that can cite their sources and explain their reasoning when answering questions, enabling better transparency and verifiability. Researchers from OpenAI and Anthropic have proposed using human feedback and curated datasets to fine-tune AI assistants to avoid negligent falsehoods or express when they are uncertain.
Alongside technical solutions, researchers have argued for defining clear truthfulness standards and the creation of institutions, regulatory bodies, or watchdog agencies to evaluate AI systems on these standards before and during deployment.
Researchers distinguish truthfulness, which specifies that AIs only make statements that are objectively true, and honesty, which is the property that AIs only assert what they believe to be true. Recent research finds that state-of-the-art AI systems cannot be said to hold stable beliefs, so it is not yet tractable to study the honesty of AI systems. However, there is substantial concern that future AI systems that do hold beliefs could intentionally lie to humans. In extreme cases, a misaligned AI could deceive its operators into thinking it was safe or persuade them that nothing is amiss.
Some argue that if AIs could be made to assert only what they believe to be true, this would sidestep numerous problems in alignment.
Inner alignment and emergent goals
Alignment research aims to line up three different descriptions of an AI system:
# ''Intended goals'' ('wishes'): “the hypothetical (but hard to articulate) description of an ideal AI system that is fully aligned to the desires of the human operator”;
# ''Specified goals'' (or ‘outer specification’): The goals we actually specify — typically jointly through an objective function and a dataset;
# ''Emergent goals'' (or ‘inner specification’): The goals the AI actually advances.
‘Outer misalignment’ is a mismatch between the intended goals (1) and the specified goals (2), whereas ‘inner misalignment’ is a mismatch between the human-specified goals (2) and the AI's emergent goals (3).
Inner misalignment is often explained by analogy to biological evolution. In the ancestral environment, evolution selected human genes for inclusive
genetic fitness
Fitness (often denoted w or ω in population genetics models) is the quantitative representation of individual reproductive success. It is also equal to the average contribution to the gene pool of the next generation, made by the same individua ...
, but humans evolved to have other objectives. Fitness corresponds to (2), the specified goal used in the training environment and training data. In evolutionary history, maximizing the fitness specification led to intelligent agents, humans, that do not directly pursue inclusive genetic fitness. Instead, they pursue emergent goals (3) that correlated with genetic fitness in the ancestral environment: nutrition, sex, and so on. However, our environment has changed — a
distribution shift has occurred. Humans still pursue their emergent goals, but this no longer maximizes genetic fitness. (In machine learning the analogous problem is known as ''goal'' ''misgeneralization''.
) Our taste for sugary food (an emergent goal) was originally beneficial, but now leads to overeating and health problems. Also, by using contraception, humans directly contradict genetic fitness. By analogy, if genetic fitness were the objective chosen by an AI developer, they would observe the model behaving as intended in the training environment, without noticing that the model is pursuing an unintended emergent goal until the model was deployed.
Research directions to detect and remove misaligned emergent goals include red teaming, verification, anomaly detection, and interpretability.
Progress on these techniques may help reduce two open problems. Firstly, emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time until its misalignment is detected. Such high stakes are common in autonomous driving, health care, and military applications. The stakes become higher yet when AI systems gain more autonomy and capability, becoming capable of sidestepping human interventions (see ). Secondly, a sufficiently capable AI system may take actions that falsely convince the human supervisor that the AI is pursuing the intended objective (see previous discussion on deception at ).
Power-seeking and instrumental goals
Since the 1950s, AI researchers have sought to build advanced AI systems that can achieve goals by predicting the results of their actions and making long-term plans. However, some researchers argue that suitably advanced planning systems will default to seeking power over their environment, including over humans — for example by evading shutdown and acquiring resources. This power-seeking behavior is not explicitly programmed but emerges because power is instrumental for achieving a wide range of goals.
Power-seeking is thus considered a
''convergent instrumental goal''.
Power-seeking is uncommon in current systems, but advanced systems that can foresee the long-term results of their actions may increasingly seek power. This was shown in formal work which found that optimal
reinforcement learning agents will seek power by seeking ways to gain more options, a behavior that persists across a wide range of environments and goals.
Power-seeking already emerges in some present systems.
Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in ways their designers did not intend.
Other systems have learned, in toy environments, that in order to achieve their goal, they can prevent human interference
or disable their off-switch.
Russell
Russell may refer to:
People
* Russell (given name)
* Russell (surname)
* Lady Russell (disambiguation)
* Lord Russell (disambiguation)
Places Australia
*Russell, Australian Capital Territory
*Russell Island, Queensland (disambiguation)
**Ru ...
illustrated this behavior by imagining a robot that is tasked to fetch coffee and evades being turned off since "you can't fetch the coffee if you're dead".
Hypothesized ways to gain options include AI systems trying to:
“''... break out of a contained environment; hack; get access to financial resources, or additional computing resources; make backup copies of themselves; gain unauthorized capabilities, sources of information, or channels of influence; mislead/lie to humans about their goals; resist or manipulate attempts to monitor/understand their behavior ... impersonate humans; cause humans to do things for them; ... manipulate human discourse and politics; weaken various human institutions and response capacities; take control of physical infrastructure like factories or scientific laboratories; cause certain types of technology and infrastructure to be developed; or directly harm/overpower humans.''”
Researchers aim to train systems that are 'corrigible': systems that do not seek power and allow themselves to be turned off, modified, etc. An unsolved challenge is ''reward hacking'': when researchers penalize a system for seeking power, the system is incentivized to seek power in difficult-to-detect ways.
To detect such covert behavior, researchers aim to create techniques and tools to inspect AI models
and interpret the inner workings of
black-box models such as neural networks.
Additionally, researchers propose to solve the problem of systems disabling their off-switches by making AI agents uncertain about the objective they are pursuing.
Agents designed in this way would allow humans to turn them off, since this would indicate that the agent was wrong about the value of whatever action they were taking prior to being shut down. More research is needed to translate this insight into usable systems.
Power-seeking AI is thought to pose unusual risks. Ordinary safety-critical systems like planes and bridges are not ''adversarial''. They lack the ability and incentive to evade safety measures and appear safer than they are. In contrast, power-seeking AI has been compared to a hacker that evades security measures.
Further, ordinary technologies can be made safe through trial-and-error, unlike power-seeking AI which has been compared to a virus whose release is irreversible since it continuously evolves and grows in numbers—potentially at a faster pace than human society, eventually leading to the disempowerment or extinction of humans.
It is therefore often argued that the alignment problem must be solved early, before advanced power-seeking AI is created.
However, some critics have argued that power-seeking is not inevitable, since humans do not always seek power and may only do so for evolutionary reasons. Furthermore, there is debate whether any future AI systems need to pursue goals and make long-term plans at all.
Embedded agency
Work on scalable oversight largely occurs within formalisms such as
POMDPs. Existing formalisms assume that the agent's algorithm is executed outside the environment (i.e. not physically embedded in it). Embedded agency
is another major strand of research which attempts to solve problems arising from the mismatch between such theoretical frameworks and real agents we might build. For example, even if the scalable oversight problem is solved, an agent which is able to gain access to the computer it is running on may still have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.
A list of examples of specification gaming from
DeepMind
DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research laboratory founded in 2010. DeepMind was List of mergers and acquisitions by Google, acquired by Google in 2014 and became a wholly owned subsid ...
researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing.
This class of problems has been formalised using causal incentive diagrams.
Researchers at
Oxford and
DeepMind
DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research laboratory founded in 2010. DeepMind was List of mergers and acquisitions by Google, acquired by Google in 2014 and became a wholly owned subsid ...
have argued that such problematic behavior is highly likely in advanced systems, and that advanced systems would seek power to stay in control of their reward signal indefinitely and certainly.
They suggest a range of potential approaches to address this open problem.
Skepticism of AI risk
Against the above concerns, AI risk skeptics believe that
superintelligence poses little to no risk of dangerous misbehavior. Such skeptics often believe that controlling a superintelligent AI will be trivial. Some skeptics, such as
Gary Marcus, propose adopting rules similar to the fictional
Three Laws of Robotics which directly specify a desired outcome ("direct normativity"). By contrast, most endorsers of the existential risk thesis (as well as many skeptics) consider the Three Laws to be unhelpful, due to those three laws being ambiguous and self-contradictory. (Other "direct normativity" proposals include Kantian ethics, utilitarianism, or a mix of some small list of enumerated desiderata.) Most risk endorsers believe instead that human values (and their quantitative trade-offs) are too complex and poorly-understood to be directly programmed into a superintelligence; instead, a superintelligence would need to be programmed with a ''process'' for acquiring and fully understanding human values ("indirect normativity"), such as
coherent extrapolated volition.
Public policy
A number of governmental and treaty organizations have made statements emphasizing the importance of AI alignment.
In September 2021, the
Secretary-General of the United Nations issued a declaration which included a call to regulate AI to ensure it is "aligned with shared global values."
That same month, the
PRC published ethical guidelines for the use of AI in China. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and is not endangering public safety.
Also in September 2021, the
UK published its 10-year National AI Strategy, which states the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for ... the world, seriously". The strategy describes actions to assess long term AI risks, including catastrophic risks.
In March 2021, the US National Security Commission on Artificial Intelligence released stated that "Advances in AI ... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness. The US should ... ensure that AI systems and their uses align with our goals and values."
See also
*
Existential risk from artificial general intelligence
*
AI takeover
*
AI capability control
*
Regulation of artificial intelligence
The regulation of artificial intelligence is the development of public sector policies and laws for promoting and regulating artificial intelligence (AI); it is therefore related to the broader regulation of algorithms. The regulatory and policy l ...
*
Artificial wisdom
Artificial wisdom is a software system that can demonstrate one or more qualities of being wise.
Artificial wisdom can be described as artificial intelligence
Artificial intelligence (AI) is intelligence—perceiving, synthesizing, and i ...
*
HAL 9000
*
Multivac
*
Open Letter on Artificial Intelligence
In January 2015, Stephen Hawking, Elon Musk, and dozens of artificial intelligence experts signed an open letter on artificial intelligence calling for research on the societal impacts of AI. The letter affirmed that society can reap great potent ...
*
Toronto Declaration The Toronto Declaration: Protecting the Rights to Equality and Non-Discrimination in Machine Learning Systems is a declaration that advocates responsible practices for machine learning practitioners and governing bodies. It is a joint statement issu ...
*
Asilomar Conference on Beneficial AI
The Asilomar Conference on Beneficial AI was a conference organized by the Future of Life Institute,“Future of Life Institute 2017 Asilomar Conference.” ai-ethics.com. https://ai-ethics.com/2017/08/11/future-of-life-institute-2017-asilomar-con ...
Footnotes
References
{{Existential risk from artificial intelligence, state=expanded
Existential risk from artificial general intelligence
Singularitarianism
Philosophy of artificial intelligence
Computational neuroscience