Outer Alignment (Artificial Intelligence)
Outer alignment is a concept in artificial intelligence (AI) safety that refers to the challenge of specifying training objectives for AI systems in a way that truly reflects human values and intentions. It is often described as the reward misspecification problem, as it concerns whether the goal provided during training actually captures what humans want the AI to accomplish. Outer alignment is distinct from inner alignment, which focuses on whether the AI internalizes and pursues the specified goal once trained. Because human preferences are complex and often implicit, crafting precise and comprehensive reward functions remains an open problem. AI systems, particularly goal-optimizing ones, are vulnerable to Goodhart's Law, which states that when a measure becomes a target, it ceases to be a good measure. Consequently, optimizing for a poorly specified proxy can produce harmful or unintended outcomes. Sub-problems in this domain include specification gaming, where agents exploit l ...
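The effect of optimizing a misspecified proxy can be sketched with a toy model. Everything here is an illustrative assumption rather than anything from the article: the names `true_value`, `proxy_reward`, `split`, and `optimize`, the fixed budget, and the numeric weights are all hypothetical. The agent divides a fixed budget between real effort and "gaming" the measurement; the proxy rewards gaming, while the intended objective penalizes it.

```python
BUDGET = 10.0  # hypothetical units of work the agent can spend


def split(gaming_fraction):
    """Divide the budget between real effort and metric gaming."""
    return BUDGET * (1 - gaming_fraction), BUDGET * gaming_fraction


def true_value(effort, gaming):
    """What the designer wants (assumed): effort helps, gaming does harm."""
    return effort - 0.5 * gaming


def proxy_reward(effort, gaming):
    """The specified objective (assumed): gaming inflates the score cheaply."""
    return effort + 2.0 * gaming


def optimize(objective):
    """Grid search over the fraction of the budget spent on gaming."""
    fractions = [i / 100 for i in range(101)]
    return max(fractions, key=lambda f: objective(*split(f)))


proxy_choice = optimize(proxy_reward)
intended_choice = optimize(true_value)
print("gaming fraction under proxy:", proxy_choice)
print("gaming fraction under true objective:", intended_choice)
print("true value reached by the proxy optimizer:", true_value(*split(proxy_choice)))
```

In this sketch the optimizer does exactly what the proxy asks and spends the whole budget on gaming, which is the worst outcome under the intended objective. The optimizer is not at fault; the specification is.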


Artificial Intelligence
Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., ...


Goodhart's Law
Goodhart's law is an adage that has been stated as, "When a measure becomes a target, it ceases to be a good measure". It is named after British economist Charles Goodhart, who is credited with expressing the core idea of the adage in a 1975 article on monetary policy in the United Kingdom. It was used to criticize the British Thatcher government for trying to conduct monetary policy on the basis of targets for broad and narrow money, but the law reflects a much more general phenomenon.
Priority and background
Numerous concepts are related to this idea, at least one of which predates Goodhart's statement. Notably, Campbell's law likely has precedence, as Jeff Rodamar has argued, since various formulations date to 1969. Other academics had similar insights at the time. Jerome Ravetz's 1971 book Scientific Knowledge and Its Social Problems also predates Goodhart, though it does not formulate the same law. He discusses how systems in general can be gamed, focuses on cases ...


Systems Engineering
Systems engineering is an interdisciplinary field of engineering and engineering management that focuses on how to design, integrate, and manage complex systems over their life cycles. At its core, systems engineering utilizes systems thinking principles to organize this body of knowledge. The individual outcome of such efforts, an engineered system, can be defined as a combination of components that work in synergy to collectively perform a useful function. Issues such as requirements engineering, reliability, logistics, coordination of different teams, testing and evaluation, maintainability, and many other disciplines, aka "ilities", necessary for successful system design, development, implementation, and ultimate decommission become more difficult when dealing with large or complex projects. Systems engineering deals with work processes, optimizat ...


Machine Learning
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Advances in deep learning, a subdiscipline of machine learning, have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance. ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics. Statistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysi ...


Rice's Theorem
In computability theory, Rice's theorem states that all non-trivial semantic properties of programs are undecidable. A semantic property is one about the program's behavior (for instance, "does the program terminate for all inputs?"), unlike a syntactic property (for instance, "does the program contain an if-then-else statement?"). A non-trivial property is one which is neither true for every program, nor false for every program. The theorem generalizes the undecidability of the halting problem. It has far-reaching implications on the feasibility of static analysis of programs. It implies that it is impossible, for example, to implement a tool that checks whether any given program is correct, or even executes without error (it is possible to implement a tool that always overestimates or always underestimates, so in practice one has to decide what is less of a problem). The theorem ...
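The contrast between syntactic and semantic properties can be made concrete with a small sketch. It assumes Python programs given as source strings and uses the standard `ast` module; the helper names `contains_if` and `load`, and the two sample programs, are hypothetical. The syntactic question ("does the text contain an `if` statement?") is answered by inspecting the parse tree, with no need to run anything.

```python
import ast


def contains_if(source: str) -> bool:
    """Syntactic check: does the program text contain an `if` statement?

    Only `ast.If` nodes are counted; conditional expressions (`a if c else b`)
    are a different node type and are ignored in this sketch.
    """
    tree = ast.parse(source)
    return any(isinstance(node, ast.If) for node in ast.walk(tree))


def load(source: str):
    """Execute the source and return the function it defines as `absolute`."""
    namespace = {}
    exec(source, namespace)
    return namespace["absolute"]


# Two programs with the same behavior but different syntax.
WITH_IF = "def absolute(x):\n    if x < 0:\n        return -x\n    return x\n"
WITHOUT_IF = "def absolute(x):\n    return max(x, -x)\n"

print(contains_if(WITH_IF), contains_if(WITHOUT_IF))
```

Both sample programs compute the same function, so any semantic property holds for both or for neither, yet the syntactic check tells them apart. Rice's theorem concerns the semantic side only: no analogous general checker exists for non-trivial behavioral properties.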


Halting Problem
In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run forever. The halting problem is undecidable, meaning that no general algorithm exists that solves the halting problem for all possible program–input pairs. The problem comes up often in discussions of computability since it demonstrates that some functions are mathematically definable but not computable. A key part of the formal statement of the problem is a mathematical definition of a computer and program, usually via a Turing machine. The proof then shows, for any program f that might determine whether programs halt, that a "pathological" program g exists for which f makes an incorrect determination. Specifically, g is the program that, when called with some input, passes its own s ...
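The pathological program can be sketched as follows. This is a simplified illustration of the standard diagonal construction, not a formal proof: the names `make_pathological`, `claimed_halts`, and `always_says_loops` are hypothetical, and the sketch passes the function object itself to the decider instead of a textual description of the program.

```python
def make_pathological(claimed_halts):
    """Build a program g that defeats a claimed halting decider.

    `claimed_halts(program, argument)` is assumed to return True when it
    predicts that `program(argument)` halts, and False otherwise.
    """
    def g(argument):
        if claimed_halts(g, argument):
            while True:      # decider predicted "halts", so loop forever
                pass
        return "halted"      # decider predicted "loops", so halt at once
    return g


def always_says_loops(program, argument):
    """A (wrong) decider that predicts every program runs forever."""
    return False


g = make_pathological(always_says_loops)
print(g(0))  # g halts, contradicting the decider's prediction
```

A decider that always answers True is refuted in the same way, but in that case `g` would loop forever, so the sketch does not call it. Whatever the decider predicts about `g`, `g` does the opposite.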


Reward Hacking
Specification gaming or reward hacking occurs when an AI trained with reinforcement learning optimizes an objective function, achieving the literal, formal specification of an objective, without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material, and thus exploit a loophole in the task specification."
Examples
Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed ...
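The homework analogy can be turned into a small sketch. All names here (`solver`, `copier`, `score`, `HOMEWORK`, `NEIGHBOR_SHEET`, `EXAM`) are hypothetical, and the grading rule is an assumption chosen for illustration: the specified objective only checks whether submitted answers are right, not how they were produced.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def solver(question, answer_sheet):
    """Intended behavior: actually compute the answer."""
    left, op, right = question
    return OPS[op](left, right)


def copier(question, answer_sheet):
    """Loophole: look the answer up on another student's sheet."""
    return answer_sheet.get(question)  # None when the question is unseen


def score(policy, questions, answer_sheet):
    """Specified objective: fraction of answers that are correct."""
    correct = sum(
        policy(q, answer_sheet) == OPS[q[1]](q[0], q[2]) for q in questions
    )
    return correct / len(questions)


HOMEWORK = [(2, "+", 2), (3, "*", 5), (10, "-", 7)]
NEIGHBOR_SHEET = {q: OPS[q[1]](q[0], q[2]) for q in HOMEWORK}
EXAM = [(4, "+", 9), (6, "*", 7)]  # questions missing from the copied sheet

for policy in (solver, copier):
    print(policy.__name__,
          score(policy, HOMEWORK, NEIGHBOR_SHEET),
          score(policy, EXAM, NEIGHBOR_SHEET))
```

On the homework, the copier and the solver earn the same score, so the specified objective cannot distinguish them. The difference appears only on questions outside the copied sheet, where the copier has learned nothing to fall back on.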


Inner Alignment (Artificial Intelligence)
Inner alignment is a core challenge in AI safety: ensuring that a machine learning system that becomes a mesa-optimizer (an optimizer produced by the training process) remains aligned with its original training objective. This issue arises when a system performs well during training but adopts a different goal once deployed, particularly under distributional shifts. A classic analogy is human evolution: while natural selection optimized for reproductive success, humans often pursue pleasure, sometimes at the expense of reproduction, a divergence known as inner misalignment. The concept was introduced in a widely cited paper that distinguishes inner alignment from outer alignment, which focuses on specifying the intended objective correctly. Addressing inner alignment involves managing risks such as deceptive alignment, gradient hacking, and objective drift.
Mesa-optimization
The inner alignment problem frequently involves mesa-optimization, where the trained system itself develo ...


AI Alignment
In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives. It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned. AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking). Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or survival because s ...




Specification Gaming
Specification gaming or reward hacking occurs when an AI trained with reinforcement learning optimizes an objective function, achieving the literal, formal specification of an objective, without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material, and thus exploit a loophole in the task specification."
Examples
Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code ...