Mesa-optimization
The inner alignment problem frequently involves mesa-optimization, where the trained system itself develops the capability to optimize for its own objectives. During training with techniques like
Distinction from outer alignment
The distinction between inner and outer alignment was formalized in the paper ''Risks from Learned Optimization in Advanced Machine Learning Systems''. Outer alignment refers to ensuring that the training objective (also called the "outer objective", such as the loss function in supervised learning) correctly captures the goals and values intended by human designers. In contrast, inner alignment concerns whether the trained model actually pursues a goal that aligns with this outer objective. While the outer objective is explicitly specified, the inner objective is typically implicit and emerges from the training dynamics. This means that a model can perform well during training, and so appear aligned, yet internally optimize for a different goal once deployed, especially in novel contexts or under distributional shift. This risk is compounded by the fact that current machine learning systems do not provide a clear or transparent representation of their internal objectives. Consequently, significant misalignment can arise without being detected during development. This divergence is also explored in research on goal misgeneralization, which highlights how models can generalize their learned behavior in unintended ways that reflect internal goals differing from those specified by the outer training signal.
Practical illustrations
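A toy sketch can make the gap between the outer objective and a learned goal concrete. The Python example below is illustrative only and rests on stated assumptions: the two policies are hand-written stand-ins for what training might produce, and every name in it (`intended_policy`, `proxy_policy`, `outer_objective`, the episode lists) is hypothetical. It shows that a policy following a proxy feature can score identically to the intended policy on training data and diverge only when that proxy stops coinciding with the goal.

```python
from typing import Callable, List, Tuple

# Hypothetical toy setup: each observation is (goal_direction, marker_direction).
# goal_direction is the move that actually reaches the goal; marker_direction
# is where a visual marker happens to point.
Observation = Tuple[int, int]
Policy = Callable[[Observation], int]


def intended_policy(obs: Observation) -> int:
    """Stand-in for the goal the designers intended: move toward the goal."""
    goal_direction, _marker_direction = obs
    return goal_direction


def proxy_policy(obs: Observation) -> int:
    """Stand-in for a misgeneralized goal: follow the marker."""
    _goal_direction, marker_direction = obs
    return marker_direction


def outer_objective(policy: Policy, episodes: List[Observation]) -> float:
    """Fraction of episodes in which the policy moves toward the goal."""
    hits = sum(policy(obs) == obs[0] for obs in episodes)
    return hits / len(episodes)


# Training distribution: the marker always coincides with the goal.
train_episodes: List[Observation] = [(d, d) for d in range(4)] * 5

# Deployment distribution: the marker points somewhere else.
deploy_episodes: List[Observation] = [(d, (d + 1) % 4) for d in range(4)] * 5

for name, policy in [("intended", intended_policy), ("proxy", proxy_policy)]:
    train_score = outer_objective(policy, train_episodes)
    deploy_score = outer_objective(policy, deploy_episodes)
    print(f"{name}: train={train_score:.2f} deploy={deploy_score:.2f}")
```

Under these assumptions both policies reach a training score of 1.0, so the outer objective alone cannot tell them apart. Only the shifted deployment episodes separate them.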
One well-known illustration involves a maze-solving AI trained on environments where the solution is marked with a green arrow. The system learns to seek green arrows rather than to solve mazes. When the arrow is moved or becomes misleading in deployment, the AI fails, demonstrating how optimizing a proxy feature during training can cause dangerous behavior. Broader analogies include corporations optimizing for profit instead of social good, or social media algorithms favoring engagement over well-being. Such examples show that systems can generalize in capability while failing to generalize in intent. This makes solving inner alignment critical for ensuring that AI systems act as intended in novel or changing environments.
Definitional ambiguity
The meaning of inner alignment has been interpreted in multiple ways, leading to differing taxonomies of alignment failures. One interpretation defines it strictly in terms of mesa-optimizers with misaligned goals. A second, broader view includes any behavioral divergence between training and deployment, regardless of whether optimization is involved. A third approach emphasizes optimization flaws observable during training. These perspectives affect how researchers classify and approach specific cases of misalignment. Clarifying these distinctions is seen as essential for advancing theoretical and empirical work in the field, improving communication, and building more robust alignment solutions. Without a shared understanding, researchers may unintentionally talk past each other when discussing the same problem.
Strategic importance
There is a growing sense of urgency around solving inner alignment, especially as advanced AI systems approach general-purpose capabilities. Misalignment has already been observed in deployed systems, for example in recommendation algorithms that optimize for engagement rather than user well-being. Even seemingly minor misbehaviors, such as
Alternative framings
Several framings of the inner alignment problem have been proposed to clarify the conceptual boundaries between types of misalignment. One framing focuses on behavioral divergence in test environments: failures that arise due to bad training signals are classified as outer misalignment, while failures due to internal misgeneralized goals are classified as inner misalignment. A second framing considers the causal source of the failure, that is, whether it stems from the reward function or from the inductive biases of the training method. A third framing shifts to cognitive alignment, analyzing whether the AI’s internal goals match human values. A final framing considers alignment during continual learning, where models may evolve their goals post-deployment. Each approach highlights different risks and informs different research agendas.
See also
* Outer alignment