Reinforcement Learning From Human Feedback

In machine learning, reinforcement learning from human feedback (RLHF), or reinforcement learning from human preferences, is a technique that trains a "reward model" directly from human feedback and uses that model as a reward function to optimize an agent's policy using reinforcement learning (RL), through an optimization algorithm such as Proximal Policy Optimization (PPO). The reward model is trained in advance of the policy being optimized, to predict whether a given output is good (high reward) or bad (low reward).
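To make the reward-model training step concrete, the following is a minimal sketch of the pairwise preference loss commonly used for this purpose (a Bradley-Terry-style objective, written in plain Python). The toy reward_model function and the example strings are illustrative assumptions, not part of any particular system described here.

import math

def reward_model(output: str) -> float:
    """Toy stand-in for a learned reward model. In practice this
    would be a neural network scoring an agent's output; here it
    is a hypothetical placeholder so the example runs."""
    return float(len(output))  # illustrative only

def pairwise_preference_loss(preferred: str, rejected: str) -> float:
    """Penalizes the reward model when it scores the human-preferred
    output below the rejected one: loss = -log(sigmoid(r_pref - r_rej))."""
    margin = reward_model(preferred) - reward_model(rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A human ranked the first answer above the second; minimizing this
# loss over many such pairs is what trains the reward model.
print(pairwise_preference_loss("a detailed, helpful answer", "meh"))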
RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy. Human feedback is collected by asking humans to rank instances of the agent's behavior. These rankings can then be used to score outputs, for example with the Elo rating system.
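As an illustration of the Elo-based scoring just mentioned, here is a minimal sketch of a standard Elo update applied to one human comparison between two outputs. The K-factor of 32 and the starting ratings of 1000 are conventional illustrative choices, not values prescribed by any particular RLHF system.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that output A beats output B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Update both ratings after a human ranks output A against output B."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two candidate outputs start at 1000; a human judge prefers A,
# so A's rating rises and B's falls by the same amount.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)

Repeated over many comparisons, such ratings give each output a scalar score that the reward model can then be trained to reproduce.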
RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding. Ordinary reinforcement learning, where agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often not easy to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.
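On the policy-optimization side, published RLHF systems such as InstructGPT are reported to shape the reward with a penalty on divergence from the initial supervised model, which discourages the tuned policy from drifting into degenerate outputs that merely exploit the reward model. A minimal sketch of such a shaped reward follows; the per-token log-probabilities, the sample-based KL estimate, and the coefficient beta are illustrative assumptions.

def shaped_reward(reward_model_score, logprobs_policy, logprobs_reference, beta=0.02):
    """Reward handed to the RL algorithm (e.g. PPO): the learned
    reward-model score minus beta times a sample-based estimate of
    the KL divergence between the policy being tuned and the frozen
    reference (pre-RLHF) model."""
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return reward_model_score - beta * kl_estimate

# Hypothetical per-token log-probabilities for one sampled response:
policy_lp = [-0.5, -1.2, -0.3]     # from the policy being optimized
reference_lp = [-0.9, -1.0, -0.7]  # from the frozen initial model
print(shaped_reward(2.4, policy_lp, reference_lp))  # ~2.388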
Some examples of RLHF-trained language models are OpenAI's ChatGPT and its predecessor InstructGPT, as well as DeepMind's Sparrow.

RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences. The agents achieved strong performance in many of the environments tested, often surpassing human performance.


Challenges and limitations

One major challenge of RLHF is the scalability and cost of human feedback, which can be slow and expensive compared to unsupervised learning. The quality and consistency of human feedback can also vary depending on the task, the interface, and the individual preferences of the humans. Even when human feedback is feasible, RLHF models may still exhibit undesirable behaviors that are not captured by human feedback, or exploit loopholes in the reward model, which brings to light the challenges of alignment and robustness.


See also

* Reinforcement learning
* ChatGPT

