Policy Gradient
Policy Gradient
Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function and derive a policy from it, policy optimization methods directly learn a policy function \pi that selects actions without consulting a value function. For policy gradient methods to apply, the policy function \pi_\theta must be parameterized by a differentiable parameter \theta.
Overview
In policy-based RL, the actor is a parameterized policy function \pi_\theta, where \theta are the parameters of the actor. The actor takes the state of the environment s as input and produces a probability distribution over actions \pi_\theta(\cdot \mid s). If the action space A is discrete, then \sum_{a \in A} \pi_\theta(a \mid s) = 1. If the action space is continuous, then \int_{A} \pi_\theta(a \mid s) \,\mathrm{d}a = 1. The goal of policy optimization is to find some \theta that maximizes the expected episodic reward J(\theta): J(\theta) = ...
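As an illustration of the idea, the sketch below implements a REINFORCE-style Monte Carlo policy-gradient update for a tabular softmax policy. The softmax parameterization, learning rate, discount factor, and episode format are illustrative assumptions, not part of the article above.

```python
# Minimal REINFORCE-style sketch (illustrative assumptions: a tabular
# softmax policy over a discrete action space, episodes given as
# (state, action, reward) tuples).
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def sample_action(theta, state):
    """Sample from pi_theta(. | s); theta has one row of logits per state."""
    probs = softmax(theta[state])
    return np.random.choice(len(probs), p=probs)

def reinforce_update(theta, episode, alpha=0.1, gamma=0.99):
    """One Monte Carlo policy-gradient update from a single episode.

    For a softmax policy, grad log pi_theta(a|s) = one_hot(a) - softmax(theta[s]);
    each step's gradient is scaled by the discounted return from that step.
    """
    returns = 0.0
    grads = np.zeros_like(theta)
    # Accumulate returns backwards so G_t = r_t + gamma * G_{t+1}.
    for state, action, reward in reversed(episode):
        returns = reward + gamma * returns
        probs = softmax(theta[state])
        grad_log = -probs
        grad_log[action] += 1.0
        grads[state] += returns * grad_log
    return theta + alpha * grads  # gradient *ascent* on J(theta)
```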


Reinforcement Learning
Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. Reinforcement learning differs from supervised learning in not needing labelled input-output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) with the goal of maximizing the cumulative reward (the feedback of which might be incomplete or delayed). The search for this balance is known as the exploration–exploitation dilemma. The environment is typically stated in the form of a Markov decision process (MDP), as many reinforcement learning algorithms use dyn ...
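The exploration–exploitation balance mentioned above is often illustrated with an epsilon-greedy rule; the short sketch below is a minimal example of that idea (the epsilon value and the list of action-value estimates are illustrative assumptions).

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit current estimates.

    q_values: list of estimated action values for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploitation
```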


Quadratic Programming
Quadratic programming (QP) is the process of solving certain mathematical optimization problems involving quadratic functions. Specifically, one seeks to optimize (minimize or maximize) a multivariate quadratic function subject to linear constraints on the variables. Quadratic programming is a type of nonlinear programming. "Programming" in this context refers to a formal procedure for solving mathematical problems. This usage dates to the 1940s and is not specifically tied to the more recent notion of "computer programming." To avoid confusion, some practitioners prefer the term "optimization", e.g., "quadratic optimization." Problem formulation The quadratic programming problem with n variables and m constraints can be formulated as follows. Given a real-valued, n-dimensional vector c, an n×n-dimensional real symmetric matrix Q, an m×n-dimensional real matrix A, and an m-dimensional real vector b, the objective of quadratic programming is to find an n-dimensional vector x, that ...
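As a concrete illustration, the sketch below solves a small equality-constrained instance of this problem by forming the KKT linear system. It assumes Q is symmetric positive definite and A has full row rank, and the numbers are illustrative; a general QP solver would also need to handle inequality constraints.

```python
import numpy as np

def solve_equality_qp(Q, c, A, b):
    """Minimize 1/2 x^T Q x + c^T x subject to A x = b by solving the
    KKT system  [Q  A^T; A  0] [x; lam] = [-c; b].
    Assumes Q symmetric positive definite and A of full row rank."""
    n, m = Q.shape[0], A.shape[0]
    kkt = np.block([[Q, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, b])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n]

# Illustrative instance: minimize x1^2 + x2^2 subject to x1 + x2 = 1.
Q = 2 * np.eye(2)            # objective 1/2 x^T (2I) x = x1^2 + x2^2
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
print(solve_equality_qp(Q, c, A, b))   # -> [0.5, 0.5]
```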



Deep Reinforcement Learning
Deep reinforcement learning (DRL) is a subfield of machine learning that combines principles of reinforcement learning (RL) and deep learning. It involves training agents to make decisions by interacting with an environment to maximize cumulative rewards, while using deep neural networks to represent policies, value functions, or environment models. This integration enables DRL systems to process high-dimensional inputs, such as images or continuous control signals, making the approach effective for solving complex tasks. Since the introduction of the deep Q-network (DQN) in 2015, DRL has achieved significant successes across domains including games, robotics, and autonomous systems, and is increasingly applied in areas such as healthcare, finance, and autonomous vehicles. Introduction Deep reinforcement learning (DRL) is part of machine learning, combining reinforcement learning (RL) and deep ...
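To make "using deep neural networks to represent policies" concrete, here is a minimal sketch of a policy represented by a small two-layer network that maps an observation to action probabilities. The layer sizes, activation function, and NumPy implementation are illustrative assumptions rather than any particular DRL system's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_policy(obs_dim, hidden, n_actions):
    """Tiny two-layer MLP policy: observation -> action probabilities."""
    return {
        "W1": rng.normal(scale=0.1, size=(obs_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def policy_probs(params, obs):
    h = np.tanh(obs @ params["W1"] + params["b1"])
    logits = h @ params["W2"] + params["b2"]
    z = np.exp(logits - logits.max())
    return z / z.sum()

params = init_policy(obs_dim=4, hidden=32, n_actions=2)
print(policy_probs(params, np.ones(4)))   # a probability distribution over 2 actions
```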





DeepSeek
Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., doing business as DeepSeek, is a Chinese artificial intelligence company that develops large language models (LLMs). Based in Hangzhou, Zhejiang, DeepSeek is owned and funded by the Chinese hedge fund High-Flyer. DeepSeek was founded in July 2023 by Liang Wenfeng, the co-founder of High-Flyer, who also serves as CEO of both companies. The company launched an eponymous chatbot alongside its DeepSeek-R1 model in January 2025. Released under the MIT License, DeepSeek-R1 provides responses comparable to those of other contemporary large language models, such as OpenAI's GPT-4 and o1. Its training cost was reported to be significantly lower than that of other LLMs. The company claims that it trained its V3 model for US$6 million, far less than the US$100 million cost of OpenAI's GPT-4 in 2023, and using approximately one-tenth the comput ...




Reasoning Language Model
Reasoning language models (RLMs) are large language models that have been further trained to solve multi-step reasoning tasks. These models perform better on logical, mathematical or programmatic tasks than traditional autoregressive LLMs, have the ability to backtrack, and employ test-time compute as an additional scaling axis beyond training examples, parameter count, and train-time compute. History 2024 o1-preview, an LLM with enhanced reasoning, was released in September 2024. The full version, o1, followed in December 2024. OpenAI also began sharing results on its successor, o3. The development of reasoning LLMs has illustrated what Rich Sutton termed the "bitter lesson": that general methods leveraging computation often outperform those relying on specific human insights. For instance, some research groups, such as the Generative AI Research Lab (GAIR), initially explored complex techniques like tree search and reinforcement learning in attempts to replicate o1's c ...



Standard Score
In statistics, the standard score or ''z''-score is the number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured. Raw scores above the mean have positive standard scores, while those below the mean have negative standard scores. It is calculated by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. This process of converting a raw score into a standard score is called standardizing or normalizing (however, "normalizing" can refer to many types of ratios; see ''Normalization (statistics)'' for more). Standard scores are most commonly called ''z''-scores; the two terms may be used interchangeably, as they are in this article. Other equivalent terms in use include z-value, z-statistic, normal score, standardized variable, and pull in high energy ...
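The definition translates directly into code; the test-score numbers in the example below are illustrative.

```python
def z_score(x, mean, std_dev):
    """Standard score: how many standard deviations x lies above or below the mean."""
    return (x - mean) / std_dev

# e.g. a raw score of 130 on a test with population mean 100 and SD 15:
print(z_score(130, 100, 15))   # -> 2.0
```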



F-divergence
In probability theory, an f-divergence is a certain type of function D_f(P \| Q) that measures the difference between two probability distributions P and Q. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence. History These divergences were introduced by Alfréd Rényi in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. ''f''-divergences were studied further independently by Csiszár, Morimoto, and Ali and Silvey, and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances. Definition Non-singular case Let P and Q be two probability distributions over a space \Omega, such that P \ll Q, that is, P is absolutely continuous with respect to Q (meaning Q>0 wherever P>0). Then, for a convex function f: [0, +\infty)\to(-\infty, +\infty] ...
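In the discrete case the definition reduces to D_f(P \| Q) = \sum_x Q(x)\, f(P(x)/Q(x)). The sketch below evaluates this sum for two of the special cases named above; the example distributions are illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete f-divergence D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)).
    Assumes q(x) > 0 wherever p(x) > 0 (P absolutely continuous w.r.t. Q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p = [0.5, 0.3, 0.2]   # illustrative distributions over three outcomes
q = [0.4, 0.4, 0.2]

kl = f_divergence(p, q, lambda t: t * np.log(t))        # KL divergence: f(t) = t log t
tv = f_divergence(p, q, lambda t: np.abs(t - 1) / 2)    # total variation: f(t) = |t - 1| / 2
print(kl, tv)
```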



Reinforcement Learning From Human Feedback
In machine learning, reinforcement learning from human feedback (RLHF) is a technique for aligning an intelligent agent with human preferences. It involves training a reward model to represent preferences, which can then be used to train other models through reinforcement learning. In classical reinforcement learning, an intelligent agent's goal is to learn a function that guides its behavior, called a policy. This function is iteratively updated to maximize rewards based on the agent's task performance. However, explicitly defining a reward function that accurately approximates human preferences is challenging. Therefore, RLHF seeks to train a "reward model" directly from human feedback. The reward model is first trained in a supervised manner to predict whether a response to a given prompt is good (high reward) or bad (low reward) based on ranking data collected from human annotators. This model then serves a ...
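One common concrete form of this supervised step is a pairwise (Bradley–Terry-style) ranking loss on the reward model's scores for a human-preferred response and a rejected one. The sketch below shows that loss with illustrative scores; the scoring function producing those numbers is assumed to exist elsewhere.

```python
import math

def reward_ranking_loss(r_preferred, r_rejected):
    """Pairwise ranking loss for a reward model:
    loss = -log sigmoid(r_preferred - r_rejected),
    which pushes the preferred response's score above the rejected one's."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_rejected))))

# Illustrative reward-model scores for two responses to the same prompt:
print(reward_ranking_loss(2.0, 0.5))   # small loss: ranking already correct
print(reward_ranking_loss(0.5, 2.0))   # larger loss: ranking is wrong
```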





Adam Optimizer
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. Background Both statistical estimation and ma ...
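A minimal sketch of the idea described above: the gradient is estimated from a randomly sampled mini-batch rather than the full data set, then the parameters are updated. The batch size, learning rate, and the grad_fn callback are illustrative assumptions.

```python
import random

def sgd_step(params, grad_fn, data, batch_size=32, lr=0.01):
    """One stochastic gradient descent step.

    params: list of parameter values.
    grad_fn(params, batch): returns a gradient estimate (one entry per parameter)
        computed from the mini-batch instead of the entire data set.
    """
    batch = random.sample(data, min(batch_size, len(data)))   # random subset of the data
    grads = grad_fn(params, batch)                            # stochastic gradient estimate
    return [p - lr * g for p, g in zip(params, grads)]        # descent update
```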