The free energy principle is a mathematical principle in biophysics and cognitive science that provides a formal account of the representational capacities of physical systems: that is, why things that exist look as if they track properties of the systems to which they are coupled. It establishes that the dynamics of physical systems minimise a quantity known as surprisal (the negative log probability of some outcome) or, equivalently, its variational upper bound, called free energy. The principle is formally related to variational Bayesian methods and was originally introduced by Karl Friston as an explanation for embodied perception-action loops in neuroscience, where it is also known as active inference.

The free energy principle models the behaviour of systems that are distinct from, but coupled to, another system (e.g., an embedding environment), where the degrees of freedom that implement the interface between the two systems are known as a Markov blanket. More formally, the free energy principle says that if a system has a "particular partition" (i.e., into particles, with their Markov blankets), then subsets of that system will track the statistical structure of other subsets (which are known as internal and external states or paths of a system).

The free energy principle is based on the Bayesian idea of the brain as an "inference engine". Under the free energy principle, systems pursue paths of least surprise, or equivalently, minimize the difference between predictions based on their model of the world and their sense and associated perception. This difference is quantified by variational free energy and is minimized either by continuously correcting the system's model of the world or by acting on the world so that it comes closer to the system's predictions. Friston assumes this to be the principle of all biological reaction (Shaun Raviv, "The Genius Neuroscientist Who Might Hold the Key to True AI", Wired, 13 November 2018). Friston also believes his principle applies to mental disorders as well as to artificial intelligence. AI implementations based on the active inference principle have been reported to show advantages over other methods.

Although challenging even for experts, the free energy principle is ultimately quite simple and fundamental, and can be re-derived from conventional mathematics following maximum entropy inference. Indeed, it can be shown that any large enough random dynamical system will display the kind of boundary that allows one to apply the free energy principle to model its dynamics: the probability of finding a Markov blanket in the underlying potential of the system (and therefore of being able to apply the free energy principle) approaches 1 as the size of the system goes to infinity.

The free energy principle is a mathematical principle of information physics: much like the principle of maximum entropy or the principle of least action, it is true on mathematical grounds. To attempt to falsify the free energy principle is a category mistake, akin to trying to falsify calculus by making empirical observations. (One cannot invalidate a mathematical theory in this way; instead, one would need to derive a formal contradiction from the theory.) In a 2018 interview, Friston explained what it entails for the free energy principle to not be subject to falsification: "the free energy principle is what it is – a principle. Like Hamilton's principle of stationary action, it cannot be falsified. It cannot be disproven. In fact, there's not much you can do with it, unless you ask whether measurable systems conform to the principle."

Background

The notion that self-organising biological systems – like a cell or brain – can be understood as minimising variational free energy is based upon Helmholtz's work on unconscious inference (Helmholtz, H. (1866/1962). Concerning the perceptions in general. In Treatise on physiological optics (J. Southall, Trans., 3rd ed., Vol. III). New York: Dover. Available at https://web.archive.org/web/20180320133752/http://poseidon.sunyopt.edu/BackusLab/Helmholtz/) and subsequent treatments in psychology and machine learning. Variational free energy is a function of observations and a probability density over their hidden causes. This variational density is defined in relation to a probabilistic model that generates predicted observations from hypothesized causes. In this setting, free energy provides an approximation to Bayesian model evidence. Therefore, its minimisation can be seen as a Bayesian inference process. When a system actively makes observations to minimise free energy, it implicitly performs active inference and maximises the evidence for its model of the world.

Free energy is also an upper bound on the self-information (surprise) of outcomes, and the long-term average of surprise is entropy. This means that if a system acts to minimise free energy, it will implicitly place an upper bound on the entropy of the outcomes – or sensory states – it samples.
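The upper-bound property can be checked numerically on a toy problem. The following is a minimal sketch, assuming a hypothetical two-state discrete model; the names `prior`, `likelihood` and `free_energy` and all numbers are illustrative and not taken from the literature. It evaluates free energy for several candidate variational densities and compares it with the surprise of the observation.

```python
import math

# Hypothetical two-state generative model (all numbers are illustrative).
prior = [0.7, 0.3]        # p(psi): prior over hidden causes
likelihood = [0.9, 0.2]   # p(o | psi): probability of the observed outcome o


def free_energy(q):
    """Variational free energy F = E_q[-log p(o, psi)] - H[q]."""
    total = 0.0
    for q_i, p_i, l_i in zip(q, prior, likelihood):
        if q_i > 0.0:
            total += q_i * (math.log(q_i) - math.log(l_i * p_i))
    return total


evidence = sum(l * p for l, p in zip(likelihood, prior))   # p(o)
surprise = -math.log(evidence)                             # -log p(o)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

# Free energy never falls below surprise; the exact posterior attains the bound.
for q in ([0.5, 0.5], [0.99, 0.01], posterior):
    print(q, free_energy(q), ">=", surprise)
```

In this sketch the gap between free energy and surprise is the Kullback–Leibler divergence between the candidate density and the exact posterior, which is why it closes only when the two coincide.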

Relationship to other theories

Active inference is closely related to the good regulator theorem and related accounts of self-organisation, such as self-assembly, pattern formation, autopoiesis and practopoiesis. It addresses the themes considered in cybernetics, synergetics and embodied cognition. Because free energy can be expressed as the expected energy of observations under the variational density minus its entropy, it is also related to the maximum entropy principle. Finally, because the time average of energy is action, the principle of minimum variational free energy is a principle of least action.

Action and perception

Active inference applies the techniques of approximate Bayesian inference to infer the causes of sensory data from a 'generative' model of how that data is caused and then uses these inferences to guide action. Bayes' rule characterizes the probabilistically optimal inversion of such a causal model, but applying it is typically computationally intractable, leading to the use of approximate methods. In active inference, the leading class of such approximate methods are variational methods, for both practical and theoretical reasons: practical, as they often lead to simple inference procedures; and theoretical, because they are related to fundamental physical principles, as discussed above. These variational methods proceed by minimizing an upper bound on the divergence between the Bayes-optimal inference (or 'posterior') and its approximation according to the method. This upper bound is known as the free energy, and we can accordingly characterize perception as the minimization of the free energy with respect to inbound sensory information, and action as the minimization of the same free energy with respect to outbound action information. This holistic dual optimization is characteristic of active inference, and the free energy principle is the hypothesis that all systems which perceive and act can be characterized in this way.

In order to exemplify the mechanics of active inference via the free energy principle, a generative model must be specified, and this typically involves a collection of probability density functions which together characterize the causal model. One such specification is as follows. The system is modelled as inhabiting a state space $X$, in the sense that its states form the points of this space. The state space is then factorized according to $X = \Psi\times S\times A\times R$, where $\Psi$ is the space of 'external' states that are 'hidden' from the agent (in the sense of not being directly perceived or accessible), $S$ is the space of sensory states that are directly perceived by the agent, $A$ is the space of the agent's possible actions, and $R$ is a space of 'internal' states that are private to the agent. The generative model is then the specification of the following density functions:

* A sensory model, $p_S:\Psi\times A\times S\to\mathbb{R}$, often written as $p_S(s\mid\psi,a)$, characterizing the likelihood of sensory data given external states and actions;
* a stochastic model of the environmental dynamics, $p_\Psi:\Psi\times A\times\Psi\to\mathbb{R}$, often written $p_\Psi(\psi_t\mid\psi_{t-1},a)$, characterizing how the external states are expected by the agent to evolve over time $t$, given the agent's actions;
* an action model, $p_A:R\times S\times A\to\mathbb{R}$, written $p_A(a\mid\mu,s)$, characterizing how the agent's actions depend upon its internal states and sensory data; and
* an internal model, $p_R:S\times R\to\mathbb{R}$, written $p_R(\mu\mid s)$, characterizing how the agent's internal states depend upon its sensory data.

These density functions determine the factors of a "joint model", which represents the complete specification of the generative model, and which can be written as

$$p(s,\psi_t,a,\mu\mid\psi_{t-1}) = p_S(s\mid\psi_t,a)\,p_\Psi(\psi_t\mid\psi_{t-1},a)\,p_A(a\mid\mu,s)\,p_R(\mu\mid s).$$

Bayes' rule then determines the "posterior density" $p_{\text{Bayes}}(\psi_t\mid s,a,\mu,\psi_{t-1})$, which expresses a probabilistically-optimal belief about the external state $\psi_t$ given the preceding state and the agent's actions, sensory signals, and internal states. Since computing $p_{\text{Bayes}}$ is computationally intractable, the free energy principle asserts the existence of a "variational density" $q(\psi_t\mid s,a,\mu,\psi_{t-1})$, where $q$ is an approximation to $p_{\text{Bayes}}$. One then defines the free energy as

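$$
\begin{aligned}
\underbrace{F(\mu,a;s)}_{\text{free energy}}
&= \underbrace{\mathbb{E}_{q(\psi_t)}\left[-\log p(s,\psi_t,a,\mu\mid\psi_{t-1})\right]}_{\text{expected energy}} - \underbrace{\mathbb{H}\left[q(\psi_t\mid s,a,\mu,\psi_{t-1})\right]}_{\text{entropy}} \\
&= \underbrace{-\log p(s,a,\mu\mid\psi_{t-1})}_{\text{surprise}} + \underbrace{\mathrm{KL}\left[q(\psi_t\mid s,a,\mu,\psi_{t-1}) \parallel p_{\text{Bayes}}(\psi_t\mid s,a,\mu,\psi_{t-1})\right]}_{\text{divergence}} \\
&\geq \underbrace{-\log p(s,a,\mu\mid\psi_{t-1})}_{\text{surprise}}
\end{aligned}
$$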

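and defines action and perception as the joint optimization problem

$$a(t) = \underset{a}{\operatorname{arg\,min}}\; F(\mu(t), a; s(t))$$

$$\mu(t) = \underset{\mu}{\operatorname{arg\,min}}\; F(\mu, a(t); s(t))$$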

where the internal states $\mu$ are typically taken to encode the parameters of the 'variational' density $q$ and hence the agent's "best guess" about the posterior belief over $\Psi$. Note that the free energy is also an upper bound on a measure of the agent's (marginal, or average) sensory surprise, and hence free energy minimization is often motivated by the minimization of surprise.
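The claim that free energy bounds surprise from above follows from the product rule in two steps. The derivation below is a sketch under the factorisation given earlier in this section; the shorthand $z = (s, a, \mu, \psi_{t-1})$ is introduced here purely for compactness and does not appear in the source.

```latex
\begin{aligned}
p(s,\psi_t,a,\mu\mid\psi_{t-1})
  &= p_{\text{Bayes}}(\psi_t\mid z)\; p(s,a,\mu\mid\psi_{t-1})
  && \text{(product rule)}\\[4pt]
F &= \mathbb{E}_{q}\!\left[\log q(\psi_t\mid z) - \log p(s,\psi_t,a,\mu\mid\psi_{t-1})\right]
  && \text{(expected energy minus entropy)}\\
  &= \mathbb{E}_{q}\!\left[\log \frac{q(\psi_t\mid z)}{p_{\text{Bayes}}(\psi_t\mid z)}\right]
     - \log p(s,a,\mu\mid\psi_{t-1})\\
  &= \mathrm{KL}\!\left[q \parallel p_{\text{Bayes}}\right] - \log p(s,a,\mu\mid\psi_{t-1})\\
  &\geq -\log p(s,a,\mu\mid\psi_{t-1})
  && \text{(since } \mathrm{KL}\geq 0\text{)}
\end{aligned}
```

Equality holds exactly when $q$ coincides with $p_{\text{Bayes}}$, so minimising free energy with respect to $q$ both tightens the bound and improves the approximation to the posterior.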

Free energy minimisation

Free energy minimisation and self-organisation

Free energy minimisation has been proposed as a hallmark of self-organising systems when cast as random dynamical systems. This formulation rests on a Markov blanket (comprising action and sensory states) that separates internal and external states. If internal states and action minimise free energy, then they place an upper bound on the entropy of sensory states:

$$\lim_{T\to\infty} \frac{1}{T} \underbrace{\int_0^T F(s(t),\mu(t))\,dt}_{\text{free action}} \ge \lim_{T\to\infty} \frac{1}{T} \int_0^T \underbrace{-\log p(s(t)\mid m)}_{\text{surprise}}\,dt = H[p(s\mid m)]$$

This is because – under ergodic assumptions – the long-term average of surprise is entropy. This bound resists a natural tendency to disorder – of the sort associated with the second law of thermodynamics and the fluctuation theorem. However, formulating a unifying principle for the life sciences in terms of concepts from statistical physics, such as random dynamical systems, non-equilibrium steady state and ergodicity, places substantial constraints on the theoretical and empirical study of biological systems, with the risk of obscuring the very features that make biological systems interesting kinds of self-organizing systems.
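The statement that the long-term average of surprise is entropy can be illustrated with a simple simulation. This is a sketch only: it assumes a hypothetical three-state distribution `p` and uses independent draws as a stand-in for an ergodic process, which is a much stronger assumption than ergodicity itself.

```python
import math
import random

# Hypothetical stationary distribution over three sensory states (illustrative).
p = {"a": 0.5, "b": 0.25, "c": 0.25}

# Entropy H[p] in nats.
entropy = -sum(prob * math.log(prob) for prob in p.values())

# Stand-in for an ergodic process: independent draws from p with a fixed seed.
rng = random.Random(0)
samples = rng.choices(list(p.keys()), weights=list(p.values()), k=200_000)

# Time average of surprise -log p(s(t)) along the sampled path.
mean_surprise = sum(-math.log(p[s]) for s in samples) / len(samples)

print(f"entropy       = {entropy:.4f} nats")
print(f"mean surprise = {mean_surprise:.4f} nats")
```

Because free energy upper-bounds surprise at every instant, any path that keeps free energy low also keeps this running average, and hence the entropy of sampled states, low.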

Free energy minimisation and Bayesian inference

All Bayesian inference can be cast in terms of free energy minimisation. When free energy is minimised with respect to internal states, the Kullback–Leibler divergence between the variational and posterior density over hidden states is minimised. This corresponds to approximate Bayesian inference – when the form of the variational density is fixed – and exact Bayesian inference otherwise. Free energy minimisation therefore provides a generic description of Bayesian inference and filtering (e.g., Kalman filtering). It is also used in Bayesian model selection, where free energy can be usefully decomposed into complexity and accuracy:

$$\underbrace{F(s,\mu)}_{\text{free energy}} = \underbrace{D_{\mathrm{KL}}\left[q(\psi\mid\mu) \parallel p(\psi\mid m)\right]}_{\text{complexity}} - \underbrace{\mathbb{E}_q\left[\log p(s\mid\psi,m)\right]}_{\text{accuracy}}$$

Models with minimum free energy provide an accurate explanation of data, under complexity costs (cf. Occam's razor and more formal treatments of computational costs). Here, complexity is the divergence between the variational density and prior beliefs about hidden states (i.e., the effective degrees of freedom used to explain the data).
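The complexity and accuracy terms are easy to compute for a discrete model. The sketch below assumes a hypothetical three-state model with illustrative numbers; the variable names are chosen here and are not from the source. It also cross-checks the decomposition against the "expected energy minus entropy" form of free energy.

```python
import math

# Hypothetical discrete model with three hidden states (numbers are illustrative).
prior = [0.5, 0.3, 0.2]        # p(psi | m)
likelihood = [0.1, 0.6, 0.3]   # p(s | psi, m) for the observed s
q = [0.2, 0.6, 0.2]            # variational density q(psi | mu)

# Complexity: divergence of the variational density from the prior.
complexity = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
# Accuracy: expected log-likelihood of the data under the variational density.
accuracy = sum(qi * math.log(li) for qi, li in zip(q, likelihood))
free_energy = complexity - accuracy

# Cross-check against the expected-energy-minus-entropy form.
expected_energy = -sum(
    qi * math.log(li * pi) for qi, li, pi in zip(q, likelihood, prior)
)
entropy = -sum(qi * math.log(qi) for qi in q)

# Negative log evidence, which free energy bounds from above.
surprise = -math.log(sum(li * pi for li, pi in zip(likelihood, prior)))

print(f"complexity = {complexity:.4f}, accuracy = {accuracy:.4f}")
print(f"free energy = {free_energy:.4f} >= surprise = {surprise:.4f}")
```

A density that moves far from the prior to fit the data pays for it in complexity, which is the sense in which free energy penalises over-fitting during model comparison.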

Free energy minimisation and thermodynamics

Variational free energy is an information-theoretic functional and is distinct from thermodynamic (Helmholtz) free energy. However, the complexity term of variational free energy shares the same fixed point as Helmholtz free energy (under the assumption the system is thermodynamically closed but not isolated). This is because if sensory perturbations are suspended (for a suitably long period of time), complexity is minimised (because accuracy can be neglected). At this point, the system is at equilibrium and internal states minimise Helmholtz free energy, by the principle of minimum energy.

Free energy minimisation and information theory

Free energy minimisation is equivalent to maximising the mutual information between sensory states and internal states that parameterise the variational density (for a fixed entropy variational density). This relates free energy minimization to the principle of minimum redundancy.

Free energy minimisation in neuroscience

Free energy minimisation provides a useful way to formulate normative (Bayes optimal) models of neuronal inference and learning under uncertainty and therefore subscribes to the Bayesian brain hypothesis. The neuronal processes described by free energy minimisation depend on the nature of hidden states, $\Psi = X \times \Theta \times \Pi$, which can comprise time-dependent variables, time-invariant parameters and the precision (inverse variance or temperature) of random fluctuations. Minimising free energy with respect to variables, parameters, and precision corresponds to inference, learning, and the encoding of uncertainty, respectively.

Perceptual inference and categorisation

Free energy minimisation formalises the notion of unconscious inference in perception and provides a normative (Bayesian) theory of neuronal processing. The associated process theory of neuronal dynamics is based on minimising free energy through gradient descent. This corresponds to generalised Bayesian filtering (where ~ denotes a variable in generalised coordinates of motion and $D$ is a derivative matrix operator):

$$\dot{\tilde{\mu}} = D\tilde{\mu} - \partial_{\mu} F(s,\mu)\Big|_{\mu=\tilde{\mu}}$$

Usually, the generative models that define free energy are non-linear and hierarchical (like cortical hierarchies in the brain). Special cases of generalised filtering include Kalman filtering, which is formally equivalent to predictive coding – a popular metaphor for message passing in the brain. Under hierarchical models, predictive coding involves the recurrent exchange of ascending (bottom-up) prediction errors and descending (top-down) predictions (Mumford, D. (1992). On the computational architecture of the neocortex II. Biol. Cybern., 66, 241–51) that is consistent with the anatomy and physiology of sensory and motor systems.
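A gradient descent on free energy can be sketched for the simplest possible case. The example below assumes a hypothetical single-level Gaussian model with a point estimate `mu` and a static hidden state, so the generalised-motion term is omitted; all names and values are illustrative rather than taken from any published scheme.

```python
# Hypothetical single-level Gaussian model (all values are illustrative):
#   prior       p(psi)     = N(prior_mean, 1 / pi_prior)
#   likelihood  p(s | psi) = N(psi,        1 / pi_sense)
prior_mean, pi_prior = 0.0, 1.0
s, pi_sense = 2.0, 4.0


def free_energy(mu):
    """Free energy of a point estimate mu, up to an additive constant."""
    return 0.5 * pi_sense * (s - mu) ** 2 + 0.5 * pi_prior * (mu - prior_mean) ** 2


mu, step = prior_mean, 0.05
for _ in range(500):
    eps_sense = pi_sense * (s - mu)            # precision-weighted sensory prediction error
    eps_prior = pi_prior * (mu - prior_mean)   # precision-weighted prior prediction error
    mu += step * (eps_sense - eps_prior)       # mu_dot = -dF/dmu

# Closed-form posterior mean for this conjugate model, for comparison.
exact = (pi_sense * s + pi_prior * prior_mean) / (pi_sense + pi_prior)
print(f"gradient descent: {mu:.6f}, exact posterior mean: {exact:.6f}")
```

The update is driven entirely by two prediction errors, one ascending from the data and one descending from the prior, which is the message-passing pattern that predictive coding generalises to hierarchies.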

Perceptual learning and memory

In predictive coding, optimising model parameters through a gradient descent on the time integral of free energy (free action) reduces to associative or Hebbian plasticity and is associated with synaptic plasticity in the brain.

Perceptual precision, attention and salience

Optimizing the precision parameters corresponds to optimizing the gain of prediction errors (cf. Kalman gain). In neuronally plausible implementations of predictive coding, this corresponds to optimizing the excitability of superficial pyramidal cells and has been interpreted in terms of attentional gain.
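As a minimal illustration of precision acting as a gain, assume a single-level Gaussian model that is not taken from the source: a prior $N(m, 1/\pi_p)$ on the hidden state and a likelihood $N(\psi, 1/\pi_s)$ for the sensation $s$. The symbols $m$, $\pi_p$, $\pi_s$ and $K$ are introduced here for the sketch only.

```latex
\begin{aligned}
F(\mu) &= \tfrac{1}{2}\,\pi_s\,(s-\mu)^2 + \tfrac{1}{2}\,\pi_p\,(\mu-m)^2 + \text{const}\\
\partial_\mu F = 0 \;&\Rightarrow\; \mu^{*} = m + K\,(s-m),
\qquad K = \frac{\pi_s}{\pi_s+\pi_p}
\end{aligned}
```

In this toy case the sensory prediction error $s-m$ is scaled by $K$, which grows with sensory precision relative to prior precision, in the same way that a Kalman gain weights new measurements against the current estimate.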

Concerning the top-down versus bottom-up controversy, which has been addressed as a major open problem of attention, a computational model has succeeded in illustrating the circular nature of the reciprocation between top-down and bottom-up mechanisms. Using an established emergent model of attention, namely SAIM, the authors suggested a model called PE-SAIM that, in contrast to the standard version, approaches selective attention from a top-down stance. The model takes into account the forward prediction errors sent to the same level or a level above in order to minimize the energy function indicating the difference between data and its cause or, in other words, between the generative model and the posterior. To enhance validity, they also incorporated neural competition between the stimuli in their model. A notable feature of this model is the reformulation of the free energy function only in terms of prediction errors during task performance:

\dfrac=x^_-b^\varepsilon^_+b^\sum_(\varepsilon^_)

where $E^{total}$ is the total energy function of the neural networks and $\varepsilon$ is the prediction error between the generative model (prior) and the posterior, which changes over time. Comparing the two models reveals a notable similarity between their respective results while also highlighting a remarkable discrepancy: in the standard version of SAIM, the model's focus is mainly upon the excitatory connections, whereas in PE-SAIM the inhibitory connections are leveraged to make an inference. The model has also been reported to predict EEG and fMRI data drawn from human experiments with high precision. In the same vein, Yahya et al. also applied the free energy principle to propose a computational model for template matching in covert selective visual attention that mostly relies on SAIM. According to this study, the total free energy of the whole state space is reached by inserting top-down signals into the original neural networks, which yields a dynamical system comprising both feed-forward and backward prediction errors.

Active inference

When gradient descent is applied to action, $\dot{a} = -\partial_a F(s,\tilde{\mu})$, motor control can be understood in terms of classical reflex arcs that are engaged by descending (corticospinal) predictions. This provides a formalism that generalizes the equilibrium point solution – to the degrees of freedom problem – to movement trajectories.
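The way action fulfils predictions can be sketched with a one-dimensional toy agent. Everything in this example is a labelled assumption: the `setpoint`, the additive world model `s = x + a`, unit precisions, and the step size are hypothetical choices made for illustration, not a published model.

```python
# Hypothetical one-dimensional agent (all values are illustrative).
# The agent expects to sense `setpoint`; the world delivers s = x + a,
# where x is an external disturbance and a is the agent's action.
setpoint = 37.0    # prior expectation about the sensed quantity
x = 30.0           # external state, not directly accessible to the agent
mu, a = 0.0, 0.0   # internal estimate and action
step = 0.1


def sense(action):
    """Sensory state produced by the world for a given action."""
    return x + action


def free_energy(s, mu):
    """Squared sensory and prior prediction errors (unit precisions)."""
    return 0.5 * (s - mu) ** 2 + 0.5 * (mu - setpoint) ** 2


for _ in range(2000):
    s = sense(a)
    dF_dmu = -(s - mu) + (mu - setpoint)
    dF_da = (s - mu) * 1.0   # chain rule: dF/ds * ds/da, with ds/da = 1
    mu -= step * dF_dmu      # perception: mu_dot = -dF/dmu
    a -= step * dF_da        # action:     a_dot  = -dF/da

print(f"mu = {mu:.4f}, a = {a:.4f}, s = {sense(a):.4f}")
```

Perception alone would settle on a compromise between the sensation and the prior; adding the action update lets the agent change the sensation itself until it matches the prior expectation, which is the reflex-like behaviour described above.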

Active inference and optimal control

Active inference is related to optimal control by replacing value or cost-to-go functions with prior beliefs about state transitions or flow. This exploits the close connection between Bayesian filtering and the solution to the Bellman equation. However, active inference starts with (priors over) flow $f = \Gamma \cdot \nabla V + \nabla \times W$ that are specified with scalar $V(x)$ and vector $W(x)$ value functions of state space (cf. the Helmholtz decomposition). Here, $\Gamma$ is the amplitude of random fluctuations and cost is $c(x) = f \cdot \nabla V + \nabla \cdot \Gamma \cdot V$. The priors over flow $p(\tilde{x}\mid m)$ induce a prior over states $p(x\mid m) = \exp(V(x))$ that is the solution to the appropriate forward Kolmogorov equations. In contrast, optimal control optimises the flow, given a cost function, under the assumption that $W = 0$ (i.e., the flow is curl free or has detailed balance). Usually, this entails solving backward Kolmogorov equations.
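The link between the flow and the prior over states can be verified directly in the curl-free case. The sketch below makes assumptions that go beyond the text: it sets $W = 0$, treats $\Gamma$ as a constant scalar diffusion coefficient, and writes the forward Kolmogorov (Fokker–Planck) equation in its standard form for that setting.

```latex
\begin{aligned}
\partial_t p &= -\nabla\cdot(f\,p) + \Gamma\,\nabla^2 p
  && \text{(forward Kolmogorov equation, assumed form)}\\
f\,p &= \Gamma\,(\nabla V)\,e^{V} = \Gamma\,\nabla p
  && \text{(with } f = \Gamma\,\nabla V,\ p = e^{V}\text{)}\\
\partial_t p &= -\nabla\cdot(\Gamma\,\nabla p) + \Gamma\,\nabla^2 p = 0
\end{aligned}
```

Under these assumptions $p(x\mid m) = \exp(V(x))$ is stationary, so specifying the scalar potential $V$ is equivalent to specifying which states the system expects to occupy in the long run.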

Active inference and optimal decision (game) theory

Optimal decision problems (usually formulated as partially observable Markov decision processes) are treated within active inference by absorbing utility functions into prior beliefs. In this setting, states that have a high utility (low cost) are states an agent expects to occupy. By equipping the generative model with hidden states that model control, policies (control sequences) that minimise variational free energy lead to high utility states. Neurobiologically, neuromodulators such as dopamine are considered to report the precision of prediction errors by modulating the gain of principal cells encoding prediction error (Friston, K. J., Shiner, T., FitzGerald, T., Galea, J. M., Adams, R., Brown, H., Dolan, R. J., Moran, R., Stephan, K. E., Bestmann, S. (2012). Dopamine, affordance and active inference. PLoS Comput. Biol., 8(1), e1002327). This is closely related to – but formally distinct from – the role of dopamine in reporting prediction errors per se and related computational accounts.
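Absorbing utility into prior beliefs can be illustrated with a small discrete example. This is a sketch under explicit assumptions: the utilities, the three named policies and their predicted outcome distributions are hypothetical, and scoring policies by the Kullback–Leibler divergence between predicted and preferred outcomes is a simplified stand-in for the full free energy of a policy.

```python
import math

# Hypothetical utilities over three outcomes (illustrative numbers).
utility = [0.0, 1.0, 3.0]

# Absorb utility into a prior preference: p(o) proportional to exp(utility).
normaliser = sum(math.exp(u) for u in utility)
preferred = [math.exp(u) / normaliser for u in utility]

# Hypothetical policies, each predicting a distribution over outcomes.
policies = {
    "stay": [0.8, 0.1, 0.1],
    "explore": [0.3, 0.4, 0.3],
    "exploit": [0.1, 0.1, 0.8],
}


def kl(q, p):
    """Kullback-Leibler divergence KL[q || p] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def expected_utility(q):
    """Expected utility of the outcomes predicted under q."""
    return sum(qi * ui for qi, ui in zip(q, utility))


# Score each policy by how far its predicted outcomes are from the preferences.
scores = {name: kl(q, preferred) for name, q in policies.items()}
best = min(scores, key=scores.get)

for name, q in policies.items():
    print(f"{name:8s} divergence = {scores[name]:.3f}, "
          f"expected utility = {expected_utility(q):.2f}")
print("selected policy:", best)
```

With these numbers the policy closest to the preferred outcomes is also the one with the highest expected utility, which shows how a prior over outcomes can play the role that a utility function plays in classical decision theory.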

Active inference and cognitive neuroscience

Active inference has been used to address a range of issues in cognitive neuroscience, brain function and neuropsychiatry, including action observation, mirror neurons, saccades and visual search, eye movements, sleep, illusions, attention, action selection, consciousness, hysteria and psychosis. Explanations of action in active inference often depend on the idea that the brain has 'stubborn predictions' that it cannot update, leading to actions that cause these predictions to come true.

External links

Behavioral and Brain Sciences (by Andy Clark)