3 Varieties of prediction error minimization

The central idea here is that, on average and over the long run, surprising states should be avoided; that is, prediction error should be minimized. Prediction error minimization can occur in a number of ways, all familiar from debates about inference to the best explanation and from many descriptions of scientific and statistical inference.

First, the model parameters can be revised in the light of prediction error, which will gradually reduce the error and improve the model fit. This is perception, and corresponds to how a scientist seeks to explain away surprising evidence by revising a hypothesis. This perceptual process was alluded to above.
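To make this concrete, here is a minimal sketch of perceptual inference as parameter revision, assuming the model is a single estimate of a hidden cause updated by a simple delta rule (the learning rate and the sample values are illustrative, not from the original):

```python
# Perception as revising a model parameter in the light of prediction error.
# A minimal delta-rule sketch; rate and samples are illustrative assumptions.
mu = 0.0      # current estimate of the hidden cause
rate = 0.1    # learning rate: how much each error revises the hypothesis

for y in [2.1, 1.9, 2.0, 2.2, 1.8]:  # incoming sensory samples
    error = y - mu                    # prediction error
    mu += rate * error                # revise the hypothesis, not the world

print(f"revised estimate: {mu:.2f}")  # drifts toward the true cause (~2.0)
```

Each update reduces the average prediction error, which is the sense in which perception improves model fit.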

Slightly more formally, this idea can be expressed in terms of the free energy principle. The free energy (or sum of prediction error) equals the surprise (the negative log probability of the sensory evidence, given the model) plus a KL-divergence between the selected hypothesis (the hypothesis about the causes of the sensory input, which the system can change in order to change the free energy) and the true posterior probability of the hypothesis given the input and the model. Since the KL-divergence is never negative, the free energy bounds (is never smaller than) the surprise. The system cannot evaluate the surprise directly, but by minimizing the divergence it makes the free energy approximate the surprise, so that minimizing free energy implicitly minimizes surprise.
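Rendered as an equation (the notation is added here for clarity: y is the sensory input, ϑ the hypothesized causes, m the model, and q(ϑ) the selected hypothesis), this is the standard variational decomposition:

```latex
F \;=\; \underbrace{-\ln p(y \mid m)}_{\text{surprise}}
\;+\; \underbrace{D_{\mathrm{KL}}\!\left[\, q(\vartheta) \;\|\; p(\vartheta \mid y, m) \,\right]}_{\text{divergence} \;\geq\; 0}
```

Because the divergence is non-negative, F is an upper bound on the surprise; minimizing F with respect to q(ϑ) tightens the bound, and the minimizing q(ϑ) approximates the true posterior. This is the sense in which perception approximates Bayesian inference.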

Second, the model parameters can be kept stable and used to generate predictions—in particular, proprioceptive predictions, which are delivered to the classic reflex arcs and fulfilled there until the expected sensory input is obtained. This is action, and corresponds to how a scientist may retain a hypothesis and control the environment for confounds until the expected evidence obtains. Since action is prediction error minimization with a different direction of fit, it is labeled active inference.
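A minimal sketch of this direction of fit, assuming a single proprioceptive channel and a fixed reflex gain (all names and numbers are illustrative):

```python
# Active inference sketch: the prediction is held fixed and the body state
# changes until the proprioceptive prediction error is fulfilled.
predicted_angle = 30.0   # proprioceptive prediction (degrees), held fixed
actual_angle = 0.0       # joint angle currently sensed by proprioceptors
gain = 0.5               # reflex arc gain (illustrative)

while abs(predicted_angle - actual_angle) > 1e-3:
    error = predicted_angle - actual_angle  # proprioceptive prediction error
    actual_angle += gain * error            # reflex arc moves the limb

print(f"limb settles at {actual_angle:.3f} degrees")
```

Note that the same error term appears as in the perceptual case; what differs is which side of the relation is allowed to change.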

Slightly more formally (and still following Friston), this notion of action arises from another rearrangement of the free energy principle. Here, free energy equals complexity minus accuracy. Complexity may be taken as the opposite of simplicity, and is measured as a KL-divergence between the prior probability of the hypothesis (i.e., before the evidence came in) and the hypothesis selected in the light of the evidence. Intuitively, this divergence is large if many changes had to be made to the hypothesis to fit the evidence, that is, if the selected hypothesis is significantly more complex than the old one. Accuracy is (the negative of) the surprise about the sensory input given the selected hypothesis, that is, a measure of how well the hypothesis fits the input. Free energy is minimized by changing the sensory data such that accuracy increases. If the selected hypothesis is not changed, then this amounts to sampling the evidence selectively such that it becomes less surprising. This can only happen through action, where the organism re-organizes its sensory organs, its whole body, or the world in such a way that it receives the expected sensory data (e.g., holding something closer in order to smell it).
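In the same notation as before, this rearrangement reads:

```latex
F \;=\; \underbrace{D_{\mathrm{KL}}\!\left[\, q(\vartheta) \;\|\; p(\vartheta \mid m) \,\right]}_{\text{complexity}}
\;-\; \underbrace{\mathbb{E}_{q(\vartheta)}\!\left[\ln p(y \mid \vartheta, m)\right]}_{\text{accuracy}}
```

Holding the selected hypothesis q(ϑ) fixed, the only remaining way to reduce F is to change the sensory input y so that accuracy increases; that change of y is action.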

There are further questions one must ask about action: how are goals chosen, and how do we work out how to attain them? The free energy principle can be brought to bear on these questions too. In a very basic way, our goals are determined by our expected interoceptive and proprioceptive states, which form the basis of homeostasis. If we assume that we can approximate these expected states, as described above, what remains is a learning task: how do we maintain ourselves in them? This relies on internal models of the world, including, crucially, models of how we ourselves, through our actions, affect the sensory input that bears on our internal states. Further, we need to minimize the divergence between, on the one hand, the states we can reach from a given point and, on the other, the states we expect to be in. Research is in progress to work out the details of this ambitious part of the free energy program.
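One hedged way to render the required divergence formally (the notation is a reconstruction, not in the original): writing s for the current state, π for a policy or course of action, p(s' | s, π) for the states reachable under that policy, and p(s' | m) for the states the organism expects to occupy, a good policy is one whose predicted consequences diverge least from the expected states:

```latex
\pi^{*} \;=\; \arg\min_{\pi} \; D_{\mathrm{KL}}\!\left[\, p(s' \mid s, \pi) \;\|\; p(s' \mid m) \,\right]
```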

Third, the model parameters can be simplified (cf. complexity reduction), such that the model is neither underfitted nor overfitted, both of which will generate prediction error in the long run. This corresponds to Bayesian model selection, where complexity is penalized, and also to how a scientist will prefer simpler models in the long run, even though a more complex model may fit the current evidence very well. The rationale is quite intuitive: a complex model is tailored to a particular situation, with its particular situation-specific, more or less noisy, interfering factors. It will therefore generalize poorly to new situations, on the assumption that the world is a fairly noisy place with state-dependent uncertainty. To minimize prediction error in the long run, it is therefore better to have less complex models. Conversely, when encountering a new situation, one should not make overly radical changes to one's prior model. One way to ensure this is to pick the model that makes the least radical changes while still explaining the new data within expected levels of noise. This is just what Bayesian model selection amounts to, and it is enshrined in the formulations of the free energy principle. A good example is what happens during sleep, when there is no trustworthy sensory input and the brain instead seems to resort to complexity reduction on synthetic data (Hobson & Friston 2012).
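The point about generalization can be illustrated with a small simulation, assuming a sinusoidal regularity observed under noise and two polynomial models of different complexity (all particulars here are illustrative):

```python
# Why simpler models minimize prediction error in the long run.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Noisy observations of an underlying regularity (here a sine)."""
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + rng.normal(0.0, 0.3, n)

x_old, y_old = sample(20)      # the current evidence
x_new, y_new = sample(1000)    # new situations encountered later

for degree in (3, 9):          # a simple and a complex model
    coeffs = np.polyfit(x_old, y_old, degree)
    fit_err = np.mean((np.polyval(coeffs, x_old) - y_old) ** 2)
    gen_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: fit error {fit_err:.3f}, "
          f"long-run error {gen_err:.3f}")
```

The complex model fits the old sample more closely but typically incurs more prediction error on the new samples, which is the long-run cost of complexity.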

Fourth, the hypotheses can be modulated according to the precision of prediction error, such that prediction error minimization occurs on the basis of trustworthy prediction error; this amounts to gain control, and functionally becomes attention. This corresponds to the need to assess variance in statistical inference, and to how a scientist is guided by, and seeks out, measurements that are expected to be precise rather than measurements that are expected to be imprecise.

Precision optimization is attention because it issues in a process of weighting some prediction errors more than others, where the weights need to sum to one in order to be meaningful. Hence, peaks across the prediction error landscape reflect both the magnitude of the prediction error per se and the weight given to that error on the basis of how precise it is expected to be. This moves the prediction error minimization effort around, much as one would expect the searchlight of attention to move around.
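A minimal sketch of such gain control, taking precisions as inverse variances and normalizing so that the weights sum to one (the numbers are illustrative):

```python
# Precision-weighted prediction error as normalized gain control.
import numpy as np

errors = np.array([0.8, 0.8, 0.2])       # raw prediction errors
variances = np.array([4.0, 0.25, 0.25])  # expected noise on each channel
precisions = 1.0 / variances             # precision = inverse variance

weights = precisions / precisions.sum()  # gains sum to one
weighted = weights * errors

print("weights:        ", weights.round(3))
print("weighted errors:", weighted.round(3))
# The two equal raw errors (0.8) receive very different weight: the
# imprecise channel is largely ignored, as in attentional gain control.
```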

Within this framework, there is room for both endogenous and exogenous attention. Endogenous attention is top-down modulation of prediction error gain based on learned patterns of precision. Exogenous attention is an intrinsic gain operation on error units, sparked by the current signal strength in the sensory input; it is based on a very basic learned regularity of nature, namely that strong signals tend to have a high signal-to-noise ratio, that is, high precision.

In all this, there is a very direct link between perception, action, and attention, which serves to illustrate some of the key characteristics of the framework. In particular, expected precision drives action, such that sensory sampling is guided by hypotheses that the system expects will generate precise prediction error. A very simple example is hand movement. For hand movement to occur, the system needs to prioritize one of two competing hypotheses. The first hypothesis is that the hand is not moving, which predicts a particular kind of (unchanging) proprioceptive and kinesthetic input; the second (the false one) is that the hand is moving, which predicts a different (changing) flow of proprioceptive and kinesthetic input. Movement will only occur if the second hypothesis is prioritized, which corresponds to the agent harboring the belief that the hand is actually moving. If this belief wins, then proprioceptive predictions are passed to the body, where classic reflex arcs fulfill them. Movement is then conceived as a kind of self-fulfilling prophecy.

A crucial question here is how the actually false hypothesis might be prioritized, given that the actually true hypothesis (that the agent is not moving) has evidence in its favor (since the agent is in fact not moving). Here expected precisions play a role, which means that action essentially turns into an attentional phenomenon: in rather revisionist terms, agency reduces to self-organization guided by long-term prediction error minimization. Hypotheses can be prioritized on the basis of their expected precision: if future proprioceptive input is expected to be more precise than current proprioceptive input, the gain on the current input will be turned down, depriving the hypothesis that the agent is not moving of evidence. The balance then shifts in favor of the actually false hypothesis, which can begin to pass its predictions to the sensorimotor system. This essentially inferential process is what causes movement to occur. It is an attentional process because acting occurs when attention is withdrawn from the actual input (Brown et al. 2013).
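To see how turning down the gain on current input can make the actually false hypothesis win, consider a toy posterior computation over the two hypotheses (the priors, likelihoods, and gain values are illustrative assumptions):

```python
# Lowering the gain on current proprioceptive input lets "moving" win.
import numpy as np

def posterior_moving(gain):
    """P(hand is moving | unchanging input), with gain scaling the
    log-likelihood of the current proprioceptive evidence."""
    log_lik_still  = gain * np.log(0.9)   # unchanging input fits "still" well
    log_lik_moving = gain * np.log(0.3)   # and fits "moving" poorly
    prior_still, prior_moving = 0.4, 0.6  # agent mildly expects to move
    post = np.exp([np.log(prior_still) + log_lik_still,
                   np.log(prior_moving) + log_lik_moving])
    return (post / post.sum())[1]

for gain in (1.0, 0.2, 0.0):
    print(f"gain {gain:.1f}: P(moving) = {posterior_moving(gain):.2f}")
# As the gain drops, the evidence favoring "still" is discounted and the
# prior in favor of moving carries the posterior.
```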

The outstanding issue for this story about what it takes to act in the world is why there should be an expectation that future proprioceptive input will be more precise than the current input. One possibility is that this rests on a prior expectation that exploration (and hence movement) yields greater prediction error minimization gains in the long run than does staying put. Equivalently, it is the expectation that the current state will lose its high-precision status over time. Writ large, this is a prior expectation concerning precisions (i.e., a hyperprior), which says that the world is a changing place, such that one should not retain the same hypotheses for too long: when the posterior probability of a hypothesis becomes the new prior, it will soon begin to decrease in probability. This is an important point, because it shows that the ability to shift attention around in order to cause action is not itself an action performed by a homunculus. Rather, it is just a further element of extracting statistical information (about precisions) from the world.