Making sense of the Kolmogorov backward equation (for diffusion processes)
I once struggled quite a lot to wrap my head around convex duality. I wouldn’t go as far as to say I understand it now, but at least I got used to it, and I feel like it “makes sense”. However I was recently again in a similar situation, with diffusion processes and specifically the Kolmogorov backward equation: I could follow its derivation on a formal level, but I had a hard time understanding what it means. In this post I write down some calculations that were somewhat helpful to me, to make sense of it.
Most of the content here is taken or extrapolated from the recent book [1, Chapters 3 and 8], which I find strikes a nice balance between concision and clarity. A more rigorous and detailed presentation can be found in the also very nice book [2, Chapters 6 and 10]. Unlike for previous posts, I don’t include a summary of the relevant background; let me just say that in my experience, formal manipulations of SDEs make intuitive sense even without knowing any of the theory behind, except for Ito’s formula which is not obvious at first (but easy to look up, e.g. in this 3minute YouTube video).
Notation.
”\(\nabla\)” denotes gradient and “\(\nabla \cdot\)” denotes divergence. For two matrices \(A, B \in \mathbb{R}^{d \times d}\), \(A:B = \sum_{ij} A_{ij} B_{ij} = \mathop{\mathrm{Tr}}(A B^\top)\). For a matrix field \(A: \mathbb{R}^d \to \mathbb{R}^{d \times d}\), \(\nabla^2 : A = \sum_{ij} \partial_i \partial_j A_{ij}\).
For \(\varphi \in C_b(\mathbb{R}^d)\) and \(\mu \in \mathcal{M}(\mathbb{R}^d)\), we write indifferently \(\left\langle \mu, \varphi \right\rangle\) or \(\mu \cdot \varphi\) or \(\mu \varphi\) to denote \(\int_{\mathbb{R}^d} \varphi d\mu\). For a transition kernel \(P\) and a probability distribution \(\mu\), we may write \(\mu P\) for \(P^* \mu\). For a finitestate timehomogeneous Markov chain for example, this means that we can represent \(P\) as a square matrix with \(P_{ij} = \mathbb{P}(X_{k+1} = j  X_k = i)\), probability distributions as row vectors, and test functions as column vectors; in particular, \(\mu P^k \varphi = \mathbb{E}_{X_0 \sim \mu}[\varphi(X_k)]\).
We will use the terms “diffusion process” and “solution of a SDE” interchangeably, following the remark of [1, Sec. 7.3]. This is justified more rigorously by [2, Sec. 9.79.9].
Setting and statement of the equations
Consider a particle whose spatial position \(X_t \in \mathbb{R}^d\) evolves in time according to the SDE
\[dX_t = b_t(X_t) + \sigma_t(X_t) dW_t.\]I will ignore regularity issues, so here \(b\) and \(\sigma\) are nice smooth functions, say, with uniformly bounded derivatives of all orders.
The forward equation.
If \(X_0 \sim \mu_0\), then \(\mu_t = \mathrm{Law}(X_t)\) is a distributional solution of the PDE, called FokkerPlanck equation or Kolmogorov Forward Equation,
\[\label{eq:setting:FP_KFE} \tag{KFE} \partial_t \mu_t = \nabla \cdot [b_t \mu_t] + \frac{1}{2} \nabla^2 : [\sigma_t \sigma_t^\top \mu_t] ~~~~\text{with initial condition}~~~~ \mu_0.\]By definition of distributional solutions, it suffices to check that for any \(\varphi \in C^\infty_c(\mathbb{R}^d)\) it holds
\[\frac{d}{dt} \mathbb{E}\varphi(X_t) = \nabla \varphi(X_t) \cdot b_t(X_t) + \frac{1}{2} \nabla^2 \varphi(X_t) : \sigma_t(X_t) \sigma_t(X_t)^\top,\]which can be shown straightforwardly by computing \(d \varphi(X_t)\) thanks to Ito’s formula and by taking expectations.
The backward equation.
For any fixed test function \(\varphi \in C^\infty_c\) and final time \(t\), let
\[\forall s \leq t, \forall y \in \mathbb{R}^d,~ u(y, s) = \mathbb{E}[\varphi(X_t)  X_s = y].\]Then \(u(y,s)\) is the unique solution to the PDE, called Kolmogorov Backward Equation,
\[\label{eq:setting:KBE} \tag{KBE} \partial_s u_s = \nabla u_s \cdot b_s + \frac{1}{2} \nabla^2 u_s : \sigma_s \sigma_s^\top ~~~~\text{with final condition}~~~~ u(\cdot, t) = \varphi(\cdot).\]Let us check that \(u(y, s) := \mathbb{E}[\varphi(X_t)  X_s = y]\) satisfies \(\eqref{eq:setting:KBE}\). The final condition \(u(\cdot, t) = \varphi(\cdot)\) is immediate from the definition of \(u\). Next consider the process \(u(X_\tau, \tau)\) for \(0 \leq \tau \leq t\), which by Ito’s formula evolves as
\[du(X_\tau, \tau) = \left[ \partial_s u(X_\tau, \tau) + \partial_y u(X_\tau, \tau) \cdot b_\tau(X_\tau) + \frac{1}{2} \partial_{yy}^2 u(X_\tau, \tau) : \sigma_\tau(X_\tau) \sigma_\tau(X_\tau)^\top \right] d\tau + [...] dW_\tau.\]Here \(\partial_s u\) denotes partial derivative of \(u\) w.r.t. its second variable, and \(\partial_y\), \(\partial_{yy}^2\) are its partial derivatives w.r.t. its first variable. Integrating over \(\tau \in [s, r]\) for some fixed \(s\) and \(r \leq t\), and taking expectations conditioned on \(X_s = y\) which we denote as \(\mathbb{E}^{y,s} = \mathbb{E}[\cdot  X_s=y]\), we have
\[\mathbb{E}^{y,s}[ u(X_r, r)  u(X_s, s) ] = \mathbb{E}^{y,s} \int_s^r \left[ \partial_s u(X_\tau, \tau) + \partial_y u(X_\tau, \tau) \cdot b_\tau(X_\tau) + \frac{1}{2} \partial_{yy}^2 u(X_\tau, \tau) : \sigma_\tau(X_\tau) \sigma_\tau(X_\tau)^\top \right] d\tau.\]Now by definition of \(u(y, s) = \mathbb{E}[\varphi(X_t)  X_s=y]\), the lefthand side simplifies as
\[u(X_r, r)  u(X_s, s) = \mathbb{E}[\varphi(X_t)  X_r=X_r]  \mathbb{E}[\varphi(X_t)  X_s=X_s] = 0\]and in particular its expectation is also zero. So, differentiating the above identity w.r.t. \(r\), we have
\[\forall s \leq r \leq t,~ \mathbb{E}^{y,s} \left[ \partial_s u(X_r, r) + \partial_y u(X_r, r) \cdot b_r(X_r) + \frac{1}{2} \partial_{yy}^2 u(X_r, r) : \sigma_r(X_r) \sigma_r(X_r)^\top \right] = 0.\]In particular evaluating at \(r=s\), we have
\[\forall s \leq t,~ \partial_s u(y, s) + \partial_y u(y, s) \cdot b_s(y) + \frac{1}{2} \partial_{yy}^2 u(y, s) : \sigma_s(y) \sigma_s(y)^\top = 0,\]which is the desired PDE \(\eqref{eq:setting:KBE}\). (There is a typo in [1, Sec. 8.3]: they do not introduce a free variable \(r \leq t\), and instead integrate the SDE followed by \(u(X_\tau, \tau)\) over all of \(\tau \in [s, t]\); then they say they differentiate w.r.t. \(t\), but here \(t\) was fixed before even defining \(u\).)
Conversely, the same calculations show that any (nice and regular enough) solution \(u(y,s)\) of \(\eqref{eq:setting:KBE}\) must be equal to \(\mathbb{E}[\varphi(X_t)  X_s=y]\). Indeed, consider the process \(u(X_\tau, \tau)\); write down the SDE that it follows by Ito’s formula; integrate it over \(\tau \in [s,t]\) and take expectations conditioned on \(X_s=y\). This yields
\[\mathbb{E}^{y,s}[ u(X_t, t)  u(X_s, s) ] = \mathbb{E}^{y,s} \int_s^t \left[ \partial_s u_\tau + \partial_y u_\tau \cdot b_\tau + \frac{1}{2} \partial_{yy}^2 u_\tau : \sigma_\tau \sigma_\tau^\top \right](X_\tau) d\tau = 0\]since \(u\) is a solution of \(\eqref{eq:setting:KBE}\). Hence, by the final condition \({ u(\cdot, t) = \varphi(\cdot) }\),
\(% \EE[ u(X_t, t)  X_s=y]  \EE[ u(X_s,s)  X_s=y] \mathbb{E}^{y,s}[ u(X_t, t)  u(X_s, s) ] = \mathbb{E}[\varphi(X_t)  X_s=y]  u(y,s) = 0.\)
The timehomogeneous case.
Somewhat confusingly, in the case of an autonomous process, i.e., when \(b_t(x) = b(x)\) and \(\sigma_t(x) = \sigma(x)\) do not depend on time, there is a different but very similarlooking way to formulate the Kolmogorov Backward Equation. Fix again a \(\varphi \in C^\infty_c\) and let
\[\forall t \geq 0,~ \forall x \in \mathbb{R}^d,~ v(x, t) = \mathbb{E}[\varphi(X_t)  X_0 = x].\]Then \(v(x,t)\) is a solution to the PDE
\[\label{eq:setting:KBE_homog} \partial_t v_t = \nabla v_t \cdot b + \frac{1}{2} \nabla^2 v_t : \sigma \sigma^\top ~~~~\text{with initial condition}~~~~ v(\cdot, 0) = \varphi(\cdot).\]This fact follows from the Kolmogorov Backward Equation with an appropriate change of variable:
\[\forall 0 \leq s \leq t, \forall y \in \mathbb{R}^d,~~ v(y, ts) = \mathbb{E}[\varphi(X_{ts})  X_0=y] = \mathbb{E}[\varphi(X_t)  X_s=y]\]by timehomogeneity.
The forward and backward equations are visibly connected, which is maybe not surprising since they both describe the same diffusion process. Our goal in the next section is to clarify the nature of the connection.
The Markov process point of view
To sum up: we consider a diffusion process over \(\mathbb{R}^d\) described by the SDE
\[\label{eq:markov:SDE} \tag{1} % dX_t = b(X_t, t) dt + \sigma(X_t, t) dW_t. dX_t = b_t(X_t) dt + \sigma_t(X_t) dW_t\]and we define the associated Kolmogorov Forward resp. Backward Equations as the PDEs
\[\begin{align} \label{eq:markov:KFE1} \tag{F1} \partial_t \mu_t &= \nabla \cdot [\mu_t b_t] + \frac{1}{2} \nabla^2 : [\mu_t \sigma_t \sigma_t^\top] ~~~~\text{with initial condition}~~~~ \mu_0 \\ \label{eq:markov:KBE1} \tag{B1} \text{and}~~~~ \partial_s u_s &= b_s \cdot \nabla u_s + \frac{1}{2} \sigma_s \sigma_s^\top : \nabla^2 u_s \qquad ~~~~\text{with final condition}~~~~ u(\cdot, t) = \varphi(\cdot).\end{align}\]We showed above that, if \(X_0 \sim \mu_0\) then \(\mu_t = \mathrm{Law}(X_t)\) is a solution of \(\eqref{eq:markov:KFE1}\), and for any fixed \(\varphi \in C^\infty_c(\mathbb{R}^d)\) and \(t > 0\), \(u(y,s) = \mathbb{E}[\varphi(X_t)  X_s=y]\) is the unique solution of \(\eqref{eq:markov:KBE1}\).
The forward and backward equations are “adjoints”.
Let \(\mathcal{L}_t\) the linear operator from \(C^\infty_c(\mathbb{R}^d)\) to itself defined by
\[(\mathcal{L}_t \varphi)(x) = b_t(x) \cdot \nabla \varphi(x) + \frac{1}{2} \sigma_t(x) \sigma_t(x)^\top : \nabla^2 \varphi(x),\]called the infinitesimal generator of the diffusion process. Let \(\mathcal{L}_t^*\) its \(L^2(\mathbb{R}^d)\) adjoint, i.e., the operator from \(\mathcal{M}(\mathbb{R}^d)\) to itself ^{1} such that \(\forall \varphi, \forall \mu, \int (\mathcal{L}_t \varphi) d\mu = \int \varphi d(\mathcal{L}_t^* \mu)\). By explicit computations (integration by parts) one can check it is given by
\[\mathcal{L}_t^* \mu = \nabla \cdot [\mu b_t] + \frac{1}{2} \nabla^2 : [\mu \sigma_t \sigma_t^\top].\]With these notations, \(\eqref{eq:markov:KFE1}\) and \(\eqref{eq:markov:KBE1}\) write respectively
\[\begin{align} \partial_t \mu_t &= \mathcal{L}_t^* \mu_t ~~~~\text{with initial condition}~~~~ \mu_0 \\ \text{and}~~~~ \partial_s u_s &= \mathcal{L}_s u_s ~~~~\text{with final condition}~~~~ u(\cdot, t) = \varphi(\cdot).\end{align}\]We have identified the sense in which the forward and backward equations are connected: their generators (in the sense of PDEs) are adjoints of each other, up to sign. But this still doesn’t tell me why they are connected like this... To phrase it differently, it was not clear from their interpretations as describing the evolutions of \(\mathrm{Law}(X_t)\) resp. of \(\mathbb{E}[\varphi(X_t)  X_s=y]\), that \(\eqref{eq:markov:KFE1}\) and \(\eqref{eq:markov:KBE1}\) should have adjoint generators. Next we unroll a point of view that makes it obvious that it must be the case.
The Markov transition kernels.
The solution \(X_t\) of \(\eqref{eq:markov:SDE}\) is a Markov process, i.e., \(\{ X_\tau\}_{\tau>t}\) is independent of \(\{ X_\tau\}_{\tau<t}\) conditionally on \(X_t\) for all \(t\); this can easily be checked by inspecting the definition of solutions of SDEs. Let \(\mathcal{P}^{s,t}\) the transition kernels of the Markov process \(X_t\), i.e., the operators such that
\[\forall s \leq t,~ \forall \varphi, \forall x,~ (\mathcal{P}^{s,t} \varphi)(x) = \mathbb{E}[\varphi(X_t)  X_s = x].\]By definition their \(L^2(\mathbb{R}^d)\) adjoints are given by \(\forall \mu, \left\langle (\mathcal{P}^{s,t})^* \mu, \bullet \right\rangle = \mathbb{E}_{X_s \sim \mu} [\bullet(X_t)]\). Equivalently and perhaps more intuitively,
\[\forall s \leq t,~~ X_s \sim \mu_s \implies X_t \sim \mu_s \mathcal{P}^{s,t} = \mu_t\](recall that we denote indifferently \(\mu_s \mathcal{P}^{s,t}\) or \((\mathcal{P}^{s,t})^* \mu_s\)). We can also write this symbolically, in terms of probability density functions, as
\[\mathcal{P}^{s,t}(y, dx) = \mathbb{P}(X_t \in dx  X_s = y) = p(x,t  y,s) dx\]where \(dx\) represents a small volume around \(x\), and \(\mathbb{P}\left( X_t \in B  X_s \in A \right) = \int_B dx \int_A dy~ p(x,t  y,s)\).
With these notations, the Kolmogorov Forward and Backward Equations \(\eqref{eq:markov:KFE1}\), \(\eqref{eq:markov:KBE1}\) write
\[\begin{align} \text{for any fixed $s$},~ \forall t \geq s,~ ~~~~ \partial_t p(\cdot,t  y,s) &= \mathcal{L}_t^* p(\cdot,t  y,s) ~~~~\text{w/ initial cond.}~~~~ p(\cdot,s  y,s) = \delta_y(\cdot), \label{eq:markov:KFE2} \tag{F2} \\ \text{for any fixed $t$},~ \forall s \leq t,~ ~~ \partial_s p(x,t  \cdot,s) &= \mathcal{L}_s p(x,t  \cdot,s) \quad ~~\text{w/ final cond.}~~~~ p(x,t  \cdot,t) = \delta_x(\cdot). \label{eq:markov:KBE2} \tag{B2} \end{align}\]It takes a few minutes of focus to check that the above formulas do indeed have the same meaning as the interpretations we gave for \(\eqref{eq:markov:KFE1}\) and \(\eqref{eq:markov:KBE1}\). It is worthwhile to stop and actually check. For \(\eqref{eq:markov:KBE1}\)/\(\eqref{eq:markov:KBE2}\), it can be helpful to start by testing both sides of \(\eqref{eq:markov:KBE2}\) against some fixed \(\varphi \in C^\infty_c\) (i.e., multiply by \(\varphi(x)\) and integrate w.r.t. \(x\), for each \(s\)).
With the above reformulations, I’m almost satisfied. Indeed the above equations are particular instances of general identities for (“regular enough”) Markov processes. It only remains to give an explanation of those general identities. For any Markov process with transition kernels \(\mathcal{P}^{s,t}\), note that by definition (of Markov processes and of the transition kernels), we have the ChapmanKolmogorov equation
\[\label{eq:markov:chapman_kolmo} \forall s \leq \tau \leq t,~ \mathcal{P}^{s,\tau} \mathcal{P}^{\tau,t} (y,dx) = \int_{\mathbb{R}^d} \mathcal{P}^{s,\tau}(y, dz) \mathcal{P}^{\tau,t}(z, dx) = \mathcal{P}^{s,t}(y,dx).\]The idea is that by differentiating these identities, one obtains differential equations that characterize the Markov process. To explain this in a nonconfusing way, I find it helpful to focus on the discretespace setting.
Finitestatespace heuristic.
For the duration of this paragraph, pretend that the space \(\mathbb{R}^d\) is discrete and even finite. Instead of \(p(x,t  y,s)\) with \(\int_{\mathbb{R}^d} p(x,t  y,s) dx = 1\), we will write the transition probabilities \(\mathbb{P}(X_t=x  X_s=y)\) as \(p^{s,t}_{yx}\) with \(\sum_{x \in \mathbb{R}^d} p^{s,t}_{yx} = 1\). Then we can write the ChapmanKolmogorov equation in matrix notation as
\[\label{eq:markov:champan_kolmo_discrete} \tag{2} \forall s \leq \tau \leq t,~~ \sum_{z \in \mathbb{R}^d} p^{s,\tau}_{yz}~ p^{\tau,t}_{zx} = p^{s,t}_{yx}, \forall y,x ~~~~\text{i.e.}~~~~ p^{s,\tau} p^{\tau,t} = p^{s,t}.\]Assume the following quantities \([Q_s]_{ij}\) exist (see [1, Sec. 3.5, Eq. (3.11)(3.12)] for sufficient conditions in terms of the \(p^{s,t}_{ij}\)):
\[Q_s = {\left.\frac{\partial}{\partial t} p^{s,t}\right_{t=s}} = \lim_{h \downarrow 0} \frac{p^{s,t+h}  p^{s,t}}{h}.\]The matrix \(Q_s\) is called the generator of the Markov jump process with transition kernels \(p^{s,t}\), and it turns out to completely characterize the Markov process. Now consider the following matrix identities, which are just rewritings of \(\eqref{eq:markov:champan_kolmo_discrete}\):
\[\forall s \leq t, \forall h \geq 0,~~~ \begin{cases} p^{s,t}~ p^{t,t+h} = p^{s, t+h} \\ p^{s,s+h}~ p^{s+h,t} = p^{s,t}. \end{cases}\]By differentiating w.r.t. \(h\) and evaluating at \(h=0\), we obtain
\[\begin{cases} p^{s,t}~ Q_t = \partial_t p^{s,t} \\ Q_s~ p^{s,t} + p^{s,s} \partial_s p^{s,t} = 0 \end{cases} ~~~~~~\text{i.e.,}~~~~ \begin{cases} \partial_t p^{s,t} = p^{s,t}~ Q_t \\ \partial_s p^{s,t} = Q_s~ p^{s,t} \end{cases}\]since \(p^{s,s}_{yx} = \mathbb{1}_{y=x}\) is the identity matrix. These equations are exactly the Kolmogorov Forward and Backward Equations for Markov jump processes. ^{2} Note the similarity with the corresponding equations for diffusion processes \(\eqref{eq:markov:KFE2}\), \(\eqref{eq:markov:KBE2}\)!
In this post, we started from a SDE and showed that its transition density function satisfies the PDEs with generators \(\mathcal{L}_t^*\) resp. \(\mathcal{L}_s\); we then interpreted those equations as particular instances of the general Kolmogorov Forward and Backward Equations. One can also go in the other direction and show that a continuous Markov process with \(\mathcal{L}_t\) as the generator (in the sense of stochastic processes) can be represented by a SDE with corresponding drift \(b_t\) and diffusion \(\sigma_t\) coefficients. This alternative direction is nicely presented in [2, Chapters 610].
References
[1] Weinan, E., Tiejun Li, and Eric VandenEijnden. Applied stochastic analysis. Vol. 199. American Mathematical Soc., 2021.
[2] Baldi, Paolo. Stochastic calculus. Springer International Publishing, 2017.

Again, I am being extremely loose with questions of regularity: it doesn’t make too much sense to consider \(\mathcal{L}_t\) as an operator over \(C^\infty_c(\mathbb{R}^d)\) and \(\mathcal{L}_t^*\) as an operator over \(\mathcal{M}(\mathbb{R}^d)\). The important thing is that \(\mathcal{L}_t\) morally acts on test functions, and \(\mathcal{L}_t^*\) on probability distributions. ↩

The Wikipedia articles on this subject are a little bit messy currently, there are three different pages titled “Kolmogorov equations”. The relevant one here is https://en.wikipedia.org/w/index.php?title=Kolmogorov_equations_(continuoustime_Markov_chains)&oldid=1156787598. ↩