<h1>Making sense of the Kolmogorov backward equation (for diffusion processes)</h1>
<p><em>I once struggled quite a lot to wrap my head around convex duality. I
wouldn’t go as far as to say I understand it now, but at least I got
used to it, and I feel like it “makes sense”. However I was recently again in a
similar situation, with diffusion processes and specifically the
Kolmogorov backward equation: I could follow its derivation on a formal
level, but I had a hard time understanding what it means. In this
post I write down some calculations that were somewhat
helpful to me, to make sense of it.</em></p>
<p>Most of the content here is taken or extrapolated from the recent
book
[1, Chapters 3 and 8], which I find strikes a nice
balance between concision and clarity. A more rigorous and detailed
presentation can be found in the also very nice book
[2, Chapters 6 and 10]. Unlike for previous posts, I
don’t include a summary of the relevant background; let me just say that
in my experience, formal manipulations of SDEs make intuitive sense even
without knowing any of the theory behind, except for Ito’s formula which
is not obvious at first (but easy to look up,
e.g. in <a href="https://www.youtube.com/watch?v=kQTi2ckWufg">this 3-minute YouTube video</a>).</p>
<h4 id="notation">Notation.</h4>
<p>”\(\nabla\)” denotes gradient and “\(\nabla \cdot\)” denotes divergence. For
two matrices \(A, B \in \mathbb{R}^{d \times d}\),
\(A:B = \sum_{ij} A_{ij} B_{ij} = \mathop{\mathrm{Tr}}(A B^\top)\). For a
matrix field \(A: \mathbb{R}^d \to \mathbb{R}^{d \times d}\),
\(\nabla^2 : A = \sum_{ij} \partial_i \partial_j A_{ij}\).</p>
<p>For \(\varphi \in C_b(\mathbb{R}^d)\) and
\(\mu \in \mathcal{M}(\mathbb{R}^d)\), we write indifferently
\(\left\langle \mu, \varphi \right\rangle\) or \(\mu \cdot \varphi\) or
\(\mu \varphi\) to denote \(\int_{\mathbb{R}^d} \varphi d\mu\). For a
transition kernel \(P\) and a probability distribution \(\mu\), we may write
\(\mu P\) for \(P^* \mu\). For a finite-state time-homogeneous Markov chain
for example, this means that we can represent \(P\) as a square matrix
with \(P_{ij} = \mathbb{P}(X_{k+1} = j | X_k = i)\), probability
distributions as row vectors, and test functions as column vectors; in
particular, \(\mu P^k \varphi = \mathbb{E}_{X_0 \sim \mu}[\varphi(X_k)]\).</p>
<p>We will use the terms “diffusion process” and “solution of a SDE”
interchangeably, following the remark of [1, Sec. 7.3].
This is justified more rigorously by [2, Sec. 9.7-9.9].</p>
<h1 id="sec:setting">Setting and statement of the equations</h1>
<p>Consider a particle whose spatial position \(X_t \in \mathbb{R}^d\)
evolves in time according to the SDE</p>
\[dX_t = b_t(X_t) dt + \sigma_t(X_t) dW_t.\]
<p>I will ignore regularity
issues, so here \(b\) and \(\sigma\) are nice smooth functions, say, with
uniformly bounded derivatives of all orders.</p>
<h4 id="the-forward-equation">The forward equation.</h4>
<p>If \(X_0 \sim \mu_0\), then \(\mu_t = \mathrm{Law}(X_t)\) is a
distributional solution of the PDE, called <em>Fokker-Planck equation</em> or
<em>Kolmogorov Forward Equation</em>,</p>
\[\label{eq:setting:FP_KFE} \tag{KFE}
\partial_t \mu_t = -\nabla \cdot [b_t \mu_t] + \frac{1}{2} \nabla^2 : [\sigma_t \sigma_t^\top \mu_t]
~~~~\text{with initial condition}~~~~
\mu_0.\]
<div class="proof" text="sketched">
<p>By definition of distributional solutions, it
suffices to check that for any \(\varphi \in C^\infty_c(\mathbb{R}^d)\) it
holds</p>
\[\frac{d}{dt} \mathbb{E}\varphi(X_t)
= \mathbb{E}\left[ \nabla \varphi(X_t) \cdot b_t(X_t)
+ \frac{1}{2} \nabla^2 \varphi(X_t) : \sigma_t(X_t) \sigma_t(X_t)^\top \right],\]
<p>which can be shown straightforwardly by computing \(d \varphi(X_t)\)
thanks to Ito’s formula and by taking expectations.</p>
</div>
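<p>For readers who, like me, find a numerical check reassuring: here is a minimal sketch
(NumPy; the drift, diffusion coefficient and test function are arbitrary toy choices of mine, not from [1])
that simulates the SDE by Euler-Maruyama and compares the finite-difference derivative of
\(\mathbb{E}\varphi(X_t)\) with the Monte Carlo estimate of the right-hand side above.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
d, n_paths, dt, T = 2, 100_000, 1e-3, 0.3
sig = 0.3                                    # toy diffusion: sigma_t(x) = sig * I

def b(x):                                    # toy drift: b_t(x) = -x (Ornstein-Uhlenbeck)
    return -x

def phi(x):                                  # smooth test function phi(x1, x2) = sin(x1) cos(x2)
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

def generator_phi(x):                        # b . grad(phi) + 0.5 * sig^2 * Laplacian(phi)
    dphi_dx1 = np.cos(x[:, 0]) * np.cos(x[:, 1])
    dphi_dx2 = -np.sin(x[:, 0]) * np.sin(x[:, 1])
    laplacian = -2.0 * np.sin(x[:, 0]) * np.cos(x[:, 1])
    return b(x)[:, 0] * dphi_dx1 + b(x)[:, 1] * dphi_dx2 + 0.5 * sig**2 * laplacian

X = rng.normal(size=(n_paths, d))            # X_0 ~ mu_0 = standard Gaussian
means, gens = [], []
for _ in range(int(T / dt)):
    means.append(phi(X).mean())              # Monte Carlo estimate of E[phi(X_t)]
    gens.append(generator_phi(X).mean())     # Monte Carlo estimate of E[(L_t phi)(X_t)]
    dW = rng.normal(scale=np.sqrt(dt), size=X.shape)
    X = X + b(X) * dt + sig * dW             # Euler-Maruyama step

# d/dt E[phi(X_t)] should match E[(L_t phi)(X_t)], up to Monte Carlo and discretization error
finite_diff = np.diff(means) / dt
print(np.abs(finite_diff - np.array(gens[:-1])).max())
</code></pre>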
<h4 id="the-backward-equation">The backward equation.</h4>
<p>For any fixed test function \(\varphi \in C^\infty_c\) and final time \(t\),
let</p>
\[\forall s \leq t, \forall y \in \mathbb{R}^d,~
u(y, s) = \mathbb{E}[\varphi(X_t) | X_s = y].\]
<p>Then \(u(y,s)\) is the
unique solution to the PDE, called Kolmogorov Backward Equation,</p>
\[\label{eq:setting:KBE} \tag{KBE}
-\partial_s u_s = \nabla u_s \cdot b_s + \frac{1}{2} \nabla^2 u_s : \sigma_s \sigma_s^\top
~~~~\text{with final condition}~~~~
u(\cdot, t) = \varphi(\cdot).\]
<div class="proof" text="informal">
<p>Let us check that
\(u(y, s) := \mathbb{E}[\varphi(X_t) | X_s = y]\) satisfies
\(\eqref{eq:setting:KBE}\). The final
condition \(u(\cdot, t) = \varphi(\cdot)\) is immediate from the
definition of \(u\). Next consider the process \(u(X_\tau, \tau)\) for
\(0 \leq \tau \leq t\), which by Ito’s formula evolves as</p>
\[du(X_\tau, \tau) = \left[
\partial_s u(X_\tau, \tau)
+ \partial_y u(X_\tau, \tau) \cdot b_\tau(X_\tau)
+ \frac{1}{2} \partial_{yy}^2 u(X_\tau, \tau) : \sigma_\tau(X_\tau) \sigma_\tau(X_\tau)^\top
\right] d\tau
+ [...] dW_\tau.\]
<p>Here \(\partial_s u\) denotes partial derivative of
\(u\) w.r.t. its second variable, and \(\partial_y\), \(\partial_{yy}^2\) are
its partial derivatives w.r.t. its first variable. Integrating over
\(\tau \in [s, r]\) for some fixed \(s\) and \(r \leq t\), and taking
expectations conditioned on \(X_s = y\) which we denote as
\(\mathbb{E}^{y,s} = \mathbb{E}[\cdot | X_s=y]\), we have</p>
\[\mathbb{E}^{y,s}[ u(X_r, r) - u(X_s, s) ]
= \mathbb{E}^{y,s} \int_s^r
\left[
\partial_s u(X_\tau, \tau)
+ \partial_y u(X_\tau, \tau) \cdot b_\tau(X_\tau)
+ \frac{1}{2} \partial_{yy}^2 u(X_\tau, \tau) : \sigma_\tau(X_\tau) \sigma_\tau(X_\tau)^\top
\right] d\tau.\]
<p>Now by definition of
\(u(y, s) = \mathbb{E}[\varphi(X_t) | X_s=y]\), together with the Markov property and the
tower property of conditional expectations, the left-hand side vanishes:</p>
\[\mathbb{E}^{y,s}[ u(X_r, r) ]
= \mathbb{E}\left[ \mathbb{E}[\varphi(X_t) | X_r] ~\middle|~ X_s=y \right]
= \mathbb{E}[\varphi(X_t) | X_s=y]
= u(y,s)
= \mathbb{E}^{y,s}[ u(X_s, s) ].\]
<p>So,
differentiating the above identity w.r.t. \(r\), we have</p>
\[\forall s \leq r \leq t,~
\mathbb{E}^{y,s} \left[
\partial_s u(X_r, r)
+ \partial_y u(X_r, r) \cdot b_r(X_r)
+ \frac{1}{2} \partial_{yy}^2 u(X_r, r) : \sigma_r(X_r) \sigma_r(X_r)^\top
\right]
= 0.\]
<p>In particular evaluating at \(r=s\), we have</p>
\[\forall s \leq t,~
\partial_s u(y, s)
+ \partial_y u(y, s) \cdot b_s(y)
+ \frac{1}{2} \partial_{yy}^2 u(y, s) : \sigma_s(y) \sigma_s(y)^\top
= 0,\]
<p>which is the desired PDE
\(\eqref{eq:setting:KBE}\). (There is a typo in [1, Sec. 8.3]:
they do not introduce a free variable \(r \leq t\), and instead integrate
the SDE followed by \(u(X_\tau, \tau)\) over all of \(\tau \in [s, t]\);
then they say they differentiate w.r.t. \(t\), but here \(t\) was fixed
before even defining \(u\).)</p>
<p><strong>Conversely</strong>, the same calculations show that any (nice and regular enough) solution \(u(y,s)\) of
\(\eqref{eq:setting:KBE}\) must be equal to
\(\mathbb{E}[\varphi(X_t) | X_s=y]\). Indeed, consider the process
\(u(X_\tau, \tau)\); write down the SDE that it follows by Ito’s formula;
integrate it over \(\tau \in [s,t]\) and take expectations conditioned on
\(X_s=y\). This yields</p>
\[\mathbb{E}^{y,s}[ u(X_t, t) - u(X_s, s) ]
= \mathbb{E}^{y,s} \int_s^t
\left[
\partial_s u_\tau
+ \partial_y u_\tau \cdot b_\tau
+ \frac{1}{2} \partial_{yy}^2 u_\tau : \sigma_\tau \sigma_\tau^\top
\right](X_\tau) d\tau
= 0\]
<p>since \(u\) is a solution of
\(\eqref{eq:setting:KBE}\). Hence, by the final condition
\({ u(\cdot, t) = \varphi(\cdot) }\),</p>
\[\mathbb{E}^{y,s}[ u(X_t, t) - u(X_s, s) ]
= \mathbb{E}[\varphi(X_t) | X_s=y]
- u(y,s)
= 0.\]
</div>
<h4 id="the-time-homogeneous-case">The time-homogeneous case.</h4>
<p>Somewhat confusingly, in the case of an autonomous process, i.e., when
\(b_t(x) = b(x)\) and \(\sigma_t(x) = \sigma(x)\) do not depend on time,
there is a different but very similar-looking way to formulate the
Kolmogorov Backward Equation. Fix again a \(\varphi \in C^\infty_c\) and
let</p>
\[\forall t \geq 0,~
\forall x \in \mathbb{R}^d,~
v(x, t) = \mathbb{E}[\varphi(X_t) | X_0 = x].\]
<p>Then \(v(x,t)\) is a
solution to the PDE</p>
\[\label{eq:setting:KBE_homog}
\partial_t v_t = \nabla v_t \cdot b + \frac{1}{2} \nabla^2 v_t : \sigma \sigma^\top
~~~~\text{with initial condition}~~~~
v(\cdot, 0) = \varphi(\cdot).\]
<p>This fact follows from the
Kolmogorov Backward Equation with an appropriate change of variable:</p>
\[\forall 0 \leq s \leq t, \forall y \in \mathbb{R}^d,~~
v(y, t-s) = \mathbb{E}[\varphi(X_{t-s}) | X_0=y] = \mathbb{E}[\varphi(X_t) | X_s=y]\]
<p>by time-homogeneity.</p>
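<p>A quick example to fix ideas (this particular instance is not taken from [1]): for standard
Brownian motion, i.e., \(b = 0\) and \(\sigma = I_d\) so that \(dX_t = dW_t\), the function
\(v(x, t) = \mathbb{E}[\varphi(X_t) | X_0 = x] = \mathbb{E}[\varphi(x + W_t)]\) is the
convolution of \(\varphi\) with the Gaussian density of variance \(t\), and the above PDE
reduces to the heat equation</p>
\[\partial_t v_t = \frac{1}{2} \Delta v_t
~~~~\text{with initial condition}~~~~
v(\cdot, 0) = \varphi(\cdot),\]
<p>which is indeed solved by Gaussian convolution. Likewise, \(\eqref{eq:setting:FP_KFE}\)
reduces to the heat equation for \(\mu_t\).</p>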
<p>The forward and backward equations are visibly connected, which is maybe
not surprising since they both describe the same diffusion process. Our goal in the
next section is to clarify the nature of the
connection.</p>
<hr />
<h1 id="sec:markov">The Markov process point of view</h1>
<p>To sum up:
we consider a diffusion process over \(\mathbb{R}^d\) described by the SDE</p>
\[\label{eq:markov:SDE} \tag{1}
dX_t = b_t(X_t) dt + \sigma_t(X_t) dW_t\]
<p>and we define the associated
Kolmogorov Forward resp. Backward Equations as the PDEs</p>
\[\begin{align}
\label{eq:markov:KFE1} \tag{F1}
\partial_t \mu_t &= -\nabla \cdot [\mu_t b_t] + \frac{1}{2} \nabla^2 : [\mu_t \sigma_t \sigma_t^\top]
~~~~\text{with initial condition}~~~~
\mu_0
\\
\label{eq:markov:KBE1} \tag{B1}
\text{and}~~~~
-\partial_s u_s &= b_s \cdot \nabla u_s + \frac{1}{2} \sigma_s \sigma_s^\top : \nabla^2 u_s
\qquad
~~~~\text{with final condition}~~~~
u(\cdot, t) = \varphi(\cdot).\end{align}\]
<p>We showed above
that, if \(X_0 \sim \mu_0\) then \(\mu_t = \mathrm{Law}(X_t)\) is a solution
of \(\eqref{eq:markov:KFE1}\), and for any fixed
\(\varphi \in C^\infty_c(\mathbb{R}^d)\) and \(t > 0\),
\(u(y,s) = \mathbb{E}[\varphi(X_t) | X_s=y]\) is the unique solution of
\(\eqref{eq:markov:KBE1}\).</p>
<h4 id="the-forward-and-backward-equations-are-adjoints">The forward and backward equations are “adjoints”.</h4>
<p>Let \(\mathcal{L}_t\) be the linear operator from \(C^\infty_c(\mathbb{R}^d)\)
to itself defined by</p>
\[(\mathcal{L}_t \varphi)(x)
= b_t(x) \cdot \nabla \varphi(x)
+ \frac{1}{2} \sigma_t(x) \sigma_t(x)^\top : \nabla^2 \varphi(x),\]
<p>called the <em>infinitesimal generator</em> of the diffusion process. Let
\(\mathcal{L}_t^*\) be its \(L^2(\mathbb{R}^d)\) adjoint, i.e., the operator
from \(\mathcal{M}(\mathbb{R}^d)\) to itself <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> such that
\(\forall \varphi, \forall \mu, \int (\mathcal{L}_t \varphi) d\mu = \int \varphi d(\mathcal{L}_t^* \mu)\).
By explicit computations (integration by parts) one can check it is
given by</p>
\[\mathcal{L}_t^* \mu
= -\nabla \cdot [\mu b_t]
+ \frac{1}{2} \nabla^2 : [\mu \sigma_t \sigma_t^\top].\]
<p>With these
notations, \(\eqref{eq:markov:KFE1}\) and
\(\eqref{eq:markov:KBE1}\) write respectively</p>
\[\begin{align}
\partial_t \mu_t &= \mathcal{L}_t^* \mu_t
~~~~\text{with initial condition}~~~~
\mu_0
\\
\text{and}~~~~
-\partial_s u_s &= \mathcal{L}_s u_s
~~~~\text{with final condition}~~~~
u(\cdot, t) = \varphi(\cdot).\end{align}\]
<p>We have identified the sense in which the forward and backward equations
are connected: their generators (in the sense of PDEs) are adjoints of
each other, up to sign. But this still doesn’t tell me <em>why</em> they are
connected like this... To phrase it differently, it was not clear from
their interpretations as describing the evolutions of
\(\mathrm{Law}(X_t)\) resp. of \(\mathbb{E}[\varphi(X_t) | X_s=y]\), that
\(\eqref{eq:markov:KFE1}\) and
\(\eqref{eq:markov:KBE1}\) should have adjoint generators. Next we
unroll a point of view that makes it obvious that it must be the case.</p>
<h4 id="the-markov-transition-kernels">The Markov transition kernels.</h4>
<p>The solution \(X_t\) of
\(\eqref{eq:markov:SDE}\) is a Markov process, i.e.,
\(\{ X_\tau\}_{\tau>t}\) is independent of \(\{ X_\tau\}_{\tau<t}\)
conditionally on \(X_t\) for all \(t\); this can easily be checked by
inspecting the definition of solutions of SDEs. Let \(\mathcal{P}^{s,t}\)
be the transition kernels of the Markov process \(X_t\), i.e., the operators
such that</p>
\[\forall s \leq t,~
\forall \varphi, \forall x,~
(\mathcal{P}^{s,t} \varphi)(x)
= \mathbb{E}[\varphi(X_t) | X_s = x].\]
<p>By definition their
\(L^2(\mathbb{R}^d)\) adjoints are given by \(\forall \mu,
\left\langle (\mathcal{P}^{s,t})^* \mu, \bullet \right\rangle
= \mathbb{E}_{X_s \sim \mu} [\bullet(X_t)]\). Equivalently and perhaps
more intuitively,</p>
\[\forall s \leq t,~~
X_s \sim \mu_s \implies X_t \sim \mu_s \mathcal{P}^{s,t} = \mu_t\]
<p>(recall that we denote indifferently \(\mu_s \mathcal{P}^{s,t}\) or
\((\mathcal{P}^{s,t})^* \mu_s\)). We can also write this symbolically, in
terms of probability density functions, as</p>
\[\mathcal{P}^{s,t}(y, dx) = \mathbb{P}(X_t \in dx | X_s = y)
= p(x,t | y,s) dx\]
<p>where \(dx\) represents a small volume around \(x\),
so that
\(\mathbb{P}\left( X_t \in B | X_s = y \right)
= \int_B p(x,t | y,s)~ dx\).</p>
<p>With these notations, the Kolmogorov Forward and Backward Equations
\(\eqref{eq:markov:KFE1}\),
\(\eqref{eq:markov:KBE1}\)
write</p>
\[\begin{align}
\text{for any fixed $s$},~
\forall t \geq s,~ ~~~~
\partial_t p(\cdot,t | y,s) &= \mathcal{L}_t^* p(\cdot,t | y,s)
~~~~\text{w/ initial cond.}~~~~
p(\cdot,s | y,s) = \delta_y(\cdot),
\label{eq:markov:KFE2} \tag{F2} \\
\text{for any fixed $t$},~
\forall s \leq t,~ ~~
-\partial_s p(x,t | \cdot,s) &= \mathcal{L}_s p(x,t | \cdot,s)
\quad
~~\text{w/ final cond.}~~~~
p(x,t | \cdot,t) = \delta_x(\cdot).
\label{eq:markov:KBE2} \tag{B2}
\end{align}\]
<p>It takes a few minutes of
focus to check that the above formulas do indeed have the same meaning
as the interpretations we gave for
\(\eqref{eq:markov:KFE1}\) and
\(\eqref{eq:markov:KBE1}\). It is worthwhile to stop and actually
check.
For \(\eqref{eq:markov:KBE1}\)/\(\eqref{eq:markov:KBE2}\), it can be helpful to
start by testing both sides of \(\eqref{eq:markov:KBE2}\) against some fixed \(\varphi \in C^\infty_c\)
(i.e., multiply by \(\varphi(x)\) and integrate w.r.t. \(x\), for each \(s\)).</p>
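<p>For what it's worth, here is how that check might go for the backward equation (this is
just my own unrolling of the hint above): testing \(\eqref{eq:markov:KBE2}\) against
\(\varphi\), and using that \(\mathcal{L}_s\) acts on the variable \(y\) and hence commutes
with integration over \(x\),</p>
\[-\partial_s \int_{\mathbb{R}^d} p(x,t | y,s) \varphi(x) dx
= \left( \mathcal{L}_s \left[ y' \mapsto \int_{\mathbb{R}^d} p(x,t | y',s) \varphi(x) dx \right] \right)(y),\]
<p>i.e., \(-\partial_s u(y,s) = (\mathcal{L}_s u_s)(y)\) for
\(u(y,s) = \int p(x,t | y,s) \varphi(x) dx = \mathbb{E}[\varphi(X_t) | X_s=y]\), which is
exactly \(\eqref{eq:markov:KBE1}\); the final condition \(p(x,t | \cdot,t) = \delta_x(\cdot)\)
likewise integrates to \(u(\cdot, t) = \varphi(\cdot)\).</p>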
<p>With the above reformulations, I’m almost satisfied. Indeed <strong>the above
equations are particular instances of general identities for (“regular enough”) Markov processes</strong>.
It only remains to give an explanation of those general identities.
For any Markov process with transition kernels \(\mathcal{P}^{s,t}\),
note that by definition (of Markov processes
and of the transition kernels), we have the <em>Chapman-Kolmogorov equation</em></p>
\[\label{eq:markov:chapman_kolmo}
\forall s \leq \tau \leq t,~
\mathcal{P}^{s,\tau} \mathcal{P}^{\tau,t} (y,dx)
= \int_{\mathbb{R}^d} \mathcal{P}^{s,\tau}(y, dz) \mathcal{P}^{\tau,t}(z, dx)
= \mathcal{P}^{s,t}(y,dx).\]
<p>The idea is that by differentiating
these identities, one obtains differential equations that characterize
the Markov process. To explain this in a non-confusing way, I find it
helpful to focus on the discrete-space setting.</p>
<h4 id="finite-state-space-heuristic">Finite-state-space heuristic.</h4>
<p>For the duration of this paragraph, pretend that the space
\(\mathbb{R}^d\) is discrete and even finite.
Instead of
\(p(x,t | y,s)\) with \(\int_{\mathbb{R}^d} p(x,t | y,s) dx = 1\), we will write the transition
probabilities \(\mathbb{P}(X_t=x | X_s=y)\) as \(p^{s,t}_{yx}\) with
\(\sum_{x \in \mathbb{R}^d} p^{s,t}_{yx} = 1\). Then we can write the
Chapman-Kolmogorov equation in matrix notation as</p>
\[\label{eq:markov:champan_kolmo_discrete} \tag{2}
\forall s \leq \tau \leq t,~~
\sum_{z \in \mathbb{R}^d} p^{s,\tau}_{yz}~ p^{\tau,t}_{zx}
= p^{s,t}_{yx}, \forall y,x
~~~~\text{i.e.}~~~~
p^{s,\tau} p^{\tau,t}
= p^{s,t}.\]
<p>Assume the following quantities \([Q_s]_{ij}\) exist (see
[1, Sec. 3.5, Eq. (3.11)-(3.12)] for sufficient
conditions in terms of the \(p^{s,t}_{ij}\)):</p>
\[Q_s
= {\left.\frac{\partial}{\partial t} p^{s,t}\right|_{t=s}}
= \lim_{h \downarrow 0} \frac{p^{s,s+h} - p^{s,s}}{h}.\]
<p>The matrix
\(Q_s\) is called the <em>generator</em> of the Markov jump process with
transition kernels \(p^{s,t}\), and it turns out to completely
characterize the Markov process. Now consider the following matrix
identities, which are just rewritings
of \(\eqref{eq:markov:champan_kolmo_discrete}\):</p>
\[\forall s \leq t, \forall h \geq 0,~~~
\begin{cases}
p^{s,t}~ p^{t,t+h} = p^{s, t+h} \\
p^{s,s+h}~ p^{s+h,t} = p^{s,t}.
\end{cases}\]
<p>By differentiating w.r.t. \(h\) and evaluating at \(h=0\),
we obtain</p>
\[\begin{cases}
p^{s,t}~ Q_t = \partial_t p^{s,t} \\
Q_s~ p^{s,t} + p^{s,s} \partial_s p^{s,t} = 0
\end{cases}
~~~~~~\text{i.e.,}~~~~
\begin{cases}
\partial_t p^{s,t} = p^{s,t}~ Q_t \\
-\partial_s p^{s,t} = Q_s~ p^{s,t}
\end{cases}\]
<p>since \(p^{s,s}_{yx} = \mathbb{1}_{y=x}\) is the
identity matrix. These equations are exactly the Kolmogorov Forward and
Backward Equations for Markov jump processes. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> Note the similarity
with the corresponding equations for diffusion
processes \(\eqref{eq:markov:KFE2}\), \(\eqref{eq:markov:KBE2}\)!</p>
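<p>Here is a minimal numerical illustration of these two equations (NumPy/SciPy; the
three-state generator matrix is an arbitrary choice of mine), in the time-homogeneous case
where \(p^{s,t} = e^{(t-s)Q}\), so that everything can be checked against a matrix
exponential.</p>
<pre><code class="language-python">import numpy as np
from scipy.linalg import expm

# Arbitrary generator of a 3-state Markov jump process:
# nonnegative off-diagonal jump rates, rows summing to zero.
Q = np.array([[-1.0,  0.4,  0.6],
              [ 0.5, -0.8,  0.3],
              [ 0.2,  0.7, -0.9]])

def p(s, t):
    # transition matrix p^{s,t}; time-homogeneous, so p^{s,t} = expm((t-s) Q)
    return expm((t - s) * Q)

s, t, h = 0.3, 1.2, 1e-6

# Forward equation:  d/dt p^{s,t} = p^{s,t} Q
forward_lhs = (p(s, t + h) - p(s, t)) / h
print(np.abs(forward_lhs - p(s, t) @ Q).max())        # ~0, up to O(h)

# Backward equation: -d/ds p^{s,t} = Q p^{s,t}
backward_lhs = -(p(s + h, t) - p(s, t)) / h
print(np.abs(backward_lhs - Q @ p(s, t)).max())       # ~0, up to O(h)

# Chapman-Kolmogorov: p^{s,tau} p^{tau,t} = p^{s,t}
tau = 0.7
print(np.abs(p(s, tau) @ p(tau, t) - p(s, t)).max())  # ~0, up to floating point
</code></pre>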
<p>In this post, we started from a SDE and showed that its transition
density function satisfies the PDEs with generators \(\mathcal{L}_t^*\)
resp. \(-\mathcal{L}_s\); we then interpreted those equations as particular instances
of the general Kolmogorov Forward and Backward Equations. One can also go
in the other direction and show that a continuous Markov process with
\(\mathcal{L}_t\) as the generator (in the sense of stochastic processes)
can be represented by a SDE with corresponding drift \(b_t\) and diffusion
\(\sigma_t\) coefficients. This alternative direction is nicely presented
in [2, Chapters 6-10].</p>
<hr />
<p><strong>References</strong></p>
<p>[1] Weinan, E., Tiejun Li, and Eric Vanden-Eijnden. <a href="https://www.ams.org/books/gsm/199/gsm199-endmatter.pdf"><em>Applied stochastic analysis</em></a>. Vol. 199. American Mathematical Soc., 2021.</p>
<p>[2] Baldi, Paolo. <a href="https://link.springer.com/book/10.1007/978-3-319-62226-2"><em>Stochastic calculus</em></a>. Springer International Publishing, 2017.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Again, I am being extremely loose with questions of regularity: it
doesn’t make too much sense to consider \(\mathcal{L}_t\) as an
operator over \(C^\infty_c(\mathbb{R}^d)\) and \(\mathcal{L}_t^*\) as an
operator over \(\mathcal{M}(\mathbb{R}^d)\). The important thing is
that \(\mathcal{L}_t\) morally acts on test functions, and
\(\mathcal{L}_t^*\) on probability distributions. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The Wikipedia articles on this subject are a little bit messy
currently, there are three different pages titled “Kolmogorov
equations”. The relevant one here is
<a href="https://en.wikipedia.org/w/index.php?title=Kolmogorov_equations_(continuous-time_Markov_chains)&oldid=1156787598">https://en.wikipedia.org/w/index.php?title=Kolmogorov_equations_(continuous-time_Markov_chains)&oldid=1156787598</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Regularized linear models and the Fenchel-Rockafellar duality theorem (III): Classification with the exponential loss</h1>
<blockquote>
<p>This is the third of a series of posts on optimization of regularized linear models through the lens of duality.
See the first one <a href="/math/2021/08/27/FRDT_generalities.html">here</a>.</p>
</blockquote>
<ul id="markdown-toc">
<li><a href="#derivation-of-the-two-variants-of-gradient-step" id="markdown-toc-derivation-of-the-two-variants-of-gradient-step">Derivation of the two variants of gradient step</a> <ul>
<li><a href="#the-dual-accelerated-method" id="markdown-toc-the-dual-accelerated-method">The dual accelerated method.</a></li>
</ul>
</li>
<li><a href="#beyond-ell_2-algorithms-for-ell_1-regularized-classification-and-acceleration" id="markdown-toc-beyond-ell_2-algorithms-for-ell_1-regularized-classification-and-acceleration">Beyond \(\ell_2\): algorithms for \(\ell_1\)-regularized classification, and acceleration</a> <ul>
<li><a href="#fista-for-ell_1-penalized-classification" id="markdown-toc-fista-for-ell_1-penalized-classification">(F)ISTA for \(\ell_1\)-penalized classification</a> <ul>
<li><a href="#primal-acceleration-fista" id="markdown-toc-primal-acceleration-fista">Primal acceleration: FISTA.</a></li>
</ul>
</li>
<li><a href="#adaboost-for-implicit-ell_1-regularized-classification" id="markdown-toc-adaboost-for-implicit-ell_1-regularized-classification">AdaBoost for implicit \(\ell_1\)-regularized classification</a> <ul>
<li><a href="#dually-accelerated-adaboost" id="markdown-toc-dually-accelerated-adaboost">Dually accelerated AdaBoost</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<p>We will continue with the notation from the previous posts, in particular:</p>
<ul>
<li>the primal problem is
\(\label{eq:FRDT_primal} \tag{P}
\min_{w \in \mathcal{W}} \Psi(w) + \mathcal{L}(V w) =: P(w)\)</li>
<li>the dual problem is
\(\label{eq:FRDT_dual} \tag{D}
\max_{a \in \mathcal{Y}^*} - \Psi^*(-V^* a) - \mathcal{L}^*(a) =: D(a).\)</li>
</ul>
<p>In a series of recent works, Ziwei Ji and Matus Telgarsky studied the
optimization of linear models for classification with the exponential
loss (and exponential-like losses), making use of duality arguments
<a href="http://arxiv.org/abs/2107.00595">(Ji, Srebro and Telgarsky, 2021)</a>. Personally I found their
derivations a bit heavy on duality black magic, so I spent a bit of time
to understand what was going on. Unsurprisingly, their notion of dual
variable is exactly the same as the variable \(a\) of the dual problem
\(\eqref{eq:FRDT_dual}\). But it actually took me a while to realize
that, for a relatively subtle reason.</p>
<p>In this section, we adopt some of the notation from Ziwei Ji and Matus
Telgarsky’s papers, on top of the generic ones already used so far.</p>
<ul>
<li>
<p>Let \(\ell(u) = \exp(u)\) be the exponential loss.</p>
</li>
<li>
<p>We may assume WLOG that \(y^{\text{tgt}}_i = -1\) for all \(i\), since
we may transform the dataset \((\phi(x_i), y^{\text{tgt}}_i)_i\) into
the equivalent dataset \((z_i, -1)_i\) with
\(z_i = -y^{\text{tgt}}_i \phi(x_i)\). So the data-fitting term
\(\mathcal{L}(y) = \sum_i \ell(-y^{\text{tgt}}_i y_i)\) is simply
\(\mathcal{L}(y) = \sum_i \ell(y_i)\).</p>
</li>
<li>
<p>The (unnormalized) empirical risk of a parameter \(w\) is defined as
\(\mathcal{R}(w) = \sum_{i=1}^n \ell(\left\langle w, z_i \right\rangle)\).</p>
</li>
</ul>
<p>For classification tasks, it is common to consider as data-fitting term</p>
\[\mathcal{L}(y) = \sum_{i=1}^n \ell(y_i).\]
<p>A common trick when
analyzing learning algorithms for classification exploits the fact that \(\ell\) is
strictly increasing: we may use a different choice for
the data-fitting term, namely
\(\widetilde{\mathcal{L}}= \ell^{-1} \circ \mathcal{L}\), i.e.,</p>
\[\widetilde{\mathcal{L}}(y) = \ell^{-1} \left( \sum_{i=1}^n \ell(y_i) \right).\]
<p>Plus, for \(\ell = \exp\), \(\widetilde{\mathcal{L}}\) is just the
log-sum-exp function
\(\widetilde{\mathcal{L}}(y) = \log \sum_{i=1}^n e^{y_i}\), which is
convex. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<p>It turns out that those two seemingly equivalent choices lead to
slightly different optimization algorithms, with significantly different
convergence speeds <a href="https://arxiv.org/abs/1906.04540">(Ji and Telgarsky, 2020)</a>.</p>
<ol>
<li>
<p>Associated with the choice of \(\mathcal{L}\) is the vanilla gradient
step</p>
\[w_{t+1} = w_t - \eta_t \nabla \mathcal{R}(w_t).\]
</li>
<li>
<p>Associated with the choice of \(\widetilde{\mathcal{L}}\) is the
normalized gradient step</p>
\[w_{t+1} = w_t - \eta_t \frac{\nabla \mathcal{R}(w_t)}{\mathcal{R}(w_t)}.\]
</li>
</ol>
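<p>As a concrete illustration of these two update rules, here is a minimal NumPy sketch
(the data, step size and iteration count are arbitrary choices of mine, just for
illustration):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
Z = rng.normal(size=(n, d))          # rows are z_i = -y_i^tgt * phi(x_i)

def risk(w):                         # R(w) = sum_i exp(w . z_i)
    return np.exp(Z @ w).sum()

def grad_risk(w):                    # grad R(w) = sum_i exp(w . z_i) z_i
    return Z.T @ np.exp(Z @ w)

eta = 0.01
w_vanilla = np.zeros(d)
w_normalized = np.zeros(d)
for t in range(1000):
    # 1. vanilla gradient step on R(w) = L(Vw)
    w_vanilla = w_vanilla - eta * grad_risk(w_vanilla)
    # 2. normalized gradient step, i.e. gradient step on log R(w) = L~(Vw)
    w_normalized = w_normalized - eta * grad_risk(w_normalized) / risk(w_normalized)

print(risk(w_vanilla), risk(w_normalized))
</code></pre>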
<h2 id="derivation-of-the-two-variants-of-gradient-step">Derivation of the two variants of gradient step</h2>
<p>Let \(\Psi(w)\) be a convex regularizer. Let us naively write down the update
rules (from the previous post) for the saddle-point formulation of the optimization problems
\(\min_w \Psi(w) + \mathcal{L}(Vw)\) and
\(\min_w \Psi(w) + \widetilde{\mathcal{L}}(Vw)\). Note that:</p>
<ul>
<li>
<p>Since \(\mathcal{L}(Vw) = \mathcal{R}(w)\),</p>
\[\partial_w \mathcal{L}(Vw)
= V^* \partial \mathcal{L}(Vw)
= \nabla \mathcal{R}(w).\]
</li>
<li>
<p>Since \(\ell^{-1}(v) = \log(v)\) and so
\(\widetilde{\mathcal{L}}= \log \circ \mathcal{L}\),</p>
\[\partial_w \widetilde{\mathcal{L}}(Vw)
= V^* \partial \widetilde{\mathcal{L}}(Vw)
= \frac{\nabla \mathcal{R}(w)}{\mathcal{R}(w)}.\]
</li>
</ul>
<p>Now consider using the mix-and-match scheme from the previous post, with gradient descent steps for
\(w_{t+1}\) and fully-optimizing for \(a_{t+1}\). We get the update rule</p>
\[w_{t+1}
= w_t - \eta_t \left[ \partial \Psi(w_t) + V^* a_t \right]
= w_t - \eta_t V^* \partial \mathcal{L}(Vw_t) - \eta_t \partial \Psi(w_t)\]
<p>and similarly with \(\widetilde{\mathcal{L}}\). Plugging in the values of
\(V^* \partial \mathcal{L}(Vw)\) and
\(V^* \partial \widetilde{\mathcal{L}}(Vw)\), we see that we get almost
exactly the vanilla and normalized gradient steps from above; the only
difference is that we get an extra term \(- \eta_t \partial \Psi(w_t)\).
When \(\Psi = \frac{\lambda}{2} \left\lVert \cdot \right\rVert_2^2\), then,
as discussed previously, a cheap heuristic for implicit regularization (i.e.,
\(\lambda \to 0\)) is to simply remove that extra term.</p>
<p>It looks like we didn’t do anything else than write down the classical
primal gradient descent steps on the unregularized losses
\(\mathcal{L}(Vw)\) and \(\widetilde{\mathcal{L}}(Vw)\). That is true. The
advantage of invoking the dual space in this context is that it allows a
finer convergence analysis than if we only stay in the primal
<a href="https://arxiv.org/abs/1906.04540">(Ji and Telgarsky, 2020)</a>. It also leads naturally to a dual accelerated
method, discussed next, that would otherwise appear as utter magic. It
might even allow to divinate yet other funky update rules, by using
other choices for the dual update, or by replacing the exponential by
some other surrogate loss.</p>
<h4 id="the-dual-accelerated-method">The dual accelerated method.</h4>
<p>In <a href="http://arxiv.org/abs/2107.00595">(Ji, Srebro and Telgarsky, 2021)</a>, they propose a dual-accelerated method for the same
problem (Algorithm 1 of the paper). To present it would require a
discussion of accelerated mirror descent, which would take us a bit far.
Let us only say that their method is essentially just a variant of what
we called the “fully dual approach” <a href="/math/2021/09/03/FRDT_zoo_primal_dual.html#duality-gap-formulation-and-fully-dual-approach-the-frank-wolfe-algorithm">last time</a>, with mirror descent replaced by
a form of accelerated mirror descent.</p>
<p>Interestingly, their new method can also be interpreted as an instance
of the general mix-and-match scheme from the previous post, with what seems to be an
unusual form of accelerated gradient descent for \(w_{t+1}\). However this
point of view is not the one they used to derive and analyze their
method. I find it interesting, and pretty confusing, that a method
derived by acceleration in the dual can be interpreted as a primal-dual
method with acceleration in the primal.</p>
<h2 id="beyond-ell_2-algorithms-for-ell_1-regularized-classification-and-acceleration">Beyond \(\ell_2\): algorithms for \(\ell_1\)-regularized classification, and acceleration</h2>
<h3 id="fista-for-ell_1-penalized-classification">(F)ISTA for \(\ell_1\)-penalized classification</h3>
<p>Consider the same optimization problem as before:
\(\min_w \Psi(w) + \widetilde{\mathcal{L}}(Vw)\), this time with the
choice of regularizer \(\Psi(w) = \lambda \left\lVert w \right\rVert_1\).
Consider using the mix-and-match scheme from the previous post, fully-optimizing for \(a_{t+1}\) and
taking proximal gradient descent steps for \(w_{t+1}\). We get the update rule</p>
\[\begin{aligned}
a_{t+1} &= \nabla \widetilde{\mathcal{L}}(Vw_t) \\
w_{t+1}
&= \mathop{\mathrm{prox}}_{\tau \Psi}(w_t - \tau V^* a_{t+1})\end{aligned}\]
<p>Since \(\Psi = \lambda \left\lVert w \right\rVert_1\), this is simply the
ISTA algorithm applied to \(\widetilde{\mathcal{L}}(Vw)\). <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
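<p>To make this concrete, here is a minimal sketch of that ISTA loop (soft-thresholding is
the proximal operator of \(\tau \lambda \left\lVert \cdot \right\rVert_1\); the data and
constants are arbitrary choices of mine):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 20
V = rng.normal(size=(n, d))                  # evaluation operator (rows are the z_i)
lam, tau, n_iters = 0.1, 0.1, 500

def softmax_weights(y):                      # a = grad of log-sum-exp at y = V w
    e = np.exp(y - y.max())                  # shift for numerical stability
    return e / e.sum()

def soft_threshold(x, thr):                  # prox of thr * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

w = np.zeros(d)
for t in range(n_iters):
    a = softmax_weights(V @ w)               # dual update: a_{t+1} = grad L~(V w_t)
    w = soft_threshold(w - tau * (V.T @ a), tau * lam)   # primal (proximal gradient) update

print(np.count_nonzero(w))                   # the iterate is typically sparse
</code></pre>
<p>FISTA, discussed next, keeps the same two building blocks and adds an extrapolation
variable \(\overline{\gamma}_t\) on top of this loop.</p>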
<h4 id="primal-acceleration-fista">Primal acceleration: FISTA.</h4>
<p>We now instead take accelerated proximal gradient descent steps in the primal:</p>
\[\begin{aligned}
a_{t+1} &= \nabla \widetilde{\mathcal{L}}(V \overline{\gamma}_t) \\
w_{t+1} &= \mathop{\mathrm{prox}}_{\tau \Psi} \left( \overline{\gamma}_t - \tau V^* a_{t+1} \right) \\
\overline{\gamma}_{t+1} &= w_{t+1} + \theta (w_{t+1} - w_t)\end{aligned}\]
<p>which reduces to</p>
\[\begin{aligned}
w_{t+1} &= \mathop{\mathrm{prox}}_{\tau \Psi} \left( \overline{\gamma}_t - \tau
\left.\nabla_w \widetilde{\mathcal{L}}(V w)\right|_{\overline{\gamma}_t}
\right) \\
\overline{\gamma}_{t+1} &= w_{t+1} + \theta (w_{t+1} - w_t)\end{aligned}\]
<p>Notice that there is another way to accelerate, by updating \(w_{t+1}\)
starting from \(w_t\) instead of \(\overline{\gamma}_t\):</p>
\[\begin{aligned}
a_{t+1} &= \nabla \widetilde{\mathcal{L}}(V \overline{\gamma}_t) \\
w_{t+1} &= \mathop{\mathrm{prox}}_{\tau \Psi} \left( w_t - \tau V^* a_{t+1} \right) \\
\overline{\gamma}_{t+1} &= w_{t+1} + \theta (w_{t+1} - w_t)\end{aligned}\]
<p>which reduces to</p>
\[\begin{aligned}
w_{t+1} &= \mathop{\mathrm{prox}}_{\tau \Psi} \left( w_t - \tau
\left.\nabla_w \widetilde{\mathcal{L}}(V w)\right|_{\overline{\gamma}_t}
\right) \\
\overline{\gamma}_{t+1} &= w_{t+1} + \theta (w_{t+1} - w_t)\end{aligned}\]
<p>This method can be viewed as the Chambolle-Pock algorithm with
\(\sigma = +\infty\).</p>
<h3 id="adaboost-for-implicit-ell_1-regularized-classification">AdaBoost for implicit \(\ell_1\)-regularized classification</h3>
<p>It is well-known that AdaBoost results in \(\ell_1\)-margin maximization
<a href="https://arxiv.org/abs/2105.02083">(Chinot, Kuchelmeister, Löffler and van de Geer, 2021)</a>. In this paragraph, we heuristically recover
that fact, by interpreting AdaBoost as (almost) an instance of an
algorithm previously derived in the framework of FRDT.</p>
<p>We consider AdaBoost as stated in Algorithm 1 of
<a href="https://arxiv.org/abs/2105.02083">(Chinot, Kuchelmeister, Löffler and van de Geer, 2021)</a>. <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> With our notation, one can check that the
algorithm can be formulated as:</p>
\[\begin{aligned}
w_0 &= 0 \\
a_t &= \nabla \widetilde{\mathcal{L}}(V w_t) \\
u_t &= \left\lVert V^* a_t \right\rVert_\infty \partial \left\lVert \cdot \right\rVert_\infty(-V^* a_t)
= \partial \frac{1}{2} \left\lVert \cdot \right\rVert_\infty^2 (-V^* a_t)
= \partial \left( \frac{1}{2} \left\lVert \cdot \right\rVert_1^2 \right)^* (-V^* a_t) \\
w_{t+1} &= w_t + \eta u_t\end{aligned}\]
<p>Denote
\(\varphi: \left[ \mathbb{R}\to \mathbb{R}, x \mapsto \frac{x^2}{2} \right]\)
and \(\psi(w) = \varphi(\left\lVert w \right\rVert_1)\). In the above
algorithm, the equation for \(u_t\) can be written as
\(u_t = \partial \psi^*(-V^* a_t)\), and more generally we have that for
even and convex \(\varphi\) <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>
\[\begin{gathered}
\psi^* = \varphi^* \circ \left\lVert \cdot \right\rVert_\infty, \\
u_t = \partial \psi^*(-V^* a_t)
= (\varphi^*)'(\left\lVert -V^* a_t \right\rVert_\infty)~ \partial \left\lVert \cdot \right\rVert_\infty(-V^* a_t).\end{gathered}\]
<p>So by using different choices for the scalar mapping \(\varphi\), we
obtain different choices for the adaptive stepsize. We may expect
AdaBoost to have similar regularization behavior for all of them.</p>
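<p>In code, the update \(u_t\) above amounts to selecting the coordinate of \(-V^* a_t\)
with largest magnitude (a subgradient of \(\left\lVert \cdot \right\rVert_\infty\) at that
point), scaled by \(\left\lVert V^* a_t \right\rVert_\infty\). A minimal sketch (the toy
data and the fixed stepsize \(\eta\) are arbitrary choices of mine):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 20
V = rng.normal(size=(n, d))                   # evaluation operator / feature matrix
eta, n_iters = 0.1, 200

def softmax_weights(y):                       # grad of log-sum-exp, i.e. grad L~ at y
    e = np.exp(y - y.max())
    return e / e.sum()

w = np.zeros(d)
for t in range(n_iters):
    a = softmax_weights(V @ w)                # a_t = grad L~(V w_t)
    g = -(V.T @ a)                            # -V^* a_t
    j = np.argmax(np.abs(g))                  # coordinate achieving the sup norm of g
    u = np.zeros(d)
    u[j] = g[j]                               # ||g||_inf times the subgradient sign(g_j) e_j
    w = w + eta * u                           # additive AdaBoost-style step

print(w)
</code></pre>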
<p>Note that AdaBoost is thus strongly reminiscent of the Frank-Wolfe-like
method obtained by what we called the “<a href="/math/2021/09/03/FRDT_zoo_primal_dual.html#duality-gap-formulation-and-fully-dual-approach-the-frank-wolfe-algorithm">fully dual</a> approach”:</p>
\[\begin{aligned}
w_0, a_0 & ~\text{such that}~ a_0 \in \partial \widetilde{\mathcal{L}}(Vw_0) \\
a_t &= \nabla \widetilde{\mathcal{L}}(V w_t) \\
w_{t+1} &= (1-\eta) w_t + \eta \frac{1}{\lambda} \nabla \psi^*(-V^* a_t)\end{aligned}\]
<p>with \(\lambda \to 0\). The only difference is that AdaBoost updates
\(w_{t+1}\) from \(w_t\) and \(u_t\) via an additive step instead of a convex
combination. However since \(\lambda \to 0\), the update
\(\eta \frac{1}{\lambda} \nabla\psi^*(-V^* a_t)\) can be expected to have
large magnitude so that the \(-\eta w_t\) term makes no big difference
anyway.</p>
<h4 id="dually-accelerated-adaboost">Dually accelerated AdaBoost</h4>
<p>Armed with the above almost-interpretation of AdaBoost as a previously
derived method, we may derive an accelerated version of AdaBoost. This
would require a discussion of accelerated mirror descent, which would
take us a bit far. Let us only point out that all the necessary
ingredients are contained in Appendix B of <a href="http://arxiv.org/abs/2107.00595">(Ji, Srebro and Telgarsky, 2021)</a>. Namely, I think
the only adaptation needed is to replace \(-V^* a_t\) (\(-Z^\top q_t\) in
their notation) by \(u_t = \partial \psi^*(-V^* a_t)\) everywhere in their
Algorithm 1.</p>
<p>In fact, I expect that deriving and obtaining guarantees for fast
\(\ell_1\)-margin maximization is a very straightforward task, by making
the appropriate adaptations in the proofs of that paper.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Beyond the exponential loss, the same trick can be applied for
other choices of surrogate loss \(\ell\). A crucial condition for the
trick is that \(\widetilde{\mathcal{L}}\) must be convex; additional
desirable conditions are described in Assumption 1.2 of
<a href="https://arxiv.org/abs/1906.04540">(Ji and Telgarsky, 2020)</a>, where they also give sufficient
conditions in their Lemma 5.2. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The subtlety that confused me for a while, is that choosing
\(\mathcal{L}\) vs. \(\widetilde{\mathcal{L}}\) as the data-fitting term
leads to different notions of dual variable,
\(a = \nabla \mathcal{L}(Vw)\)
vs. \(\tilde{a}= \nabla \widetilde{\mathcal{L}}(Vw) = (\ell^{-1})'(\mathcal{L}(Vw))~ a\).
I initially only had the choice of \(\mathcal{L}\) in mind, so as I
stared at the derivations of <a href="https://arxiv.org/abs/1906.04540">(Ji and Telgarsky, 2020)</a>, I could not
understand why they considered renormalizing the stepsize by
\((\ell^{-1})'(\mathcal{L}(Vw_t))\) in the dual. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p><a href="https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/">https://blogs.princeton.edu/imabandit/2013/04/11/orf523-ista-and-fista/</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Our discussion extends immediately to a number of variants of
AdaBoost: logistic instead of exponential loss, and various choices
of adaptive stepsize (see the paragraph just below Algorithm 1 in
<a href="https://arxiv.org/abs/2105.02083">(Chinot, Kuchelmeister, Löffler and van de Geer, 2021)</a>). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>See Example 13.8 in the book <em>Convex Analysis and Monotone Operator Theory in Hilbert Spaces</em> by Bauschke and Combettes, 2017. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Regularized linear models and the Fenchel-Rockafellar duality theorem (II): A zoo of primal-dual methods</h1>
<blockquote>
<p>This is the second of a series of posts on optimization of regularized linear models through the lens of duality.
See the first one <a href="/math/2021/08/27/FRDT_generalities.html">here</a>.</p>
</blockquote>
<ul id="markdown-toc">
<li><a href="#saddle-point-formulation-mix-and-match-primal-and-dual-updates" id="markdown-toc-saddle-point-formulation-mix-and-match-primal-and-dual-updates">Saddle-point formulation: “mix-and-match” primal and dual updates</a> <ul>
<li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
</ul>
</li>
<li><a href="#duality-gap-formulation-and-fully-dual-approach-the-frank-wolfe-algorithm" id="markdown-toc-duality-gap-formulation-and-fully-dual-approach-the-frank-wolfe-algorithm">Duality-gap formulation and fully dual approach: the Frank-Wolfe algorithm</a> <ul>
<li><a href="#a-trick-to-enforce-the-wrong-kkt-condition" id="markdown-toc-a-trick-to-enforce-the-wrong-kkt-condition">A trick to enforce the “wrong” KKT condition</a></li>
<li><a href="#relation-to-frank-wolfe" id="markdown-toc-relation-to-frank-wolfe">Relation to Frank-Wolfe</a></li>
<li><a href="#equivalence-to-a-fully-dual-approach" id="markdown-toc-equivalence-to-a-fully-dual-approach">Equivalence to a fully dual approach</a></li>
</ul>
</li>
</ul>
<p>We will continue with the notation from last time, in particular:</p>
<ul>
<li>the primal problem is
\(\label{eq:FRDT_primal} \tag{P}
\min_{w \in \mathcal{W}} \Psi(w) + \mathcal{L}(V w) =: P(w)\)</li>
<li>the dual problem is
\(\label{eq:FRDT_dual} \tag{D}
\max_{a \in \mathcal{Y}^*} - \Psi^*(-V^* a) - \mathcal{L}^*(a) =: D(a).\)</li>
<li>
<p>the KKT conditions are
\(\label{eq:FRDT_KKT_Psi} \tag{KKT_Psi}
w \in \partial \Psi^*(-V^* a)
~~\text{i.e}~~
-V^* a \in \partial \Psi(w)\)</p>
\[\label{eq:FRDT_KKT_L} \tag{KKT_L}
a \in \partial \mathcal{L}(V w)
~~\text{i.e}~~
V w \in \partial \mathcal{L}^*(a)\]
</li>
</ul>
<h2 id="saddle-point-formulation-mix-and-match-primal-and-dual-updates">Saddle-point formulation: “mix-and-match” primal and dual updates</h2>
<p>The FRDT essentially tells us that the problem of fitting a regularized
linear model to data, the minimization problem
\(\eqref{eq:FRDT_primal}\), can be formulated as a saddle-point
problem:</p>
\[\min_{w \in \mathcal{W}} \max_{a \in \mathcal{Y}^*}~
F(w,a)
:= \Psi(w) + \left\langle Vw, a \right\rangle - \mathcal{L}^*(a).\]
<p>Let us naively think about how to iteratively solve that saddle-point
problem; that is, how to choose an update rule for the joint variable
\((w_t, a_t)\). I can think of four reasonable update rules for \(w_{t+1}\),
given \((w_t,a_t)\):</p>
<ul>
<li>
<p><strong>Fully optimize for fixed \(a\):</strong> By definition, for a fixed value
of \(a\), the \(\mathop{\mathrm{arg\,min}}\) of the objective over \(w\)
is computable in closed form as</p>
\[\mathop{\mathrm{arg\,min}}_{w'} F(w',a)
= \mathop{\mathrm{arg\,min}}_{w'} \Psi(w') + \left\langle w', V^* a \right\rangle
= \mathop{\mathrm{arg\,max}}_{w'} \left\langle w', -V^* a \right\rangle - \Psi(w')
= \partial \Psi^*(-V^* a).\]
<p>So we may take as update rule:</p>
\[w_{t+1} \in \partial \Psi^*(-V^* a_t)\]
<p>(This can be interpreted as enforcing the condition
\(\eqref{eq:FRDT_KKT_Psi}\) throughout the optimization procedure.)
However, this update rule may not be computationally feasible.</p>
</li>
<li>
<p><strong>Gradient descent step:</strong> Instead of fully optimizing, we may take
one (or several) (sub)gradient descent step(s) for
\(\mathop{\mathrm{arg\,min}}_{w'} F(w',a_t)\) starting from \(w_t\).
This gives the update rule:</p>
\[w_{t+1} \in w_t - \eta_t \partial_w F(w_t,a_t)
= w_t - \eta_t \left[ \partial \Psi(w_t) + V^* a_t \right]\]
</li>
<li>
<p><strong>Mirror descent step using \(\Psi\):</strong> Instead of gradient descent,
we may take one step of the other basic optimization primitive that
is mirror descent. A natural candidate for the link function is
\(\Psi\) (assuming it is strictly convex and differentiable
everywhere), yielding the update rule</p>
\[\nabla \Psi(w_{t+1}) = \nabla \Psi(w_t) - \eta_t \left[ \nabla \Psi(w_t) + V^* a_t \right]\]
</li>
<li>
<p><strong>Proximal gradient step:</strong> Instead of mirror descent, we may take
one step of the third basic optimization primitive that is proximal
gradient descent. It is arguably more natural than mirror descent
since the objective is composite. This gives the update rule</p>
\[w_{t+1} = \mathrm{prox}_{\eta_t \Psi} \left(
w_t - \eta_t V^* a_t
\right)\]
</li>
</ul>
<p>As for update rules for \(a_{t+1}\) given \((w_t,a_t)\), the four same ideas
apply.</p>
<ul>
<li>
<p><strong>Fully optimize for fixed \(w\):</strong>
\(a_{t+1} \in \partial \mathcal{L}(V w_t)\)</p>
</li>
<li>
<p><strong>Gradient descent step:</strong>
\(a_{t+1} \in a_t + \sigma_t \left[ - \partial \mathcal{L}^*(a_t) + V w_t \right]\)</p>
</li>
<li>
<p><strong>Mirror descent step using \(\mathcal{L}^*\):</strong>
\(\nabla \mathcal{L}^*(a_{t+1}) = \nabla \mathcal{L}^*(a_t) + \sigma_t \left[ - \nabla \mathcal{L}^*(a_t) + V w_t \right]\)</p>
</li>
<li>
<p><strong>Proximal gradient step:</strong>
\(a_{t+1} = \mathrm{prox}_{\sigma_t \mathcal{L}^*} \left(
a_t + \sigma_t V w_t
\right)\)</p>
</li>
</ul>
<p>Thus, this naive reasoning gives a set of optimization schemes that one
can try: simply mix-and-match the choice of update rules for \(w_{t+1}\)
and for \(a_{t+1}\), and apply the updates alternatingly. By alternating
updates we mean: \(w_{t+1} = \text{Update}_w(w_t,a_t)\),
\(a_{t+1} = \text{Update}_a(w_{t+1},a_t)\). One can even consider using
joint updates: \(w_{t+1} = \text{Update}_w(w_t,a_t)\),
\(a_{t+1} = \text{Update}_a(w_t,a_t)\).</p>
<h3 id="examples">Examples</h3>
<ul>
<li>
<p><strong>Fully-optimizing in the dual recovers vanilla primal methods.</strong>
Indeed if we substitute \(a_t\) by \(\partial \mathcal{L}(V w_t)\) in the
primal update rules, then the term \(V^* a_t\) becomes</p>
\[V^* \partial \mathcal{L}(V w_t)
= \left.
\partial_w \mathcal{L}(V w)
\right|_{w_t}\]
<p>the subgradient of the data-fitting term w.r.t the primal variable.</p>
</li>
<li>
<p><strong>Proximal gradient steps in the dual.</strong>
The algorithm consisting of alternating proximal gradient steps both for
\(w_{t+1}\) and for \(a_{t+1}\), is called the Arrow-Hurwicz method.
The well-known Chambolle-Pock algorithm
can be interpreted as a fancier version of this scheme, whereby proximal
gradient steps for \(a_{t+1}\) are alternated with a form of accelerated
proximal gradient for \(w_{t+1}\): <a href="https://hal.archives-ouvertes.fr/hal-00490826/document">(Chambolle and Pock, 2011)</a></p>
\[\begin{aligned}
a_{t+1} &= \mathrm{prox}_{\sigma \mathcal{L}^*} \left( a_t + \sigma V {\overline{w}}_t \right) \\
w_{t+1} &= \mathrm{prox}_{\tau \Psi} \left( w_t - \tau V^* a_{t+1} \right) \\
{\overline{w}}_{t+1} &= w_{t+1} + \theta (w_{t+1} - w_t)\end{aligned}\]
<p>and the parameters \(\sigma, \tau, \theta\) can further be made to depend
on \(t\). (A minimal code sketch of this scheme is given right after this list.)</p>
<p>As a second example, the proximal dual coordinate ascent algorithm
proposed in <a href="https://arxiv.org/abs/2003.13807">(Raj and Bach, 2021)</a>, and its accelerated variant, are also
instances of the scheme explained above. There, the primal variables are
fully optimized (\(w_t = \nabla \Psi^*(-V^* a_t)\)), and the dual
variables are updated by a proximal gradient step. That paper focuses on
the specific case of min-\(\Psi\)-interpolation, so \(\mathcal{L}^*\)
consists in a sum of indicator functions, and they use an explicit
expression for \(\mathrm{prox}_{\mathcal{L}^*}\).</p>
</li>
<li>
<p><strong>Kernel methods (RKHS).</strong>
For (Hilbert) kernel methods, \(\mathcal{W}\) is the RKHS and the
regularizer is
\(\Psi(w) = \frac{\lambda}{2} \left\lVert w \right\rVert^2\). So
\(\partial \Psi^*(-V^* a_t) = -\lambda V^* a_t\), and one may
fully-optimize in the primal and run e.g gradient descent entirely in
the dual. In function space this corresponds to parametrizing the model
as \(f = \sum_{i=1}^n a_i k(\cdot, x_i)\) and running gradient descent on
the coefficients \(a_i\).</p>
</li>
</ul>
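<p>Coming back to the Chambolle-Pock example above, here is a minimal NumPy sketch for a
particular case I picked for illustration, \(\Psi = \lambda \left\lVert \cdot \right\rVert_1\)
and \(\mathcal{L}(y) = \frac{1}{2} \left\lVert y - y^{\text{tgt}} \right\rVert_2^2\) (toy
random data; the step sizes are chosen so that
\(\sigma \tau \left\lVert V \right\rVert^2 \leq 1\)):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 60
V = rng.normal(size=(n, d)) / np.sqrt(n)          # evaluation operator (data matrix)
y_tgt = rng.normal(size=n)
lam = 0.05

op_norm = np.linalg.norm(V, 2)                    # operator norm of V
tau = sigma = 0.9 / op_norm                       # so that sigma * tau * op_norm**2 = 0.81
theta = 1.0

def soft_threshold(x, thr):                       # prox of thr * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def prox_L_conj(v, s):                            # prox of s * L^*, with L(y) = 0.5 ||y - y_tgt||^2
    return (v - s * y_tgt) / (1.0 + s)            # since L^*(a) = 0.5 ||a||^2 + a . y_tgt

w = np.zeros(d)
w_bar = w.copy()
a = np.zeros(n)
for t in range(2000):
    a = prox_L_conj(a + sigma * (V @ w_bar), sigma)          # dual proximal step
    w_new = soft_threshold(w - tau * (V.T @ a), tau * lam)   # primal proximal step
    w_bar = w_new + theta * (w_new - w)                      # extrapolation
    w = w_new

print(np.count_nonzero(w))
</code></pre>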
<h2 id="duality-gap-formulation-and-fully-dual-approach-the-frank-wolfe-algorithm">Duality-gap formulation and fully dual approach: the Frank-Wolfe algorithm</h2>
<p>For a given pair of variables \((w,a)\), we call <em>duality gap</em> the
quantity \(P(w) - D(a)\). Since
\(P(w) \geq P_{\text{opt}} \geq D_{\text{opt}} \geq D(a)\), the duality
gap is non-negative, and provides an optimality certificate for the
primal: \(P(w) - P_{\text{opt}} \leq P(w) - D(a)\). So instead of solving
the saddle-point problem \(\min_w \max_a F(w,a)\), we may consider solving
the duality-gap minimization problem</p>
\[\min_{w \in \mathcal{W}} \min_{a \in \mathcal{Y}^*} P(w) - D(a).\]
<p>Observe that the duality gap can be split into two terms as:</p>
\[\begin{aligned}
P(w) - D(a)
&= \left[ P(w) - F(w,a) \right]
+ \left[ F(w,a) - D(a) \right] \\
&= \left[ \mathcal{L}^*(a) + \mathcal{L}(Vw) - \left\langle Vw, a \right\rangle \right]
+ \left[ \Psi^*(-V^*a) + \Psi(w) - \left\langle w, -V^* a \right\rangle \right].\end{aligned}\]
<p>Both bracketed terms are non-negative. The first term is zero iff
\(\eqref{eq:FRDT_KKT_L}\) is satisfied, and the second term is zero iff
\(\eqref{eq:FRDT_KKT_Psi}\) is satisfied.</p>
<p>The above suggests the following idea:</p>
<ul>
<li>
<p>Choose an update rule for \(w_{t+1}\) such that
\(\eqref{eq:FRDT_KKT_L}\) is always satisfied, so that at each
step, \(F(w_t,a_t) = P(w_t)\);</p>
</li>
<li>
<p>Choose an update rule for \(a_{t+1}\) that takes a step towards
minimizing the second term in the duality gap:
\(\min_{a'} F(w_t,a') - D(a') = \Psi^*(-V^*a') + \Psi(w_t) - \left\langle w_t, -V^* a' \right\rangle\).</p>
</li>
</ul>
<p>Note that, compared to the saddle-point paradigm from the previous
subsection, this idea seems completely backwards:</p>
<ul>
<li>
<p>We saw that fully-optimizing in the primal for the saddle-point
problem \(\min_w \max_a F(w,a)\) leads to choosing \(w_t\) that always
satisfies the KKT condition for \(\Psi\)
\(\eqref{eq:FRDT_KKT_Psi}\); whereas here we enforce the KKT
condition for \(\mathcal{L}\)
\(\eqref{eq:FRDT_KKT_L}\).</p>
</li>
<li>
<p>For the dual update, the saddle-point approach suggests to choose
\(a_{t+1}\) as taking a step towards \(\max_{a'} F(w_t,a')\), i.e to
take a step towards minimizing the first term in the duality gap:
\(\min_{a'} P(w_t) - F(w_t,a')\); whereas here we take a step towards
minimizing the second term.</p>
</li>
</ul>
<p>This is due to the fact that here we choose \(w_t\) to optimize only the
first term of the duality-gap split, \(P(w) - F(w,a)\), and boldly ignore
the second term…</p>
<h3 id="a-trick-to-enforce-the-wrong-kkt-condition">A trick to enforce the “wrong” KKT condition</h3>
<p>To actually implement the idea explained above, we are faced with a
difficulty. The primal update rule is to choose \(w_t\) such that
\(V w_t \in \partial \mathcal{L}^*(a_t)\). A naive approach is to
dumbly compute \(\partial \mathcal{L}^*(a_t)\) and to somehow find a \(w_t\)
in its preimage by \(V\). But \(V\), the evaluation operator, is typically
difficult to invert (think of \(V\) as the data matrix and
\(V \in \mathbb{R}^{n \times p}\) with \(p \gg n\)).</p>
<p>Now comes a trick: suppose we have, at timestep \(t\),
\(V w_t \in \partial \mathcal{L}^*(a_t)\), and we want to construct
\(w_{t+1}\) such that \(V w_{t+1} \in \partial \mathcal{L}^*(a_{t+1})\).
Further suppose that we chose to update \(a_{t+1}\) by mirror descent
using \(\mathcal{L}^*\):</p>
\[\begin{aligned}
\nabla \mathcal{L}^*(a_{t+1})
&= \nabla \mathcal{L}^*(a_t) - \sigma_t
\left.
\partial_a \left(
\Psi^*(-V^*a) + \Psi(w_t) - \left\langle w_t, -V^* a \right\rangle
\right)
\right|_{a_t} \\
&= \nabla \mathcal{L}^*(a_t)
+ \sigma_t V \nabla \Psi^*(-V^* a_t)
- \sigma_t V w_t.\end{aligned}\]
<p>Thus, we wish to construct \(w_{t+1}\) such that</p>
\[V w_{t+1} = V w_t
+ \sigma_t V \nabla \Psi^*(-V^* a_t)
- \sigma_t V w_t.\]
<p>Clearly a possible choice is to simply set
\(w_{t+1} = (1-\sigma_t) w_t + \sigma_t \nabla \Psi^*(-V^* a_t)\) !</p>
<p>Thus, we obtain the following algorithm for implicit regularization:</p>
\[\begin{aligned}
w_0, a_0 & ~\text{such that}~ a_0 \in \partial \mathcal{L}(Vw_0) \\
\nabla \mathcal{L}^*(a_{t+1}) &= \nabla \mathcal{L}^*(a_t)
+ \sigma_t V \nabla \Psi^*(-V^* a_t)
- \sigma_t V w_t \\
w_{t+1} &= (1-\sigma_t) w_t + \sigma_t \nabla \Psi^*(-V^* a_t)\end{aligned}\]
<p>which can be simplified as</p>
\[\begin{aligned}
w_0, a_0 & ~\text{such that}~ a_0 \in \partial \mathcal{L}(Vw_0) \\
a_t &= \nabla \mathcal{L}(V w_t) \\
w_{t+1} &= (1-\sigma_t) w_t + \sigma_t \nabla \Psi^*(-V^* a_t)\end{aligned}\]
<p>To avoid confusion, note that the first equation (defining \(a_t\)) is
what we called the primal update rule; and that the second equation
(which looks like a primal step for \(w_{t+1}\)) is actually the
mirror descent update for the dual.</p>
<p>This trick that allows to choose \(w_{t+1}\) satisfying
\(\eqref{eq:FRDT_KKT_L}\), can also be applied to other choices of the
dual update. A crucial ingredient is that the dual update should involve
mirror descent using \(\mathcal{L}^*\). In particular, the trick can be
applied for accelerated mirror descent in the dual: see the paragraph
just below Lemma 3.2 in <a href="http://arxiv.org/abs/2107.00595">(Ji, Srebro and Telgarsky, 2021)</a>, as well as their Appendix B.</p>
<h3 id="relation-to-frank-wolfe">Relation to Frank-Wolfe</h3>
<p>Note that the above method seems significantly different from anything
we could arrive to by mixing-and-matching primal and dual updates for
the saddle-point formulation. Indeed, the primal update rule</p>
\[\begin{aligned}
u_t &= \partial \Psi^*(-V^*a_t) \\
w_{t+1} &= (1-\sigma_t) w_t + \sigma_t u_t\end{aligned}\]
<p>looks a bit mysterious: it’s neither a straightforward variant of gradient descent, nor of mirror
descent, nor of proximal gradient descent.</p>
<p>It’s strongly reminiscent, though, of the Frank-Wolfe a.k.a conditional
gradient method,<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">1</a></sup> since the primal update consists in setting
\(w_{t+1}\) to a convex combination of \(w_t\) and \(u_t\).
And indeed, it may be seen as a generalization of Frank-Wolfe to regularized instead of
constrained optimization problems (<a href="http://arxiv.org/abs/1211.6302">Bach 2013</a>, equation (17)).
To see this, consider the case where \(\Psi(w) = \iota_\Omega(w)\) for some
convex set \(\Omega\). Then, denoting \(g_t = V^* a_t\), the above method is
equivalent to</p>
\[\begin{aligned}
g_t &= V^* a_t
= V^* \nabla \mathcal{L}(V w_t)
= \left.\nabla_w \mathcal{L}(V w)\right|_{w_t} \\
u_t &= \partial \Psi^*(-g_t)
= \mathop{\mathrm{arg\,max}}_s \left\langle s, -g_t \right\rangle - \Psi(s)
= \mathop{\mathrm{arg\,min}}_{s \in \Omega} \left\langle s, g_t \right\rangle \\
w_{t+1} &= (1-\sigma_t) w_t + \sigma_t u_t\end{aligned}\]
<p>which is exactly the Frank-Wolfe algorithm for the problem
\(\min_w \mathcal{L}(Vw) ~\text{s.t}~ w \in \Omega\).</p>
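<p>For instance, with \(\Omega\) the \(\ell_1\)-ball of radius \(R\), the linear minimization
step selects a signed vertex of the ball. A minimal sketch with the exponential data-fitting
term (toy data; \(R\) and the classical stepsize \(\sigma_t = \frac{2}{t+2}\) are arbitrary
illustrative choices of mine):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 20
V = rng.normal(size=(n, d))                # evaluation operator
R = 5.0                                    # radius of the l1-ball constraint Omega

def g_of(w):                               # g_t = V^* grad L(V w), with L(y) = sum_i exp(y_i)
    return V.T @ np.exp(V @ w)

w = np.zeros(d)
for t in range(500):
    g = g_of(w)
    j = np.argmax(np.abs(g))               # the linear minimization over the l1-ball
    u = np.zeros(d)                        # picks the signed vertex -R * sign(g_j) e_j
    u[j] = -R * np.sign(g[j])
    step = 2.0 / (t + 2.0)                 # classical Frank-Wolfe stepsize
    w = (1.0 - step) * w + step * u        # convex combination

print(w)
</code></pre>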
<h3 id="equivalence-to-a-fully-dual-approach">Equivalence to a fully dual approach</h3>
<p>Denote \(S_P\) (resp. \(S_D\)) the optimal solution set for the primal
problem \(\eqref{eq:FRDT_primal}\) (resp. dual problem
\(\eqref{eq:FRDT_dual}\)). According to the FRDT, \(w \in S_P\) iff there
exists \(a \in S_D\) such that
\(\eqref{eq:FRDT_KKT_L}\). This motivates the following fully dual
approach:</p>
<ul>
<li>
<p>Choose an update rule for \(w_{t+1}\) such that
\(\eqref{eq:FRDT_KKT_L}\) is always satisfied;</p>
</li>
<li>
<p>Choose an update rule for \(a_{t+1}\) that takes a step towards
solving the dual problem: \(\max_{a'} D(a')\) i.e
\(\min_{a'} -D(a') = \Psi^*(-V^* a') + \mathcal{L}^*(a')\).</p>
</li>
</ul>
<p>For the dual update rule we may choose mirror descent using
\(\mathcal{L}^*\).
For the primal update rule, we may apply the same trick as above to enforce
\(\eqref{eq:FRDT_KKT_L}\).
Thus we are led to the exact same algorithm as by the duality-gap approach.</p>
<p>In hindsight this equivalence is not surprising at all. Indeed, since we
enforce \(\eqref{eq:FRDT_KKT_L}\), the dual problem (towards solving which we
let \(a_{t+1}\) take a step) is the same in both approaches:</p>
\[\min_{a'} F(w_t,a')-D(a')
= P(w_t)-D(a')
\equiv \max_{a'} D(a').\]
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:3" role="doc-endnote">
<p><a href="https://en.wikipedia.org/wiki/Frank-Wolfe_algorithm">https://en.wikipedia.org/wiki/Frank-Wolfe_algorithm</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Regularized linear models and the Fenchel-Rockafellar duality theorem (I): Generalities</h1>
<blockquote>
<p>This is the first of a series of posts on optimization of regularized linear models through the lens of duality.
The series will be essentially a summary of what I learned during the first few weeks of my internship, when I looked into topics related to learning in Banach spaces (before I moved on to different, more concrete topics).
The relevant convex analysis background can be found in last time’s <a href="/math/2021/08/08/func_convex_analysis_cheatsheet.html">cheatsheet</a> – which is basically the zero-th post of this series.</p>
<p>With the background definitions under our belt, we are <em>almost</em> ready to talk about concrete consequences of convex duality.
As it turns out, to get a principled understanding of the many places where duality pops up, it is beneficial to first present the Fenchel-Rockafellar duality theorem, a kind of “master theorem” for convex duality.</p>
</blockquote>
<p>Many smart observations on (explicitly or implicitly) regularized linear
models, as well as on convex optimization methods, make use of some sort
of convex duality argument. For example, already in introductory ML
courses, Lagrangian duality is typically used to show that the SVM
solution is equivalent to the max-margin classifier. In this document we
present a unified framework for these duality arguments, that makes it
conceptually easier to draw connections.</p>
<p>On a personal note, I have always found convex and Lagrangian duality
to be particularly black magic. The nice thing about the
Fenchel-Rockafellar duality theorem is that it is general enough to
contain all of that black magic, making duality-based derivations easier
to follow.</p>
<ul id="markdown-toc">
<li><a href="#sec:setup_notation" id="markdown-toc-sec:setup_notation">Generic supervised learning setup and notation</a> <ul>
<li><a href="#standard-notations-for-convex-duality" id="markdown-toc-standard-notations-for-convex-duality">Standard notations for convex duality.</a></li>
</ul>
</li>
<li><a href="#sec:FRDT" id="markdown-toc-sec:FRDT">Fenchel-Rockafellar duality theorem (FRDT)</a></li>
<li><a href="#sec:variants_of_regu" id="markdown-toc-sec:variants_of_regu">Variants of regularization: penalized, constrained, min-norm interpolation</a> <ul>
<li><a href="#coming-up-next" id="markdown-toc-coming-up-next">Coming up next…</a></li>
</ul>
</li>
</ul>
<h3 id="sec:setup_notation">Generic supervised learning setup and notation</h3>
<p>Consider supervised learning with a linear model (linear in the
parameters) i.e a hypothesis space of the form</p>
\[\mathcal{F}= \left\lbrace
f_w: x \mapsto \left\langle w, \phi(x) \right\rangle,
w \in \mathcal{W}
\right\rbrace\]
<p>for some feature map
\(\phi: \mathcal{X}\to \mathcal{W}^*\), where \(\mathcal{X}\) is input space
and \(\mathcal{W}\) is a Banach parameter space.</p>
<p>Suppose we are given a dataset \((x_i, y^{\text{tgt}}_i)_{i \leq n}\) and
we want to solve an optimization problem of the form <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>
\[\mathop{\mathrm{arg\,min}}_{w \in \mathcal{W}} \Psi(w) + \frac{1}{n} \sum_{i=1}^n \ell \left( y^{\text{tgt}}_i, \left\langle w, \phi(x_i) \right\rangle \right),\]
<p>where each \(\ell(y^{\text{tgt}}_i, \cdot)\) is a convex function, and the
regularizer term \(\Psi\) is convex. This generic formulation captures
pretty much all standard supervised learning settings; see the last section of this post.</p>
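<p>For instance (standard examples, just to fix ideas): ridge regression corresponds to \(\ell(y^{\text{tgt}}_i, y_i) = \frac{1}{2} (y_i - y^{\text{tgt}}_i)^2\) and \(\Psi(w) = \frac{\lambda}{2} \left\lVert w \right\rVert_2^2\), while the soft-margin SVM corresponds to the hinge loss \(\ell(y^{\text{tgt}}_i, y_i) = \max(0, 1 - y^{\text{tgt}}_i y_i)\) with the same \(\Psi\).</p>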
<p>Further pose the shorthands:</p>
<ul>
<li>
<p>\(\mathcal{Y}= (\mathbb{R}^n, \left\lVert \cdot \right\rVert)\) for
some arbitrary norm; \(\mathcal{Y}^*\) will denote \(\mathbb{R}^n\)
equipped with the dual norm \(\left\lVert \cdot \right\rVert_*\);</p>
</li>
<li>
<p>\(\mathcal{L}(y) = \frac{1}{n} \sum_i \ell(y^{\text{tgt}}_i, y_i)\)
the data-fitting term;</p>
</li>
<li>
<p>\(V: \left[ \mathcal{W}\to \mathcal{Y}, w \mapsto (\left\langle w, \phi(x_i) \right\rangle)_{i \leq n} \right]\)
the evaluation operator.</p>
</li>
</ul>
<h4 id="standard-notations-for-convex-duality">Standard notations for convex duality.</h4>
<ul>
<li>
<p>For a Banach space \(E\), the dual space is denoted \(E^*\).</p>
</li>
<li>
<p>The set of proper, lower-semicontinuous (l.s.c), and convex
functions over \(E\) is denoted \(\Gamma(E)\).</p>
</li>
<li>
<p>For a convex function \(\Psi\), \(\Psi^*\) denotes the convex conjugate.</p>
</li>
<li>
<p>For a linear operator \(V: \mathcal{W}\to \mathcal{Y}\) between Banach
spaces, \(V^*: \mathcal{Y}^* \to \mathcal{W}^*\) denotes the adjoint
operator between the dual spaces.</p>
</li>
<li>
<p>The indicator function of a set \(A\) is defined as
\(\iota_A(x) = \begin{cases}
\infty \text{ if } x \not\in A \\
0 \text{ if } x \in A
\end{cases}\).</p>
</li>
<li>
<p>\(\left\lVert \cdot \right\rVert\) denotes an arbitrary norm, and
\(\left\lVert \cdot \right\rVert_*\) denotes its dual norm.</p>
</li>
</ul>
<h3 id="sec:FRDT">Fenchel-Rockafellar duality theorem (FRDT)</h3>
<p>Let us state the FRDT with machine-learning-friendly notation, as
motivated above.</p>
<div class="theorem" text="FRDT">
<p>Let \(\mathcal{W}\) and \(\mathcal{Y}\) be two real Banach spaces. Let
\(\Psi \in \Gamma(\mathcal{W})\), let
\(\mathcal{L}\in \Gamma(\mathcal{Y})\), and let
\(V: \mathcal{W}\to \mathcal{Y}\) be a bounded linear operator. Consider
the primal problem</p>
\[\label{eq:FRDT_primal} \tag{P}
\min_{w \in \mathcal{W}} \Psi(w) + \mathcal{L}(V w) =: P(w)\]
<p>and define the dual problem as</p>
\[\label{eq:FRDT_dual} \tag{D}
\max_{a \in \mathcal{Y}^*} - \Psi^*(-V^* a) - \mathcal{L}^*(a) =: D(a).\]
<p>Denote \(S_P\) and \(S_D\) their respective optimal solution sets. Denote
the KKT conditions</p>
\[\label{eq:FRDT_KKT_Psi} \tag{KKT_Psi}
w \in \partial \Psi^*(-V^* a)
~~\text{i.e}~~
-V^* a \in \partial \Psi(w)\]
\[\label{eq:FRDT_KKT_L} \tag{KKT_L}
a \in \partial \mathcal{L}(V w)
~~\text{i.e}~~
V w \in \partial \mathcal{L}^*(a)\]
<p>Weak duality holds: for all \(w\) and \(a\),
\(P(w) \geq P_{\text{opt}} \geq D_{\text{opt}} \geq D(a)\).</p>
<p>Suppose that
\(0 \in \text{interior}( V(\mathop{\mathrm{dom}}\Psi)- \mathop{\mathrm{dom}}\mathcal{L})\).
Then strong duality holds:</p>
<ul>
<li>
<p>(P) and (D) have the same optimal value
\(P_{\text{opt}} = D_{\text{opt}}\);</p>
</li>
<li>
<p>\(w \in S_P\) and \(a \in S_D\) iff
\(\eqref{eq:FRDT_KKT_L}\) and \(\eqref{eq:FRDT_KKT_Psi}\);</p>
</li>
<li>
<p>\(w \in S_P\) iff there exists \(a \in \mathcal{Y}^*\) such that
\(\eqref{eq:FRDT_KKT_Psi}\) and
\(\eqref{eq:FRDT_KKT_L}\), iff there exists \(a \in S_D\) such that
\(\eqref{eq:FRDT_KKT_L}\).</p>
</li>
</ul>
</div>
<div class="proof" text="sketched">
<p>The actual proof is much more complicated, <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> but hinges on the
following calculation which is enough to gain intuition:</p>
\[\begin{aligned}
\eqref{eq:FRDT_primal}
&\equiv \min_w \Psi(w) + \mathcal{L}(Vw) \\
&\equiv \min_w \max_a \Psi(w) + \left\langle Vw, a \right\rangle_\mathcal{Y}- \mathcal{L}^*(a) \\
&\equiv \min_w \max_a \Psi(w) + \left\langle w, V^* a \right\rangle_\mathcal{W}- \mathcal{L}^*(a) \\
&\geq \max_a \min_w - \left[ \left\langle w, -V^* a \right\rangle_\mathcal{W}- \Psi(w) \right] - \mathcal{L}^*(a) \\
&\equiv \max_a -\Psi^*(-V^*a) - \mathcal{L}^*(a)
\equiv \eqref{eq:FRDT_dual}
\end{aligned}\]
<p>Denote
\(F(w,a) = \Psi(w) + \left\langle Vw, a \right\rangle - \mathcal{L}^*(a)\).
The above calculation shows that \(P(w) \geq F(w,a) \geq D(a)\) for all
\(w,a\). To see where the KKT conditions come from, note that
\(P(w) = D(a)\) implies</p>
<ul>
<li>
<p>\(F(w,a) = P(w)\) i.e
\(\left\langle Vw, a \right\rangle_\mathcal{Y}- \mathcal{L}^*(a)= \mathcal{L}(Vw)\),
i.e \(a\) saturates the Fenchel-Young inequality, i.e
\(a \in \partial \mathcal{L}(Vw)\);</p>
</li>
<li>
<p>\(F(w,a) = D(a)\) i.e
\(\left\langle w, -V^* a \right\rangle_\mathcal{W}- \Psi(w) = \Psi^*(-V^*a)\),
i.e \(w\) saturates the Fenchel-Young inequality, i.e
\(w \in \partial \Psi^*(-V^* a)\).</p>
</li>
</ul>
</div>
<div class="remark">
<p>The condition that
\(0 \in \text{interior}( V(\mathop{\mathrm{dom}}\Psi)- \mathop{\mathrm{dom}}\mathcal{L})\)
is morally just a variant of Slater’s condition: “there exists a
strictly feasible point”. There are other constraint qualification
conditions that imply strong duality.</p>
</div>
<div class="remark">
<p>The adjoint of the evaluation operator, \(V^*\), is simply given by</p>
\[\forall a \in \mathcal{Y}^* = \mathbb{R}^n,~
V^* a = \sum_{i=1}^n a_i \left\langle \cdot, \phi(x_i) \right\rangle
= \left\langle \cdot, \sum_{i=1}^n a_i \phi(x_i) \right\rangle.\]
<p>For finite-dimensional features, say \(\dim(\mathcal{W}) = p\), \(V\) can be
seen as the transformed data matrix \(\begin{bmatrix}
\phi(x_1) & ... & \phi(x_n)
\end{bmatrix}^\top \in \mathbb{R}^{n \times p}\), and \(V^*\) is simply
its transpose.</p>
</div>
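<p>To make the theorem a bit less abstract, here is a quick worked instance (a sanity check of mine, using the notation above): ridge regression. Take \(\mathcal{W}= \mathbb{R}^p\) and \(\mathcal{Y}= \mathbb{R}^n\) Euclidean, \(\Psi(w) = \frac{\lambda}{2} \left\lVert w \right\rVert_2^2\) and \(\mathcal{L}(y) = \frac{1}{2n} \left\lVert y - y^{\text{tgt}} \right\rVert_2^2\). Then \(\Psi^*(z) = \frac{1}{2\lambda} \left\lVert z \right\rVert_2^2\) and \(\mathcal{L}^*(a) = \left\langle y^{\text{tgt}}, a \right\rangle + \frac{n}{2} \left\lVert a \right\rVert_2^2\), so the primal and dual problems read</p>
\[\min_{w} \frac{\lambda}{2} \left\lVert w \right\rVert_2^2 + \frac{1}{2n} \left\lVert Vw - y^{\text{tgt}} \right\rVert_2^2
~~~~\text{and}~~~~
\max_{a} -\frac{1}{2\lambda} \left\lVert V^* a \right\rVert_2^2 - \left\langle y^{\text{tgt}}, a \right\rangle - \frac{n}{2} \left\lVert a \right\rVert_2^2,\]
<p>and the KKT conditions \(\eqref{eq:FRDT_KKT_Psi}\) and \(\eqref{eq:FRDT_KKT_L}\) become \(w = -\frac{1}{\lambda} V^* a\) and \(a = \frac{1}{n} (Vw - y^{\text{tgt}})\) respectively. This is easy to check numerically; here is a small numpy sketch (the variable names are mine, and \(X\) plays the role of \(V\)):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.1
X = rng.standard_normal((n, p))   # evaluation operator V: rows are phi(x_i)
y = rng.standard_normal(n)        # targets y^tgt

def P(w):  # primal objective (ridge regression)
    return lam / 2 * (w @ w) + np.sum((X @ w - y) ** 2) / (2 * n)

def D(a):  # dual objective given by FRDT
    return -np.sum((X.T @ a) ** 2) / (2 * lam) - y @ a - n / 2 * (a @ a)

# closed-form optima of the two quadratic problems
w_opt = np.linalg.solve(lam * np.eye(p) + X.T @ X / n, X.T @ y / n)
a_opt = -np.linalg.solve(n * np.eye(n) + X @ X.T / lam, y)

print(P(w_opt) - D(a_opt))                       # ~0: strong duality
print(np.allclose(w_opt, -X.T @ a_opt / lam))    # KKT_Psi
print(np.allclose(a_opt, (X @ w_opt - y) / n))   # KKT_L
</code></pre>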
<h3 id="sec:variants_of_regu">Variants of regularization: penalized, constrained, min-norm interpolation</h3>
<p>It is common wisdom that <em>models with lower “complexity” have better
generalization properties</em>.</p>
<p>Traditionally, low “complexity” of the learned model is ensured in
practice by adding a penalty term to the loss function. In our notation:
\(f_w\) is chosen as minimizing \(\lambda \psi(w) + \mathcal{L}(Vw)\), for
some regularizer or “complexity measure” \(\psi\). Still traditionally, a
theory-friendlier alternative is to constrain the parameters to a
low-complexity set \(\Omega\), e.g
\(\Omega = \left\lbrace w; \psi(w) \leq B \right\rbrace\).</p>
<p>Parallel to these ideas, also of interest for theory and practice are
so-called <em>overparametrized</em> settings, whereby
\(\dim(\mathcal{W}) \gg n\), so that there exist many
values of \(w\) that interpolate the data i.e such that
\(\mathcal{L}(V w) = 0\). In such settings, the aforementioned common
wisdom can be interpreted in a third way: select, among all
interpolating values of \(w\), the one with the lowest “complexity”:
\(\mathop{\mathrm{arg\,min}}_w \psi(w) ~\text{s.t}~ \mathcal{L}(V w) = 0\).
<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>Note that, since FRDT strictly generalizes Lagrangian duality, it offers
a unified framework for:</p>
<ul>
<li>
<p>regularization via penalization, by letting \(\Psi(w)\) be the penalty
term \(\lambda \psi(w)\);</p>
</li>
<li>
<p>regularization via constraining the parameters to a convex set
\(\Omega\), by letting \(\Psi(w) = \iota_\Omega(w)\);</p>
</li>
<li>
<p>min-regularizer e.g min-norm interpolation, by letting
\(\mathcal{L}(Vw) = \iota_{ \{y^{\text{tgt}}\} }(Vw)\).</p>
</li>
</ul>
<p>Regardless of the setting (penalized, constrained, or interpolation),
many smart observations on linear models can be made by using some sort
of duality argument. The nice thing about the FRDT is that it provides a
unified way to formulate all those duality-based arguments.</p>
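<p>As a quick illustration of the interpolation variant (again a standard example, written in the notation above): take \(\Psi = \left\lVert \cdot \right\rVert_1\) on \(\mathcal{W}= \mathbb{R}^p\) and \(\mathcal{L}= \iota_{ \{y^{\text{tgt}}\} }\). Then \(\Psi^* = \iota_{ \{ \left\lVert \cdot \right\rVert_\infty \leq 1 \} }\) and \(\mathcal{L}^*(a) = \left\langle y^{\text{tgt}}, a \right\rangle\), so that, after the change of variable \(a \to -a\), the dual problem \(\eqref{eq:FRDT_dual}\) reads</p>
\[\max_{a \in \mathbb{R}^n} \left\langle y^{\text{tgt}}, a \right\rangle
~~~~\text{s.t}~~~~
\left\lVert V^* a \right\rVert_\infty \leq 1,\]
<p>which is the familiar dual of min-\(\ell_1\)-norm interpolation (basis pursuit).</p>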
<h4 id="coming-up-next">Coming up next…</h4>
<p>In the next few posts, I will discuss a few example topics where
convex duality arguments are invoked, through the lens of the FRDT.
I hope to convince you that adopting that viewpoint is indeed very helpful
for getting a principled understanding of those topics.
Up next: a zoo of primal-dual methods for optimization of regularized linear models.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Usually the dataset is denoted \((x_i, y_i)_{i \leq n}\) and the
predictions are denoted \(\hat{y}_i\). Here we chose to denote
\(y^{\text{tgt}}_i\) the target labels because it will be more
convenient to denote simply \(y_i\) the predictions. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The proof can be found in Section 31 of Rockafellar’s 1970 book
“Convex analysis”, and on this webpage:
<a href="https://pwacker.com/fenchelrockafellar.html">https://pwacker.com/fenchelrockafellar.html</a>, which also contains
nice illustrative drawings. Apparently this blog post series was
also planning to nicely present the proof:
<a href="https://dohmatob.github.io/research/2019/10/31/duality.html">https://dohmatob.github.io/research/2019/10/31/duality.html</a>, but
it hasn’t been updated in a while. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Note that this formalization of "data-interpolation" is only for
regression, since for classification with the logistic loss for
example, \(\mathcal{L}(V w) = 0\) is impossible. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Guillaume WangThis is the first of a series of posts on optimization of regularized linear models through the lens of duality. The series will be essentially a summary of what I learned during the first few weeks of my internship, when I looked into topics related to learning in Banach spaces (before I moved on to different, more concrete topics). The relevant convex analysis background can be found in last time’s cheatsheet – which is basically the zero-th post of this series. With the background definitions under our belt, we are almost ready to talk about concrete consequences of convex duality. As it turns out, to get a principled understanding of the many places where duality pops up, it is beneficial to first present the Fenchel-Rockafellar duality theorem, a kind of “master theorem” for convex duality.A functional and convex analysis cheat sheet2021-08-08T00:00:00+00:002021-08-08T00:00:00+00:00https://guillaumew16.github.io/math/2021/08/08/func_convex_analysis_cheatsheet<blockquote>
<p>Another post that was prepared a while ago and “snoozed” until now…
It is part of a planned series of posts on linear models and regularization, and a tinge of optimization.</p>
<p>As last time, I’m not satisfied with the math rendering, so here is the <a href="/contents/func_convex_analysis_cheatsheet.pdf">LaTeX version</a>.</p>
</blockquote>
<p>Among the topics that I’ve been interested in during the last few
months, many required some knowledge of convex analysis: regularization
in linear models, primal-dual views on optimization, representer
theorems in reproducing kernel Banach spaces... Moreover the
finite-dimensional setting doesn’t suffice: a more abstract point of
view is necessary or at least useful; namely Banach spaces seem to be
the appropriate level of abstraction for those topics.</p>
<p>In this document I compile some relevant functional and convex analysis
background, in the form of a cheat sheet. It is not at all meant to be
exhaustive, I only included basic facts and tricks that I found
interesting. I may add to it in the future.</p>
<p>Proofs and appendices can be found in the <a href="/contents/func_convex_analysis_cheatsheet.pdf">LaTeX version</a> of this document.</p>
<ul id="markdown-toc">
<li><a href="#functional-analysis-banach-duality" id="markdown-toc-functional-analysis-banach-duality">Functional analysis (Banach duality)</a> <ul>
<li><a href="#duality-in-banach-spaces" id="markdown-toc-duality-in-banach-spaces">Duality in Banach spaces</a></li>
<li><a href="#hahn-banach-theorem-and-useful-consequences" id="markdown-toc-hahn-banach-theorem-and-useful-consequences">Hahn-Banach theorem and useful consequences</a></li>
</ul>
</li>
<li><a href="#convex-analysis-convex-duality" id="markdown-toc-convex-analysis-convex-duality">Convex analysis (convex duality)</a> <ul>
<li><a href="#convex-conjugate" id="markdown-toc-convex-conjugate">Convex conjugate</a></li>
<li><a href="#convex-conjugates-vssubdifferentials" id="markdown-toc-convex-conjugates-vssubdifferentials">Convex conjugates vs. subdifferentials</a></li>
<li><a href="#convex-conjugacy-swaps-strict-convexity-for-differentiability-and-strong-convexity-for-smoothness" id="markdown-toc-convex-conjugacy-swaps-strict-convexity-for-differentiability-and-strong-convexity-for-smoothness">Convex conjugacy swaps strict convexity for differentiability, and strong convexity for smoothness</a></li>
</ul>
</li>
</ul>
<h2 id="functional-analysis-banach-duality">Functional analysis (Banach duality)</h2>
<p>Beyond finite dimension, Banach spaces are a simple and natural level of
abstraction for discussing convex analysis. This section is mostly
extracted from the appendix of my <a href="/contents/Master_s_thesis_report-final.pdf">Master’s thesis</a>.</p>
<div class="definition" text="Banach space">
<p>A metric space \((E,d)\) is called <em>complete</em> if all Cauchy sequences
\((u_n)_n \in E^\mathbb{N}\) converge in \(E\).</p>
<p>A <em>Banach space</em> \((E, \left\lVert \cdot \right\rVert_E)\) is a vector
space equipped with a norm for which it is a complete space.</p>
<p>A <em>Hilbert space</em> \((H, \left\langle \cdot, \cdot \right\rangle_H)\) is a
vector space equipped with an inner product that is complete for the
induced norm
\(\left\lVert x \right\rVert_H^2 = \left\langle x, x \right\rangle_H\).</p>
<p>The unit ball of a normed space \((E, \left\lVert \cdot \right\rVert_E)\)
is denoted \(B^{(E)} = B^{(E)}_{0,1}
:= \left\lbrace x \in E;~ \left\lVert x \right\rVert_E \leq 1 \right\rbrace\).</p>
<p>A continuous linear mapping \(T: E \to F\) between Banach spaces is called
a <em>bounded operator</em>, and its <em>operator norm</em> is the finite quantity
\({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert T \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert} = {\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert T \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}_{E \to F}
:= \sup_{\left\lVert x \right\rVert_E \leq 1} \left\lVert T x \right\rVert_F\).
The set of bounded operators from \(E\) to \(F\) equipped with the operator
norm
\((\mathcal{L}_b(E,F), {\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert \cdot \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert})\)
is itself a Banach space.</p>
<p>A bounded operator \(T: E \to F\) is called <em>compact</em> if it sends the unit
ball into a relatively compact set, i.e \(T(B^{(E)})\) is a relatively
compact set of \(F\), i.e \(\overline{T(B^{(E)})}\) is compact where
\(\overline{~\cdot~}\) denotes closure w.r.t the norm of \(F\).</p>
</div>
<h3 id="duality-in-banach-spaces">Duality in Banach spaces</h3>
<div class="definition" text="dual space">
<p>The <em>(topological) dual</em> of a Banach space \(E\) is the space of bounded
linear forms \(E' = \mathcal{L}_b(E, \mathbb{R})\). It is equipped with
the norm
\(\left\lVert X \right\rVert_{E'} := \sup_{\left\lVert x \right\rVert_E \leq 1} \left\lvert X(x) \right\rvert\).
\(E'\) is itself a Banach space.</p>
<p>The <em>duality bracket</em> of \(E\) is the bilinear operator
\(\left\langle \cdot, \cdot \right\rangle_E: E \times E' \to \mathbb{R}\)
defined by \(\left\langle x, X \right\rangle_E = X(x)\).</p>
<p>The <em>bidual</em> of \(E\) is the space \(E'' = (E')'\). \(E\) can be embedded into
\(E''\) by \(x \mapsto x''\), where \(x''\) is defined by:
\(\forall X \in E',~ \left\langle X, x'' \right\rangle_{E'} = \left\langle x, X \right\rangle_E = X(x)\).
It is not hard to show (using existence of norming functionals, see
below) that this embedding is isometric i.e
\(\left\lVert x \right\rVert_{(E')'} = \left\lVert x \right\rVert_E\).</p>
<p>\(E\) is called a <em>reflexive</em> Banach space if the converse holds, i.e if
any element of the bidual can also be seen as an element of the primal,
i.e if \(E'' \simeq E\).</p>
</div>
<p>In this section, elements of the primal space will typically be denoted
by lowercase letters e.g \(x \in E, y \in F\), and elements of the dual by
uppercase letters e.g \(X \in E', Y \in F'\).</p>
<div class="remark" text="bra-ket">
<p>The duality bracket is very similar to the physicists’ bra-ket notation;
except that here the primal is on the left and the dual is on the right,
instead of the opposite.</p>
<p>When \(E\) is reflexive, then all the shorthands from the bra-ket notation
can be used. That is, a dual element \(X\) can be denoted without
ambiguity as \(\left\langle \cdot, X \right\rangle_E\) , and a primal
element \(x = x''\) as \(\left\langle x, \cdot \right\rangle_E\) . Moreover,
for a bounded operator \(T: E \to F\), we can write without ambiguity
\(\left\langle Tx, Y \right\rangle_F = \left\langle x| T |Y \right\rangle\) .
However since there are many interesting Banach spaces that are not
reflexive, we will not use such shorthands.</p>
</div>
<h3 id="hahn-banach-theorem-and-useful-consequences">Hahn-Banach theorem and useful consequences</h3>
<div class="theorem" text="Hahn-Banach">
<p>Let \(\underline{E}\) be a linear subspace of a normed vector space
\((E, \left\lVert \cdot \right\rVert)\) and let
\(f: \underline{E}\to \mathbb{R}\) be a bounded linear form on
\((\underline{E}, \left\lVert \cdot \right\rVert)\).</p>
<p>Then there exists \(g: E \to \mathbb{R}\) a bounded linear form on all of
\(E\), such that</p>
<ul>
<li>
<p>\(g\) is an extension of \(f\):
\(\left.g\right|_{\underline{E}} = f\);</p>
</li>
<li>
<p>The extension “comes to no cost” in operator norm:
\({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert g \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert} = {\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert f \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}\).</p>
</li>
</ul>
<p>(Here the operator norms are with respect to their respective domains:
\({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert f \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert} = \sup_{x \in \underline{E}; \left\lVert x \right\rVert \leq 1} \left\lvert f(x) \right\rvert\),
\({\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert g \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert} = \sup_{x \in E; \left\lVert x \right\rVert \leq 1} \left\lvert g(x) \right\rvert\).)</p>
</div>
<p>As one of the many important consequences of that theorem, we have the
existence of norming functionals.</p>
<div class="definition" text="norming functional">
<p>Let \(E\) be a Banach space.</p>
<p>For all \(x \in E \setminus \{0_E\}\), there exists \(X \in E'\) such that
\(X(x) = \left\langle x, X \right\rangle_E = \left\lVert x \right\rVert_E\)
and \(\left\lVert X \right\rVert_{E'} = 1\). \(X\) is then called a <em>norming
functional</em> of \(x\).</p>
<p>By convention, any \(X \in B^{(E')}\) will be called a norming functional
of \(0_E\).</p>
</div>
<p>Importantly, the norming functional \(X\) is not unique in general, and
there is no generic way to construct it – the proof is not
constructive. <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> (This stands in contrast with the case of Hilbert
spaces, where the norming functional is unique and given by the Riesz
representation theorem.)</p>
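<p>For a concrete finite-dimensional example (mine, just to fix ideas): take \(E = (\mathbb{R}^d, \left\lVert \cdot \right\rVert_1)\), so that \(E' = (\mathbb{R}^d, \left\lVert \cdot \right\rVert_\infty)\). For \(x \neq 0\), \(X\) is a norming functional of \(x\) iff \(X_i = \mathop{\mathrm{sign}}(x_i)\) for every \(i\) with \(x_i \neq 0\) and \(\left\lvert X_i \right\rvert \leq 1\) elsewhere; so the norming functional is non-unique as soon as some coordinate of \(x\) vanishes.</p>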
<div class="proposition">
<p>The primal \(E\) injects isometrically into the bidual \(E''\).
Namely, \(x \in E\) is canonically associated to \(x'' \in E''\) defined by
\(\forall X \in E', \left\langle X, x'' \right\rangle_{E'} = \left\langle x, X \right\rangle_E\).</p>
</div>
<p>Another important consequence of the Hahn-Banach theorem is the
following density criterion.</p>
<div class="lemma" text="Hahn-Banach density criterion">
<p>Let \(E\) be a Banach space. Let \(\underline{E}\) be any subspace and \(A\)
any subset.</p>
<p>\(\begin{aligned}
\underline{E}\text{ is dense in } E
&& \iff &&
\underline{E}^\perp :=
\left\lbrace
X \in E';~
\forall x \in \underline{E}, \left\langle x, X \right\rangle_E = 0
\right\rbrace
= \{ 0_{E'} \} \\
\mathop{\mathrm{span}}(A) \text{ is dense in } E
&& \iff &&
A^\perp :=
\left\lbrace
X \in E';~
\forall a \in A, \left\langle a, X \right\rangle_E = 0
\right\rbrace
= \{ 0_{E'} \}
\end{aligned}\)</p>
</div>
<p>The set \(A^\perp\) is called the <em>annihilator</em> of \(A\). Note that
\(A^\perp = (\mathop{\mathrm{span}}(A))^\perp\). In the case where \(E\) is
a Euclidean space, \(A^\perp\) is just (up to isometry) the orthogonal
complement of \(\mathop{\mathrm{span}}(A)\).</p>
<h2 id="convex-analysis-convex-duality">Convex analysis (convex duality)</h2>
<p>For the rest of this subsection, fix a Banach space \(E\).</p>
<div class="definition">
<p>A function \(f: E \to \mathbb{R}\cup \{+\infty\}\) is called <em>convex</em> if</p>
\[\forall x, y \in E, \forall t \in [0,1],~ f \left( tx + (1-t) y \right) \leq t f(x) + (1-t) f(y).\]
<p>For any convex \(f: E \to \mathbb{R}\cup \{+\infty\}\),</p>
<ul>
<li>
<p>The <em>domain</em> of \(f\) is the convex set
\(\mathop{\mathrm{dom}}(f) = \left\lbrace
x \in E; f(x) < \infty
\right\rbrace\). \(f\) is called <em>proper</em> if
\(\mathop{\mathrm{dom}}(f) \neq \varnothing\).</p>
</li>
<li>
<p>\(f\) is called <em>lower-semicontinuous</em> (l.s.c) if its sub-level sets
are closed, i.e for each \(c \in \mathbb{R}\),
\(\left\lbrace x \in E; f(x) > c \right\rbrace\) is an open set.</p>
</li>
</ul>
<p>Denote \(\Gamma(E)\) the set of proper l.s.c convex functions over \(E\).</p>
</div>
<div class="definition">
<p>For any proper convex function \(f: E \to \mathbb{R}\cup \{+\infty\}\),</p>
<ul>
<li>
<p>The <em>subdifferential</em> of \(f\) at a point \(x_0 \in E\) is the set</p>
\[\partial f(x_0) = \left\lbrace
X \in E'; \forall x \in E, f(x) \geq f(x_0) + \left\langle x-x_0, X \right\rangle_E
\right\rbrace.\]
<p>\(f\) is called <em>subdifferentiable</em> at
\(x_0\) if \(\partial f(x_0) \neq \varnothing\). Note that \(f\) can be
subdifferentiable at \(x_0\) only if
\(x_0 \in \mathop{\mathrm{dom}}(f)\).</p>
</li>
<li>
<p>\(f\) is called <em>differentiable</em> at \(x_0\) if \(\partial f(x_0)\) is a
singleton. Its unique element is then called the <em>differential</em> of
\(f\) at \(x_0\) and denoted \(D f(x_0)\) or \(\nabla f(x_0)\).</p>
</li>
</ul>
</div>
<p>Note that the definitions of “subdifferential” and “differential” above
look different from the usual ones, since they only apply to convex
functions. It can be shown that our definitions are compatible with the
usual ones from real analysis. <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
<div class="proposition">
<p>For any proper l.s.c function \(f: E \to \mathbb{R}\cup \{+\infty\}\) such
that \(\mathop{\mathrm{dom}}(f)\) is open,</p>
<ul>
<li>
<p>\(f\) is convex iff \(\mathop{\mathrm{dom}}(f)\) is convex and
\(\partial f(x_0) \neq \varnothing\) for all
\(x_0 \in \mathop{\mathrm{dom}}(f)\).</p>
</li>
<li>
<p>If \(f\) is convex, then it is differentiable at \(x_0\) (in the usual
real-analytic sense) iff \(\partial f(x_0)\) is a singleton, and the
differential of \(f\) at \(x_0\) (in the usual real-analytic sense) is
then \(D f(x_0)\).</p>
</li>
</ul>
</div>
<h3 id="convex-conjugate">Convex conjugate</h3>
<div class="definition">
<p>For any proper function \(f: E \to \mathbb{R}\cup \{+\infty\}\), the
<em>convex conjugate</em> of \(f\) (a.k.a Fenchel-Legendre a.k.a Legendre-Fenchel
a.k.a Fenchel a.k.a Legendre transform) is the function</p>
<p>\(f^*: \left[ E' \to \mathbb{R}\cup \{+\infty\},
X \mapsto \sup_{x \in E} \left\langle x, X \right\rangle_E - f(x) \right].\)</p>
</div>
<div class="proposition" text="Fenchel-Moreau theorem">
<p>For any proper function \(f\), \(f^*\) is a proper l.s.c convex function.</p>
<p>A function \(f\) is a proper l.s.c convex function iff \(f^{**} = f\).</p>
</div>
<p>For any proper function \(f\), \(f^{**}\) is the tightest l.s.c convex relaxation
of \(f\), in the sense that the epigraph of \(f^{**}\) is the closed convex hull of
the epigraph of \(f\). This is easy to visualize for functions over the
real line.</p>
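<p>A quick one-dimensional example (mine, for illustration): take \(E = \mathbb{R}\) and \(f = \iota_{\{-1,1\}}\), the function equal to \(0\) on \(\{-1,1\}\) and \(+\infty\) elsewhere. Then</p>
\[f^*(X) = \sup_{x \in \{-1,1\}} x X = \left\lvert X \right\rvert
~~~~\text{and}~~~~
f^{**}(x) = \sup_{X \in \mathbb{R}} x X - \left\lvert X \right\rvert = \iota_{[-1,1]}(x),\]
<p>and indeed the epigraph of \(\iota_{[-1,1]}\) is the closed convex hull of the epigraph of \(\iota_{\{-1,1\}}\).</p>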
<div class="remark">
<p>In the proposition above, \(f^{**}\) is understood as a mapping from \(E\)
to \(\mathbb{R}\). Looking at the definitions, it would be more natural to
view \(f^{**}\) as a mapping from \(E''\) to \(\mathbb{R}\) instead, which
would be more general since \(E\) injects isometrically into \(E''\).
However in convex analysis we typically don’t care about what happens
outside of \(E\).</p>
<p>More precisely: to be completely general and consistent with notation,
we could define \(f^{**} = (f^*)^*\) over \(E''\) by
\(\forall z \in E'',~ f^{**}(z) = \sup_{X \in E'} \left\langle X, z \right\rangle_{E'} - f^*(X).\)</p>
<p>Since
\(\left\langle X, x'' \right\rangle_{E'} = \left\langle x, X \right\rangle_E\)
(where \(\left[ E \to E'', x \mapsto x'' \right]\) denotes the canonical
injection), the restriction of \(f^{**}\) to \(E\) is then – and this
equation is typically taken as the definition of \(f^{**}\):</p>
<p>\(\forall x \in E,~ f^{**}(x) =
\sup_{X \in E'} \left\langle x, X \right\rangle_E - f^*(X).\)</p>
</div>
<p>In the context of convex analysis it is common to denote
adjoint/conjugate/dual objects with a superscript "\(*\)". In other
contexts that symbol connotes involution, which may be misleading.
However for convex analysis there is not much risk of mistake, precisely
because of the previous remark: we only ever care about what happens in
\(E\) and \(E'\), never about the bidual space \(E''\). In particular, even if
\(f\) is not a proper l.s.c function, we may always write
\((f^{**})^* = f^*\).</p>
<p>Accordingly, <strong>from here on we will follow the common practice and use
\(x^*\) (instead of \(X\)) to denote a generic element of \(E'\).</strong></p>
<p>Many of the useful properties of convex conjugates can be found on the relevant <a href="https://en.wikipedia.org/w/index.php?title=Convex_conjugate&oldid=1007941296">wikipedia page</a>, so I won’t list those here again.</p>
<h3 id="convex-conjugates-vssubdifferentials">Convex conjugates vs. subdifferentials</h3>
<div class="proposition" text="Fenchel-Young inequality">
<p>Let \(f \in \Gamma(E)\). By definition of the convex conjugate, we have
Fenchel-Young’s inequality:</p>
\[\forall x \in E, \forall x^* \in E',~
f(x) + f^*(x^*) \geq \left\langle x, x^* \right\rangle_E.\]
<p>For
\(x \in E\) and \(x^* \in E'\),</p>
<ul>
<li>
<p>\(x^* \in \partial f(x)\) iff \(x^*\) saturates Fenchel-Young’s
inequality, iff \(x^*\) achieves the sup in the definition of
\(f(x) = f^{**}(x) = \sup_{x^* \in E'} \left\langle x, x^* \right\rangle_E - f^*(x^*)\).</p>
</li>
<li>
<p>\(x \in \partial f^*(x^*)\) iff \(x\) saturates Fenchel-Young’s
inequality, iff \(x\) achieves the sup in the definition of
\(f^*(x^*) = \sup_{x \in E} \left\langle x, x^* \right\rangle_E - f(x)\).</p>
</li>
<li>
<p>\(x^* \in \partial f(x)\) iff \(x \in \partial f^*(x^*)\).</p>
</li>
</ul>
</div>
<div class="remark" text="subdifferentials as correspondence, Rockafellar 1970, Theorem 24.9">
<p>Up to an additive constant,
\(f \in \Gamma(E)\) is characterized by the binary relation \(\mathcal{R}\)
given by</p>
\[x \mathcal{R}x^* \iff f(x) + f^*(x^*) = \left\langle x, x^* \right\rangle_E.\]
</div>
<div class="proposition" text="norming functionals as subdifferentials">
<p>The norm \(\left\lVert \cdot \right\rVert_E\) is a proper continuous
convex function by definition.</p>
<p>For any \(x \in E\), \(x^* \in E'\) is a norming functional for \(x\) iff
\(x^* \in \partial \left\lVert \cdot \right\rVert_E(x)\). In symbols,</p>
\[x^* \in \partial \left\lVert \cdot \right\rVert_E(x)
\iff
\begin{cases}
\left\lVert x^* \right\rVert_{E'} = 1 \\
\left\langle x, x^* \right\rangle = \left\lVert x \right\rVert_E
\end{cases}\]
<p>In particular, \(\left\lVert \cdot \right\rVert_E\) is differentiable at
\(x\) iff \(\partial \left\lVert \cdot \right\rVert_E(x)\) is a singleton,
iff \(x\) has a unique norming functional. (At \(x = 0_E\) this never happens,
since \(\partial \left\lVert \cdot \right\rVert_E(0_E) = B^{(E')}\).) If
\(\left\lVert \cdot \right\rVert_E\) is differentiable at every \(x \neq 0_E\), then
the mapping
\(\left[ x \mapsto \nabla \left\lVert \cdot \right\rVert_E(x) \right]\) is
well-defined on \(E \setminus \{0_E\}\) and is called the <em>duality mapping</em>.</p>
</div>
<p>Thus, to the convex analyst, norming functionals are not a magical
byproduct of the Hahn-Banach theorem, but simply subgradients of the
norm.</p>
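<p>For example (a standard computation): take \(E = (\mathbb{R}^d, \left\lVert \cdot \right\rVert_p)\) with \(1 < p < \infty\). For \(x \neq 0\), the norm is differentiable at \(x\) and the duality mapping is given coordinate-wise by</p>
\[\left( \nabla \left\lVert \cdot \right\rVert_p (x) \right)_i
= \frac{ \mathop{\mathrm{sign}}(x_i) \left\lvert x_i \right\rvert^{p-1} }{ \left\lVert x \right\rVert_p^{p-1} },\]
<p>which indeed has unit \(\ell_q\)-norm (where \(1/p + 1/q = 1\)) and pairs with \(x\) to give \(\left\lVert x \right\rVert_p\). For \(p = 1\) or \(p = \infty\), differentiability fails at some points \(x \neq 0\), consistently with the non-uniqueness of norming functionals there.</p>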
<h3 id="convex-conjugacy-swaps-strict-convexity-for-differentiability-and-strong-convexity-for-smoothness">Convex conjugacy swaps strict convexity for differentiability, and strong convexity for smoothness</h3>
<div class="definition">
<p>Let \(f \in \Gamma(E)\) and let \(\mu>0\), \(L>0\).</p>
<p>\(f\) is <em>strictly convex</em> if for all \(x_0 \in \mathop{\mathrm{dom}}(f)\),
there exists \(g \in \partial f(x_0)\) such that the strict inequalities
hold:</p>
\[\forall x \in E \setminus \{x_0\},~ f(x) > f(x_0) + \left\langle x-x_0, g \right\rangle_E.\]
<p>\(f\) is <a href="https://xingyuzhou.org/blog/notes/strong-convexity"><em>\(\mu\)-strongly convex</em></a> if for all
\(x_0 \in \mathop{\mathrm{dom}}(f)\), there exists \(g \in \partial f(x_0)\)
such that</p>
\[\forall x \in E,~ f(x) \geq f(x_0) + \left\langle x-x_0, g \right\rangle_E + \frac{\mu}{2} \left\lVert x-x_0 \right\rVert_E^2.\]
<p>\(f\) is differentiable everywhere if it is differentiable at each point
of its domain, that is, if for all \(x_0 \in \mathop{\mathrm{dom}}(f)\),
there exists <em>a unique</em> \(g \in E'\) such that</p>
\[\forall x \in E,~ f(x) \geq f(x_0) + \left\langle x-x_0, g \right\rangle_E.\]
<p>\(f\) is <em>\(L\)-smooth</em> if for all \(x_0 \in \mathop{\mathrm{dom}}(f)\), there
exists \(g \in \partial f(x_0)\) such that</p>
\[\forall x \in E,~ f(x) \leq f(x_0) + \left\langle x-x_0, g \right\rangle_E + \frac{L}{2} \left\lVert x-x_0 \right\rVert_E^2.\]
<p>Note that \(\mu\)-strong convexity implies strict convexity, and that
\(L\)-smoothness implies differentiability everywhere.</p>
</div>
<div class="proposition" text="Kakade, Shalev-Shwartz, and Tewari, 2009">
<p>Let \(f \in \Gamma(E)\).</p>
<p>\(f\) is strictly convex iff \(f^*\) is differentiable.</p>
<p>\(f\) is \(\mu\)-strongly convex iff \(f^*\) is \(1/\mu\)-smooth.</p>
</div>
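<p>A sanity-check example in a Hilbert space (where \(E' \simeq E\)): \(f = \frac{\mu}{2} \left\lVert \cdot \right\rVert_E^2\) is \(\mu\)-strongly convex and its conjugate \(f^* = \frac{1}{2\mu} \left\lVert \cdot \right\rVert_E^2\) is indeed \(1/\mu\)-smooth. A classic non-Hilbert pair (if I recall correctly, it is discussed in the reference above): the negative entropy \(x \mapsto \sum_i x_i \log x_i\) restricted to the probability simplex is \(1\)-strongly convex w.r.t \(\left\lVert \cdot \right\rVert_1\), and its conjugate, the log-sum-exp function \(x^* \mapsto \log \sum_i e^{x^*_i}\), is \(1\)-smooth w.r.t \(\left\lVert \cdot \right\rVert_\infty\).</p>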
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Or rather, to the pure functional analyst there is no satisfying
way to construct a norming functional, but to the convex analyst
there is... See below. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>In other words, in this document we are only concerned with
content covered in Rockafellar’s 1970 “Convex Analysis” book,
whereas in other contexts Rockafellar’s 1998 “Variational Analysis”
book may be a better reference. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Guillaume WangAnother post that was prepared a while ago and “snoozed” until now… It is part of a planned series of posts on linear models and regularization, and a tinge of optimization. As last time, I’m not satisfied with the math rendering, so here is the LaTeX version.A fun byproduct of my Master’s thesis: symmetric tensor functions are dense in the space of permutation-invariant multivariate functions2021-07-25T00:00:00+00:002021-07-25T00:00:00+00:00https://guillaumew16.github.io/math/2021/07/25/symmetric_funs_density<blockquote>
<p>I prepared this post a long while ago but only posted it here in July. This is because I had written it in LaTeX and converting it to MD was not completely trivial. Since I had no incentive to post it, this small barrier was enough for me to procrastinate several months…</p>
<p>I’m not completely satisfied with the math rendering, so here is the <a href="/contents/symmetric_funs_density.pdf">LaTeX version</a>.</p>
</blockquote>
<p>During my Master’s thesis, I encountered several interesting technical
points that were not directly related to the thesis topic, so that I
left them hanging. Here I will talk about one of them, encountered while
working on Volterra series. (I chose to jump directly into my main point
without giving any context on Volterra series, as it is not necessary;
for a clean introduction to these objects, see chapter 4 of my Master’s
thesis <a href="/contents/Master_s_thesis_report-final.pdf">report</a>.)</p>
<ul id="markdown-toc">
<li><a href="#preliminaries" id="markdown-toc-preliminaries">Preliminaries</a></li>
<li><a href="#the-result-and-why-it-looks-surprising-to-me" id="markdown-toc-the-result-and-why-it-looks-surprising-to-me">The result and why it looks surprising to me</a></li>
<li><a href="#brief-proof-of-the-result" id="markdown-toc-brief-proof-of-the-result">Brief proof of the result</a></li>
<li><a href="#is-the-result-interestinguseful" id="markdown-toc-is-the-result-interestinguseful">Is the result interesting/useful?</a></li>
</ul>
<h3 id="preliminaries">Preliminaries</h3>
<p><strong>Notations and shorthands</strong>
Fix some integer \(n > 0\).</p>
<ul>
<li>
<p>For a point \(\boldsymbol{t}= (t_1,...,t_n) \in \mathbb{R}^n\) and a
permutation \(\sigma \in \mathfrak{S}_n\), \(\boldsymbol{t}_\sigma\) denotes
\((t_{\sigma(1)},...,t_{\sigma(n)})\).</p>
</li>
<li>
<p>Call a multivariate function \(g: \mathbb{R}^n \to \mathbb{R}\)
<em>permutation-invariant</em> if for any permutation \(\sigma\), it holds
\(g(\boldsymbol{t}) = g(\boldsymbol{t}_\sigma)\) for all \(\boldsymbol{t}\in \mathbb{R}^n\).</p>
</li>
<li>
<p>For any function \(f: \mathbb{R}\to \mathbb{R}\), denote
\(f^{\otimes n}: \left[
\mathbb{R}^n \to \mathbb{R},
\boldsymbol{t}\mapsto f(t_1)...f(t_n)
\right]\). Call
\(f^{\otimes n}\) the associated <em>symmetric tensor function</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> –
tensor because it is a product of single-variable functions, and
symmetric because all of those single-variable functions are the
same.</p>
</li>
<li>
<p>For any multivariate function \(g: \mathbb{R}^n \to \mathbb{R}\),
denote \(\mathop{\mathrm{Sym}}g: \left[
\mathbb{R}^n \to \mathbb{R},
\boldsymbol{t}\mapsto \frac{1}{n!} \sum_{\sigma \in \mathfrak{S}_n} g(\boldsymbol{t}_\sigma)
\right]\).</p>
</li>
</ul>
<p>We will sometimes write physicist-style \(f(t)\) to mean a function
\(f: \mathbb{R}\to \mathbb{R}\), and similarly \(g(\boldsymbol{t})\) instead of
\(g: \mathbb{R}^n \to \mathbb{R}\).</p>
<p><strong>Some function spaces</strong>
Fix \(1 \leq p < \infty\) and \(q\) its conjugate exponent, i.e \(1/p+1/q=1\).</p>
<ul>
<li>
<p>Let \(L^p(\mathbb{R})\) be the Banach space of \(L^p\)-integrable
functions over \(\mathbb{R}\) (with the usual Lebesgue measure). Its
dual space is \(L^q(\mathbb{R})\).</p>
</li>
<li>
<p>Let \(C_0(\mathbb{R})\) be the space of vanishing continuous functions
over \(\mathbb{R}\). Its dual space is \(\mathcal{M}(\mathbb{R})\), the
space of Radon measures.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
</li>
<li>
<p>Similarly define \(L^p(\mathbb{R}^n)\), \(L^q(\mathbb{R}^n)\),
\(C_0(\mathbb{R}^n)\) spaces of multivariate functions with \(n\) scalar
variables.</p>
</li>
<li>
<p>Denote \(L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\),
\(L^q_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\),
\(C_{0 \mathop{\mathrm{Sym}}}(\mathbb{R}^n)\) the respective (closed)
subspaces consisting of permutation-invariant functions.</p>
</li>
</ul>
<p>Note that our shorthand \(\mathop{\mathrm{Sym}}\) can be viewed as a
projection operator from \(L^p(\mathbb{R}^n)\) to
\(L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\), and from
\(C_0(\mathbb{R}^n)\) to \(C_{0 \mathop{\mathrm{Sym}}}(\mathbb{R}^n)\).</p>
<h3 id="the-result-and-why-it-looks-surprising-to-me">The result and why it looks surprising to me</h3>
<div class="proposition" text="main result">
<p>Let \(1 \leq p < \infty\). The set \(\left\lbrace
f^{\otimes n}(\boldsymbol{t}) ;~ f \in L^p(\mathbb{R})
\right\rbrace\) has its linear span dense in
\(L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\).</p>
<p>The set \(\left\lbrace
f^{\otimes n}(\boldsymbol{t}) ;~ f \in C_0(\mathbb{R})
\right\rbrace\) has its linear span dense in
\(C_{0 \mathop{\mathrm{Sym}}}(\mathbb{R}^n)\).<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
</div>
<p>More explicitly: <em>any
\(g(\boldsymbol{t}) \in C_{0 \mathop{\mathrm{Sym}}}(\mathbb{R}^n)\) is
arbitrarily-well uniformly approximated by finite linear combinations of the form
\(\sum_{i \leq m} c_i f_i(t_1)...f_i(t_n)\)</em> (\(m < \infty\),
\(c_i \in \mathbb{R}\), \(f_i \in C_0(\mathbb{R})\)).</p>
<p><strong>This is not Weierstrass with symmetrization</strong>
As an obvious corollary, the proposition holds when \(\mathbb{R}\) is
replaced by a closed interval \(I \subset \mathbb{R}\). In this case the
result looks like a straightforward consequence of the Weierstrass
approximation theorem, but it is not. Consider the following valid
reasoning:</p>
<blockquote>
<p>Fix a continuous function \(g(\boldsymbol{t})\) over the compact \(I^n\) and let
\(\varepsilon>0\). By the Weierstrass approximation theorem, there
exists a polynomial \(P(\boldsymbol{t})\) such that
\(\left\lVert g-P \right\rVert := \sup_{I^n} \left\lvert g-P \right\rvert \leq \varepsilon\),
and \(P(\boldsymbol{t})\) can be written as
\(P(\boldsymbol{t}) = \sum_{\alpha \in \mathbb{N}^n} a_\alpha \boldsymbol{t}^\alpha\)
(where there are only a finite number of nonzero coefficients
\(a_\alpha\) and the shorthand \(\boldsymbol{t}^\alpha\) denotes
\(t_1^{\alpha_1} ... t_n^{\alpha_n}\)).</p>
<p>If in addition \(g(\boldsymbol{t})\) is permutation-invariant, then the lemma
below shows that \(\mathop{\mathrm{Sym}}P(\boldsymbol{t})\) is also an
\(\varepsilon\)-approximation of \(g(\boldsymbol{t})\), and it can be written as
\(\mathop{\mathrm{Sym}}P(\boldsymbol{t}) = \sum_{\alpha \in \mathbb{N}^n} a_\alpha \mathop{\mathrm{Sym}}\boldsymbol{t}^\alpha
= \sum_{\alpha \in \mathbb{N}^n} \sum_{\sigma \in \mathfrak{S}_n} \frac{a_\alpha}{n!} t_1^{\alpha_{\sigma(1)}} ... t_n^{\alpha_{\sigma(n)}}.\)</p>
</blockquote>
<p>This does not show that \(g(\boldsymbol{t})\) can be approximated by a finite
combination of symmetric tensor functions, as the reasoning may yield
approximators such as \(t_1 t_2^3 + t_1^3 t_2\) (if \(n=2\)), which are not
of the required form.</p>
<div class="lemma" text="the aforementioned lemma">
<p>For any function \(g(\boldsymbol{t})\) over \(\mathbb{R}^n\) and any
\(1 \leq p \leq \infty\), it holds:
\(\left\lVert \mathop{\mathrm{Sym}}g \right\rVert_{L^p} \leq \left\lVert g \right\rVert_{L^p}\).</p>
<p>For any permutation-invariant \(g(\boldsymbol{t})\) and any function \(h(\boldsymbol{t})\),
if \(\left\lVert g - h \right\rVert_{L^p} \leq \varepsilon\), then
\(\left\lVert g - \mathop{\mathrm{Sym}}h \right\rVert_{L^p} \leq \varepsilon\).</p>
</div>
<div class="proof">
<p>Let any function \(g(\boldsymbol{t})\) over \(\mathbb{R}^n\) and any
\(1 \leq p \leq \infty\). By definition of the \(L^p\) norm,</p>
\[\left\lVert \mathop{\mathrm{Sym}}g \right\rVert_{L^p} = \left\lVert \frac{1}{n!} \sum_{\sigma \in \mathfrak{S}_n} g(\boldsymbol{t}_\sigma) \right\rVert_{L^p}
\leq \frac{1}{n!} \sum_{\sigma \in \mathfrak{S}_n} \left\lVert g(\boldsymbol{t}_\sigma) \right\rVert_{L^p}
= \left\lVert g \right\rVert_{L^p}.\]
<p>Let any permutation-invariant function \(g(\boldsymbol{t})\) and any function
\(h(\boldsymbol{t})\) such that
\(\left\lVert g - h \right\rVert_{L^p} \leq \varepsilon\). Then</p>
<p>\(\left\lVert g - \mathop{\mathrm{Sym}}h \right\rVert_{L^p} = \left\lVert \mathop{\mathrm{Sym}}(g-h) \right\rVert_{L^p} \leq \left\lVert g-h \right\rVert_{L^p} \leq \varepsilon.\)</p>
</div>
<p><strong>An example of surprise</strong>
In fact the case of permutation-invariant polynomials over a compact set
is already surprising to me... and even just the following example is:</p>
<blockquote>
<p>Consider the function \(g(\boldsymbol{t}) = t_1 + ... + t_n\) over \([0,1]^n\).
According to the proposition, there exist \(m<\infty\), coefficients
\(c_i \in \mathbb{R}\) and functions \(f_i(t) \in C([0,1])\) (\(i \leq m\)) such that
\(g(\boldsymbol{t}) \approx \sum_{i \leq m} c_i f_i(t_1)...f_i(t_n)\), in the sense
of uniform approximation over \([0,1]^n\).</p>
</blockquote>
<p>I wonder what these \(f_i\) could look like. Since they are continuous
over \([0,1]\), according to the Weierstrass approximation theorem we may
assume without loss of generality that each \(f_i\) is polynomial. Then,
developing the product \(f_i(t_1)...f_i(t_n)\) would yield an a priori big
polynomial, whereas directly using Weierstrass with symmetrization can
yield simply \(t_1+...+t_n\) itself.</p>
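<p>Actually, for this particular \(g\) and \(n=2\), if one allows signed coefficients – i.e works with linear combinations, as in the span statement of the proposition – the polarization identity (the same tool used in the proof below) gives an exact answer:</p>
\[t_1 + t_2 = \frac{1}{2} (1+t_1)(1+t_2) - \frac{1}{2} (1-t_1)(1-t_2),\]
<p>i.e \(m = 2\), \(c_1 = -c_2 = \frac{1}{2}\), \(f_1(t) = 1+t\), \(f_2(t) = 1-t\). So in this small case at least, the \(f_i\) need not be exotic; the a priori big products simply cancel out.</p>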
<h3 id="brief-proof-of-the-result">Brief proof of the result</h3>
<p>In this section we prove the \(L^p/L^q\) (\(1 \leq p < \infty\)) part of the
proposition; the \(C_0/\mathcal{M}\) part can be proved by the same
arguments, with minor modifications.</p>
<p>I assume the reader is familiar with the basics of functional analysis
and duality in Banach spaces. Recall the following density criterion,
which is a consequence of the Hahn-Banach theorem (as are many things):</p>
<div class="lemma" text="density criterion">
<p>Let \(E\) be a Banach space and \(A\) a subset.
\(\begin{aligned}
\mathop{\mathrm{span}}(A) \text{ is dense in } E
&& \iff &&
\left\lbrace
X \in E';~
\forall a \in A, \left\langle a, X \right\rangle_E = 0
\right\rbrace
= \{ 0_{E'} \}.
\end{aligned}\)</p>
</div>
<p>The set on the right is sometimes denoted \(A^\perp\) and called the
<em>annihilator</em> of \(A\). Note that
\(A^\perp = (\mathop{\mathrm{span}}(A))^\perp\). In the case where \(E\) is
a Euclidean space, \(A^\perp\) is just (up to isometry) the orthogonal
complement of \(\mathop{\mathrm{span}}(A)\).</p>
<p>The proposition will be proved by applying the above density criterion.
To do so we will need the following intuitively obvious lemma,
characterizing the dual of \(L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\).
A formal and rather uninteresting proof can be found
in appendix of the <a href="/contents/symmetric_funs_density.pdf">LaTeX version</a>.</p>
<div class="lemma">
<p>The dual of \(L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\) is isometrically
isomorphic to \(L^q_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\).</p>
</div>
<p>We can now prove the proposition. The main argument is extracted from
(Boyd Chua Desoer 1984, Theorem 2.5.2) in the context of Volterra
series.</p>
<div class="proof" text="of proposition">
<p>To apply the density criterion to \(A = \left\lbrace
f^{\otimes n}(\boldsymbol{t}) ;~ f \in L^p(\mathbb{R})
\right\rbrace\) and \(E = L^p_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\),
let \(h \in E' \simeq L^q_{\mathop{\mathrm{Sym}}}(\mathbb{R}^n)\) such
that \(\left\langle f^{\otimes n}, h \right\rangle_{L^p} = 0\) for all
\(f \in L^p(\mathbb{R})\). Let us show that \(h=0\), from which the
proposition will follow.</p>
<p>Denote \(\Phi_h: L^p(\mathbb{R}) \to \mathbb{R}\) the \(n\)-homogeneous map</p>
\[\Phi_h[f]
= \left\langle f^{\otimes n}, h \right\rangle_{L^p}
= \int_{\mathbb{R}^n} d\boldsymbol{t}~ h(t_1,...,t_n) f(t_1)...f(t_n)\]
<p>and \(\Psi_h: L^p(\mathbb{R})^n \to \mathbb{R}\) the associated \(n\)-linear
system</p>
\[\Psi_h\{f_1,...,f_n\}
= \left\langle f_1 \otimes ... \otimes f_n, h \right\rangle_{L^p}
= \int_{\mathbb{R}^n} d\boldsymbol{t}~ h(t_1,...,t_n) f_1(t_1)...f_n(t_n).\]
<p>The \(n\)-linear system \(\Psi_h\{\cdot,...,\cdot\}\) is symmetric in its
arguments, since \(h\) is permutation-invariant. So \(\Psi_h\) is completely
determined by the \(n\)-homogeneous map \(\Phi_h[\cdot]\) via the <a href="https://en.wikipedia.org/wiki/Polarization_of_an_algebraic_form">algebraic
polarization identity</a></p>
\[n! \Psi_h\{f_1,...,f_n\}
= \left.
\frac{\partial^n}{\partial \alpha_1 \cdots \partial \alpha_n}
\right|_{\boldsymbol{\alpha}=0}
\Phi_h \left[ \sum_{i=1}^n \alpha_i f_i \right],\]
<p>and the right-hand side is an \(n\)-th order partial derivative of the map
\(\boldsymbol{\alpha} \mapsto \Phi_h \left[ \sum_i \alpha_i f_i \right]\), which is
identically zero since \(\Phi_h \equiv 0\) by assumption on \(h\). Consequently,</p>
\[\forall f_1,...,f_n \in L^p(\mathbb{R}),~ \Psi_h\{f_1,...,f_n\} = 0.\]
<p>Now evaluate this at
\(f_1(t) = \boldsymbol{1}_{t \in A_1}, ..., f_n(t) = \boldsymbol{1}_{t \in A_n}\)
for intervals \(A_i \subset \mathbb{R}\):</p>
\[\Psi_h\{f_1,...,f_n\} = \int_{\mathbb{R}^n} d\boldsymbol{t}~ h(t_1,...,t_n)
\boldsymbol{1}_{\boldsymbol{t}\in A_1 \times ... \times A_n} = 0.\]
<p>Since this holds for all \(A_i\), and hyperrectangles generate the Borel
\(\sigma\)-algebra, it follows that \(h=0\), as claimed.</p>
</div>
<h3 id="is-the-result-interestinguseful">Is the result interesting/useful?</h3>
<p>For the subjects that I’m currently leaning towards, the result
presented in this document is actually pretty useless, as it only talks
about approximability per se. It doesn’t give any guarantees on the
nature nor the number of functions \(f_i\) required to
\(\varepsilon\)-approximate a given target function \(g\).</p>
<p>However I still find the result technically interesting and surprising.
I never heard about it before but I’m certain it must be somewhere out
there already – I would be glad to know where and in what context.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Disclaimer: the term "symmetric tensor function" may not be
consistent with standard terminology, I haven’t checked. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="https://regularize.wordpress.com/2011/11/11/dual-spaces-of-continuous-functions/">https://regularize.wordpress.com/2011/11/11/dual-spaces-of-continuous-functions/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I’m pretty sure the same holds if \(C_0\) is replaced by \(C_b\) i.e
if we consider bounded continuous functions, instead of vanishing
continuous. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Guillaume WangI prepared this post a long while ago but only posted it here in July. This is because I had written it in LaTeX and converting it to MD was not completely trivial. Since I had no incentive to post it, this small barrier was enough for me to procrastinate several months… I’m not completely satisfied with the math rendering, so here is the LaTeX version.