forked from RussTedrake/underactuated
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpolicy_search.html
579 lines (475 loc) · 30.4 KB
/
policy_search.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
<!DOCTYPE html>
<html>
<head>
<title>Ch. 11 - Policy
Search</title>
<meta name="Ch. 11 - Policy
Search" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://underactuated.mit.edu/policy_search.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="chapters.js"></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="htmlbook/MathJax/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('underactuated');">
<div data-type="titlepage">
<header>
<h1><a href="index.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
<p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2023<br/>
Last modified <span id="last_modified"></span>.</br>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p><b>Note:</b> These are working notes used for <a
href="https://underactuated.csail.mit.edu/Spring2023/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2023 semester. <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.</p>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=trajopt.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=planning.html>Next Chapter</a></td>
</tr></table>
<script type="text/javascript">document.write(notebook_header('policy_search'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 10"><h1>Policy
Search</h1>
<p>In our case study on <a href="trajopt.html#perching">perching
aircraft</a>, we solved a challenging control problem, but our approach to
control design was based on only linear optimal control (around an optimized
trajectory). We've also discussed some approaches to nonlinear optimal
control that could scale beyond small, discretized state spaces. These were
based on estimating the cost-to-go function, including
<a href="dp.html#function_approximation">value iteration using function
approximation</a> and <a href="lyapunov.html#adp">approximate dynamic
programming as a linear program or as a sums-of-squares program</a>. </p>
<p>There are a lot of things to like about methods that estimate the
cost-to-go function (aka value function). The cost-to-go function reduces
the long-term planning problem into a one-step planning problem; it encodes
all relevant information about the future (and nothing more). The HJB gives
us optimality conditions for the cost-to-go that give us a strong algorithmic
handle to work with. <!--But there is, I think, one very reasonable concern
about value-based methods: we have seen many examples where the optimal
cost-to-go function can be a very complicated function (by most measures of
complexity; e.g. it would require a very fine discretization or high-degree
polynomial to approximate accurately), even when the dynamics are smooth (or
low-degree). The swing-up problem for the pendulum is a nice example: the
dynamics are smooth and simple, and even with a quadratic cost, the optimal
policy is discontinuous (there are neighboring states that will take one
"pump" or two pumps to reach the goal).--></p>
<p>In this chapter, we will explore another very natural idea: let us
parameterize a controller with some decision variables, and then search over
those decision variables directly in order to achieve a task and/or optimize
a performance objective. One specific motivation for this approach is the
(admittedly somewhat anecdotal) observation that often times simple policies
can perform very well on complicated robots and with potentially complicated
cost-to-go functions.</p>
<p>We'll refer to this broad class of methods as "policy search" or, when
optimization methods are used, we'll sometimes use "policy optimization".
This idea has not received quite as much attention from the controls
community, probably because we know many relatively simple cases where it
does not work well. But it has become very popular again lately due to the
empirical success of "policy gradient" algorithms in reinforcement learning
(RL). This chapter includes a discussion of the "model-based" version of
these RL policy-gradient algorithms; we'll describe their "model-free"
versions in <a href="rl_policy_search.html">a future chapter</a>.
</p>
<section><h1>Problem formulation</h1>
<p>Consider a static full-state feedback policy, $$\bu =
\bpi_\balpha(\bx),$$ where $\bpi$ is potentially a nonlinear function, and
$\balpha$ is the vector of parameters that describe the controller. The
control might take time as an input, or might even have it's own internal
state, but let's start with this simple form. </p>
<p>Using our prescription for optimal control using additive costs, we can
evaluate the performance of this controller from any initial condition
using, e.g.: \begin{align*} J_\balpha(\bx) =& \int_0^\infty \ell(\bx(t),
\bu(t)) dt, \\ \subjto \quad & \dot\bx = f(\bx, \bu), \quad \bu =
\bpi_\balpha(\bx), \quad \bx(0) = \bx.\end{align*} In order to provide a
scalar cost for each set of policy parameters, $\balpha$, we need one more
piece: a relative importance for the different initial conditions.</p>
<p>As we will further elaborate when we discuss <a
href="robust.html">stochastic optimal control</a>, a very natural choice --
one that preserves the recursive structure of the HJB -- is to optimize the
expected value of the cost, given some distribution over initial
conditions: $$\min_\balpha E_{\bx \sim {\mathcal X}_0} \left[J_\balpha(\bx)
\right],$$ where ${\mathcal X}_0$ is a probability distribution over
initial conditions, $\bx(0).$</p>
</section>
<section id="lqr"><h1>Linear Quadratic Regulator</h1>
<p>To start thinking about the problem of searching directly in the policy
parameters, it's very helpful to start with a problem we know and can
understand well. In LQR problems for linear, time-invariant systems, we
know that the optimal policy is a linear function: $\bu = -{\bf K}\bx.$
So far, we have always obtained ${\bf K}$ indirectly -- by solving a
Riccati equation to find the cost-to-go and then backing out the optimizing
policy. Here, let us study the case where we parameterize the elements of
${\bf K}$ as decision variables, and attempt to optimize the expected
cost-to-go directly.</p>
<subsection><h1>Policy Evaluation</h1>
<p>First, let's evaluate our objective for a given ${\bf K}$. This step
is known as "<i>policy evaluation</i>". If we use a Gaussian with mean
zero and covariance ${\bf \Omega}$ as our distribution over initial
conditions, then for LQR we have \begin{align*} & E\left[ \int_0^\infty
[\bx^T{\bf Q}\bx + \bu^T\bR\bu] dt \right], \\ \subjto \quad & \dot\bx =
{\bf A}\bx + {\bf B}\bu, \quad \bu = - {\bf K}\bx, \quad \bx(0) \sim
\mathcal{N}(0, {\bf \Omega}).\end{align*} This problem is also known as
the $\mathcal{H}_2$ optimal control problem<elib>Chen15a</elib>, since the
expected-cost-to-go here is the $\mathcal{H}_2$-norm of the linear system
from disturbance input (here only an impulse that generates our initial
conditions) to performance output (which here would be e.g. $\bz =
\begin{bmatrix} \bQ^{\frac{1}{2}}\bx \\ \bR^{\frac{1}{2}}\bu
\end{bmatrix}).$</p>
<p>To evaluate a policy $\bK$, let us first re-arrange the cost function
slightly, using the properties of the matrix trace: \begin{gather*}
\bx^T\bQ\bx + \bx^T\bK^T\bR\bK\bx = \bx^T(\bQ + \bK^T\bR\bK)\bx^T =
\trace\left((\bQ + \bK^T\bR\bK)\bx\bx^T\right), \end{gather*} and the
linearity of the integral and expected value: $$E\left[ \int_0^\infty
\trace((\bQ + \bK^T\bR\bK)\bx\bx^T) dt \right] = \trace\left((\bQ +
\bK^T\bR\bK) E \left[\int_0^\infty \bx\bx^T dt \right]\right),$$ For any
given initial condition, the solution of the closed-loop dynamics is
given by the matrix exponential: $$\bx(t) = e^{(\bA - \bB\bK)t}\bx(0).$$
For the distribution of initial conditions, we have \begin{gather*} \\
E\left[\bx(t)\bx(t)^T\right] = e^{(\bA - \bB\bK)t}
E\left[\bx(0)\bx(0)^T\right] e^{(\bA - \bB\bK)^Tt} = e^{(\bA - \bB\bK)t}
{\bf \Omega} e^{(\bA - \bB\bK)^Tt}, \end{gather*} which is just a
(symmetric) matrix function of $t$. The integral of this function, call
it ${\bf X}$, represents the expected 'energy' of the closed-loop
response: $${\bf X} = E\left[ \int_0^\infty \bx \bx^T dt \right].$$
Assuming $\bK$ is stabilizing, ${\bf X}$ can be computed as the (unique)
solution to the Lyapunov equation: $$(\bA - \bB\bK){\bf X} + {\bf X}(\bA
- \bB\bK)^T + {\bf \Omega} = 0.$$ (You can see a closely related derivation <a href="https://en.wikipedia.org/wiki/Controllability_Gramian">here</a>). Finally, the total policy evaluation
is given by \begin{equation}E_{\bx \sim \mathcal{N}(0,{\bf \Omega})}
[J_\bK(\bx)] = \trace\left((\bQ + \bK^T\bR\bK){\bf
X}\right)\label{eq:lqr_evaluation}.\end{equation}</p>
</subsection>
<subsection id="lqr_lmi"><h1>A nonconvex objective in ${\bf K}$</h1>
<p>Unfortunately, the Lyapunov equation represents a nonlinear constraint
(on the pair $\bK$, ${\bf X}$). Indeed, it is well known that even the
set of controllers that stabilizing a linear systems can be nonconvex in
the parameters $\bK$ when there are 3 or more state
variables<elib>Fazel18</elib>.</p>
<example><h1>The set of stabilizing $\bK$ can be non-convex</h1>
<p>The following example was given in <elib>Fazel18</elib>. Consider a
discrete-time linear system with $\bA = {\bf I}_{3 \times 3}, \bB =
{\bf I}_{3 \times 3}$. The controllers given by $${\bf K}_1 =
\begin{bmatrix} 1 & 0 & -10 \\ -1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\quad \text{and} \quad \bK_2 = \begin{bmatrix} 1 & -10 & 0 \\ 0 & 1 & 0
\\ -1 & 0 & 1 \end{bmatrix},$$ are both stabilizing controllers for
this system (all eigenvalues of $\bA - \bB\bK$ are inside the unit
circle of the complex plane). However the controller $\hat{\bK} =
(\bK_1 + \bK_2)/2$ has two eigenvalues outside the unit circle.
</p>
</example>
<p>Since the set of controllers that achieve finite total cost is
non-convex, clearly the cost function we consider here is also
non-convex.</p>
<p>As an aside, for this problem we do actually know a change of
variables that make the problem convex. Let's introduce a new variable
${\bf Y} = {\bf KX}.$ Since ${\bf X}$ is PSD, we can back out $\bK =
{\bf YX}^{-1}.$ Now we can rewrite the optimization: \begin{align*}
\min_{{\bf X}, {\bf Y}} & \quad \trace{\bQ^\frac{1}{2} {\bf X}
\bQ^\frac{1}{2}} + \trace{\bR^\frac{1}{2} {\bf Y} {\bf X}^{-1} {\bf Y}^T
\bR^\frac{1}{2}} \\ \subjto & \quad \bA{\bf X} - \bB{\bf Y} + {\bf
X}\bA^T - {\bf Y}^T\bB^T + {\bf \Omega} = 0, \\ & \quad {\bf X} \succ
0.\end{align*} The second term in the objective appears to be nonconvex,
but is actually convex. In order to write it as a SDP, we can replace it
exactly with one more slack variable, ${\bf Z}$, and a Schur
complement<elib>Johnson07</elib>: \begin{align*} \min_{{\bf X}, {\bf Y},
{\bf Z}} & \quad \trace{\bQ^\frac{1}{2} {\bf X} \bQ^\frac{1}{2}} +
\trace{\bf Z} \\ \subjto & \quad \bA{\bf X} - \bB{\bf Y} + {\bf X}\bA^T -
{\bf Y}^T\bB^T + {\bf \Omega} = 0, \\ & \quad \begin{bmatrix} {\bf Z} &
\bR^\frac{1}{2} {\bf Y} \\ {\bf Y}^T \bR^\frac{1}{2} & {\bf X}
\end{bmatrix} \succeq 0.\end{align*}</p>
<p>Nevertheless, our original question is asking about searching directly
in the original parameterization, $\bK$. If the objective in nonconvex in those parameters, then how should we perform the search?</p>
</subsection>
<subsection><h1>No local minima</h1>
<p>Although convexity is sufficient to guarantee that an optimization
landscape does not have any local minima, it is not actually necessary.
<elib>Fazel18</elib> showed that for this LQR objective, all local optima
are in fact global optima. This analysis was extended in
<elib>Mohammadi19</elib> to give a simpler analysis and include
convergence rates.</p>
<p>How does one show that an optimization landscape has no local minima
(even though it may be non-convex)? One of the most popular tools is to
demonstrate <i>gradient dominance</i> with the famous
<i>Polyak-Lojasiewicz (PL) inequality</i> <elib>Karimi16</elib>. For an
optimization problem $$\min_{\bx \in \Re^d} f(\bx)$$ we first assume the
function $f$ is $L$-smooth (<a
href="https://en.wikipedia.org/wiki/Lipschitz_continuity">Lipschitz</a>
gradients): $$\forall \bx,\bx', \quad \| \nabla f(\bx') - \nabla f(\bx)
\|_2 \le L \| \bx' - \bx \|_2.$$ Then we say that the function satisfies
the PL-inequality if the following holds for some $\mu > 0$: $$\forall
\bx, \quad \frac{1}{2} \| \nabla f(\bx) \|_F^2 \ge \mu (f(\bx) - f^*),$$
where $f^*$ is a value obtained at the optima. In words, the gradient of
the objective must grow faster than the gradient of a quadratic function.
Note, however, that the distance here is measured in $f(\bx)$, not $\bx$;
we do not require (nor imply) that the optimal solution is unique. It
clearly implies that for any minima, $\bx'$ with $\nabla f(x') = 0$,
since the left-hand side is zero, we must have the right-hand side also
be zero, so $x'$ is also a global optima: $f(x') = f^*$. </p>
<!-- another way is to establish quasi-convexity ... etc -->
<example><h1>A nonconvex function with no local minima</h1>
<p>Consider the function $$f(x) = x^2 + 3 \sin^2(x).$$</p>
<figure>
<iframe id="igraph" scrolling="no" style="border:none;"
seamless="seamless" src="data/pl_inequality.html" height="250"
width="100%"></iframe>
</figure>
<p>We can establish that this function is not convex by observing that
for $a=\frac{\pi}{4}$, $b=\frac{3\pi}{4}$, we have
$$f\left(\frac{a+b}{2}\right) = \frac{\pi^2}{4} + 3 \approx 5.47 >
\frac{f(a)+f(b)}{2} = \frac{5\pi^2}{16} + \frac{3}{2} \approx 4.58.$$
</p>
<p>We can establish the PL conditions using the gradient $$\nabla f(x)
= 2x + 6 \sin(x) \cos(x).$$ We can establish that this function is
$L$-smooth with $L=8$ by $$\| \nabla f(b) - \nabla f(a) \|_2 = |2b - 2a
+ 6\sin(b-a)| \le 8|b - a|,$$ because $\sin(x) \le x.$ Finally, we
have gradient-dominance from the PL-inequality:
$$\frac{1}{2}(2x+6\sin(x)\cos(x))^2 \ge \mu(x^2 + 3\sin^2(x)),$$ with
$\mu=0.175$. (I confirmed this with a small
<a href="https://dreal.github.io/">dReal</a> <a
href="policy_search/pl_inequality.py">program</a>).
<figure>
<iframe id="igraph" scrolling="no" style="border:none;"
seamless="seamless" src="data/pl_inequality_grad.html" height="250"
width="100%"></iframe>
</figure>
</p>
</example>
<p><elib>Karimi16</elib> gives a convergence rate for convergence to an
optima for <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> given the PL conditions.
<elib>Mohammadi19</elib> showed that the gradients of the LQR cost we
examine here with respect to $\bK$ satisfy the PL conditions on any
sublevel set of the cost-to-go function.</p>
</subsection>
<subsection><h1>True gradient descent</h1>
<p>The results described above suggest that one can use gradient descent
to obtain the optimal controller, $\bK^*$ for LQR. For the variations
we've seen so far (where we know the model), I would absolutely recommend
that solving the Riccati equations is a much better algorithm; it is
faster and more robust, with no parameters like step-size to tune. But
gradient descent becomes more interesting / viable when we think of it as
a model for a less perfect algorithm, e.g. where the plant model is not
given and the gradients are estimated from noisy samples.</p>
<p>It is a rare luxury, due here to our ability to integrate the linear
plants/controllers, quadratic costs, and Gaussian initial conditions,
that we could compute the value function exactly in
(\ref{eq:lqr_evaluation}). We can also compute the true gradient -- this
is a pinnacle of exactness we should strive for in our methods but will
rarely achieve again. The gradient is given by $$\pd{E[J_\bK(\bx)]}{\bK}
= 2(\bR\bK - \bB^T{\bf P}){\bf X},$$ where ${\bf P}$ satisfies another
Lyapunov equation: $$(\bA - \bB\bK)^T{\bf P} + {\bf P}(\bA - \bB\bK) +
\bQ + \bK^T\bR\bK = 0.$$
</p>
<todo>cite Jack if/when we publish our draft</todo>
<p>Note that the term <i>policy gradient</i> used in reinforcement
learning typically refers to the slightly different class of algorithms I
hinted at above. In those algorithms, we use the true gradients of the
policy (only), but estimate the remainder of the terms in the gradient
through sampling. These methods typically require many samples to
estimate the gradients we compute here, and should only be weaker (less
efficient) than the algorithms in this chapter. The papers investigating
the convergence of gradient descent for LQR have also started exploring
these cases. We will study these so-called <a
href="rl_policy_search.html">model-free" policy search</a>
algorithms soon.</p>
</subsection>
</section>
<section><h1>More convergence results and counter-examples</h1>
<p>LQR / $\mathcal{H}_2$ control is one of the good cases, where we know
that for the objective parameterized directly in $\bK$, all local optima
are global optima. <elib>Zhang20</elib> extended this result for mixed
$\mathcal{H}_2/\mathcal{H}_\infty$ control. <elib>Agarwal20a</elib> gives a
recent treatment of the tabular (finite) MDP case. </p>
<p>For LQR, we also know alternative parameterizations of the controller
which make the objective actually convex, including the <a
href="lqr.html#lmi">LMI formulation</a> and the <a
href="lqr.html#sls">Youla parameterization</a>. Their utility in a policy
search setting was studied initially in <elib>Roberts11</elib>.</p>
<p>Unfortunately, we do not expect these nice properties to hold in
general. There are a number of nearby problems which are known to be
nonconvex in the original parameters. The case of <i>static output
feedback</i> is an important one. If we extend our plant model to include
(potentially limited) observations: $\dot\bx = \bA\bx + \bB\bu, \by =
\bC\bx,$ then searching directly over controllers, $\bu = -{\bf K}\by$, is
known to be NP-hard<elib>Blondel97</elib>. This time, the set of stabilizing ${\bf K}$ matrices may be not only nonconvex, but actually disconnected. We can see that with a simple example (given to me once
during a conversation with Alex Megretski).
</p>
<example id="static_output_feedback">
<h1>Parameterizations of Static Output Feedback</h1>
<p>Consider the single-input, single-output LTI system $$\dot{\bx} = {\bf
A}\bx + {\bf B} u, \quad y = {\bf C}\bx,$$ with $${\bf A} = \begin{bmatrix}
0 & 0 & 2 \\ 1 & 0 & 0 \\ 0 & 1 & 0\end{bmatrix}, \quad {\bf B} =
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad {\bf C} = \begin{bmatrix}
1 & 1 & 3 \end{bmatrix}.$$ Here the linear static-output-feedback policy
can be written as $u = -ky$, with a single scalar parameter $k$. </p>
<p>Here is a plot of the maximum eigenvalue (real-part) of the closed-loop
system, as a function of $k$. The system is only stable when this maximum
value is less than zero. You'll find the set of stabilizing $k$'s is a
disconnected set.</p>
<figure><img width="80%" src="data/static_feedback_disconnected.svg"/></figure>
</example>
<p>There are many other known counter-examples. <elib>Bhandari20</elib>
gives a particularly simple one with a two state MDP. One of the big open
questions is whether deep network parameterizations are an example of a
good case.</p>
</section>
<section><h1>Trajectory-based policy search</h1>
<p>Although we cannot expect gradient descent to converge to a global
minima in general, it is still very reasonable to try using gradient
descent to find policies for more complicated nonlinear control problems.
In the general form, this means that the first step of optimizing
$$\min_\balpha E_{\bx \sim {\mathcal X}_0} \left[J_\balpha(\bx) \right],$$
is estimating $$\pd{}{\balpha} E_{\bx \sim {\mathcal X}_0}
\left[J_\balpha(\bx) \right].$$ In the LQR problem, we were able to
compute these terms exactly; with the biggest simplification coming from
the fact that <a href="stochastic.html#linear_gaussian">the response of a
linear system to Gaussian initial conditions stays Gaussian</a>. This is
not true for more general nonlinear systems. So what are we to do?</p>
<p>The most common/general technique (despite it not being very efficient),
is to approximate the expected cost-to-go using a sampling (Monte-Carlo)
approximation using a large number, $N$, of samples: $$E_{\bx \sim
{\mathcal X}_0}[J_\balpha(\bx)] \approx \frac{1}{N} \sum_{i=0}^{N-1}
J_\balpha(\bx_i), \quad \bx_i \sim \mathcal{X}_0.$$ The gradients follow
easily: $$\pd{}{\balpha} E_{\bx \sim {\mathcal X}_0}[J_\balpha(\bx)]
\approx \frac{1}{N} \sum_{i=0}^{N-1} \pd{J_\balpha(\bx_i)}{\balpha}, \quad
\bx_i \sim \mathcal{X}_0.$$ Our confidence in the accuracy of this
estimate will improve as we increase $N$; see e.g. Section 4.2 of
<elib>Rubinstein16</elib> for details on the confidence intervals. The
most optimistic among us will say that it's quite fine to have only a noisy
estimate of the gradient -- this leads to stochastic gradient descent which
can have some very desirable properties. But it does make the algorithm
harder to analyze and debug.</p>
<p>Using the Monte-Carlo estimator, the total gradient update is just a sum
over gradients with respect to particular initial conditions,
$\pd{J_\balpha(\bx_i)}{\balpha}.$ But for finite-horizon objectives, these
are precisely the gradients that we have already studied in the context of
trajectory optimization. They can be computed efficiently using an adjoint
method. The difference is that here we think of $\balpha$ as the
parameters of a feedback controller, whereas before we thought of them as
the parameters of a trajectory, but this makes no difference to the chain
rule.</p>
<example><h1>LQR with true gradients vs approximate gradients</h1></example>
<example><h1>Pendulum Swing-Up And Balance</h1>
<figure><img width="80%" src="data/trajectory_gradient_descent_diagram.svg"/></figure>
<script>document.write(notebook_link('policy_search', d=deepnote, link_text="", notebook="policy_search"))</script>
</example>
<todo>
Reference Guided Policy Search, etc (see Robert V's review);
PILCO and PIPPS (arxiv from Doya).
</todo>
<subsection><h1>Infinite-horizon objectives</h1>
<todo> by adding an approximate cost-to-go as the terminal cost.</todo>
</subsection>
<subsection><h1>Search strategies for global optimization</h1>
<p>To combat local minima... Evolutionary strategies, ...</p>
<todo>Evolutionary strategies with gradients?</todo>
<todo>Bayesian optimization? Others?</todo>
</subsection>
</section>
<section><h1>Policy Iteration</h1>
<p>In the chapter on Lyapunov analysis we also explored a handful
techniques <a href="lyapunov.html#control">for control design</a>. <a
href="lyapunov.html#control_alternations">One of these</a> even directly
parameterized the controller (as a polynomial feedback), and used
alternations to switch between policy evaluation and policy optimization.
<a href="lyapunov.html#adp">Another</a> generated lower-bounds to the
optimal cost-to-go.</p>
<p>Coming soon...</p>
<todo>DK iterations</todo>
<todo>LSPI (e.g. w/ linear function approximation), Zhouran, etc.</todo>
<todo>Shen's work?</todo>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Chen15a>
<span class="author">Ben M. Chen</span>,
<span class="title">"H2 {Optimal} {Control}"</span>,
<span class="publisher">Encyclopedia of {Systems} and {Control}</span> , pp. 515--520, <span class="year">2015</span>.
</li><br>
<li id=Fazel18>
<span class="author">Maryam Fazel and Rong Ge and Sham M. Kakade and Mehran Mesbahi</span>,
<span class="title">"Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator"</span>,
<span class="publisher">International Conference on Machine Learning</span> , <span class="year">2018</span>.
</li><br>
<li id=Johnson07>
<span class="author">Erik A Johnson and Baris Erkus</span>,
<span class="title">"Dissipativity and performance analysis of smart dampers via LMI synthesis"</span>,
<span class="publisher">Structural Control and Health Monitoring: The Official Journal of the International Association for Structural Control and Monitoring and of the European Association for the Control of Structures</span>, vol. 14, no. 3, pp. 471--496, <span class="year">2007</span>.
</li><br>
<li id=Mohammadi19>
<span class="author">Hesameddin Mohammadi and Armin Zare and Mahdi Soltanolkotabi and Mihailo R Jovanovi{\'c}</span>,
<span class="title">"Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator"</span>,
<span class="publisher">2019 IEEE 58th Conference on Decision and Control (CDC)</span> , pp. 7474--7479, <span class="year">2019</span>.
</li><br>
<li id=Karimi16>
<span class="author">Hamed Karimi and Julie Nutini and Mark Schmidt</span>,
<span class="title">"Linear convergence of gradient and proximal-gradient methods under the polyak-{\l}ojasiewicz condition"</span>,
<span class="publisher">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</span> , pp. 795--811, <span class="year">2016</span>.
</li><br>
<li id=Zhang20>
<span class="author">Kaiqing Zhang and Bin Hu and Tamer Basar</span>,
<span class="title">"Policy Optimization for ℋ₂ Linear Control with ℋ∞ Robustness Guarantee: Implicit Regularization and Global Convergence"</span>,
<span class="publisher">Proceedings of the 2nd Conference on Learning for Dynamics and Control</span> , vol. 120, pp. 179--190, 10--11 Jun, <span class="year">2020</span>.
</li><br>
<li id=Agarwal20a>
<span class="author">Alekh Agarwal and Sham M. Kakade and Jason D. Lee and Gaurav Mahajan</span>,
<span class="title">"On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift"</span>,
, <span class="year">2020</span>.
</li><br>
<li id=Roberts11>
<span class="author">John Roberts and Ian Manchester and Russ Tedrake</span>,
<span class="title">"Feedback Controller Parameterizations for Reinforcement Learning"</span>,
<span class="publisher">Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)</span> , <span class="year">2011</span>.
[ <a href="http://groups.csail.mit.edu/robotics-center/public_papers/Roberts11.pdf">link</a> ]
</li><br>
<li id=Blondel97>
<span class="author">Vincent Blondel and John N Tsitsiklis</span>,
<span class="title">"{NP}-hardness of some linear control design problems"</span>,
<span class="publisher">SIAM journal on control and optimization</span>, vol. 35, no. 6, pp. 2118--2127, <span class="year">1997</span>.
</li><br>
<li id=Bhandari20>
<span class="author">Jalaj Bhandari and Daniel Russo</span>,
<span class="title">"Global {Optimality} {Guarantees} {For} {Policy} {Gradient} {Methods}"</span>,
<span class="publisher">arXiv:1906.01786 [cs, stat]</span>, oct, <span class="year">2020</span>.
</li><br>
<li id=Rubinstein16>
<span class="author">Reuven Y Rubinstein and Dirk P Kroese</span>,
<span class="title">"Simulation and the Monte Carlo method"</span>, John Wiley \& Sons
, vol. 10, <span class="year">2016</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=trajopt.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=planning.html>Next Chapter</a></td>
</tr></table>
<div id="footer">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2023</td></tr>
</table>
</div>
</body>
</html>