"""
Gradient-TD(λ) Learning Algorithm, via Adam White's doctoral thesis, pg. 47.,
and Maei's doctoral thesis pg. 74 and 91-92 for the original derivation and
analysis.
Note that the algorithm is referred to as TDC(λ) or GTD(λ) in that work,
whereas GTD and GTD2 refer to variations on the same idea but without
eligibility traces.
The advantage of GTD(λ) is its stability in the off-policy setting, at the
expense of worse sample efficiency and therefore slower learning.
The step-sizes have to be chosen carefully, however, since there is some risk of
divergence.
Rules of thumb might be setting `beta` to `alpha/100` or similar; `alpha` might
also have to be set to a smaller value than you would use with TD(λ).
In Latex, the update equations look like:
δ_{t} = R_{t+1} + γ_{t+1} w_{t}^T x_{t+1} - w_{t}^{T} x_{t}
e_{t} = ρ_{t} (λ_{t} γ_{t} e_{t-1} + x_{t})
w_{t+1} = w_{t} + α[ δ_{t} e_{t} + γ_{t+1} (1 - λ_{t}) ( e_{t}^{T} h_{t} ) x_{t+1} ]
h_{t+1} = h_{t} + β[ δ_{t} e_{t} - ( h_{t}^{T} x_{t} ) x_{t} ]
Where:
- δ refers to the temporal difference error
- γ is the discount parameter
- λ is the bootstrapping parameter
- α and β are step-size parameters
- w and h are weight vectors
- e is the eligibility trace
- x and r are feature vectors and rewards respectively
"""
import numpy as np


class GTD:
    """Gradient Temporal Difference Learning, or GTD(λ). Suitable for
    off-policy learning, but with typically lower sample efficiency than TD(λ).

    Attributes
    ----------
    n : int
        The number of features (and therefore the length of the weight vector).
    e : Vector[float]
        The eligibility trace vector.
    w : Vector[float]
        The weight vector.
    h : Vector[float]
        The gradient adjustment weight vector.

    Notes
    -----
    See pages 74 and 91-92 of Maei's thesis for the definition of the algorithm.
    """
    def __init__(self, n):
        """Initialize the learning algorithm.

        Parameters
        ----------
        n : int
            The number of features, i.e. expected length of the feature vector.
        """
        self.n = n
        self.e = np.zeros(self.n)
        self.w = np.zeros(self.n)
        self.h = np.zeros(self.n)

    def get_value(self, x):
        """Get the approximate value for feature vector `x`."""
        return np.dot(self.w, x)

    def update(self, x, r, xp, alpha, beta, gm, gm_p, lm, lm_p, rho):
        """Update from new experience, i.e. from a transition `(x, r, xp)`.

        Parameters
        ----------
        x : array_like
            The observation/features from the current timestep.
        r : float
            The reward from the transition.
        xp : array_like
            The observation/features from the next timestep.
        alpha : float
            The step-size parameter for updating the weight vector.
        beta : float
            The step-size parameter for updating the correction weights.
        gm : float
            Gamma, abbreviated `gm`, the discount factor for the current state.
        gm_p : float
            The discount factor for the next state.
        lm : float
            Lambda, abbreviated `lm`, is the bootstrapping parameter for the
            current timestep.
        lm_p : float
            The bootstrapping parameter for the next timestep.
        rho : float
            The importance sampling ratio between the target policy and the
            behavior policy for the current timestep.

        Returns
        -------
        delta : float
            The temporal difference error from the update.

        Notes
        -----
        Features (`x` and `xp`) are assumed to be 1D arrays of length `self.n`.
        Other parameters are floats but are generally expected to be in the
        interval [0, 1].
        """
        # Temporal difference error for the transition
        delta = r + gm_p*np.dot(self.w, xp) - np.dot(self.w, x)
        # Decay and accumulate the importance-sampled eligibility trace
        self.e = rho*(lm*gm*self.e + x)
        # Main weight update, including the gradient-correction term
        self.w += alpha*(delta*self.e - gm_p*(1 - lm_p)*np.dot(self.e, self.h)*xp)
        # Update the auxiliary weights used to form the correction term
        self.h += beta*(delta*self.e - np.dot(self.h, x)*x)
        return delta

    def reset(self):
        """Reset weights, traces, and other parameters."""
        self.e = np.zeros(self.n)
        self.w = np.zeros(self.n)
        self.h = np.zeros(self.n)
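

# ---------------------------------------------------------------------------
# A minimal usage sketch, not part of the GTD class itself. The "environment"
# below is made up purely for illustration: random binary feature vectors, a
# constant reward, and constant discount/bootstrapping parameters. It mainly
# shows how the arguments to `update` fit together, and uses the rule of thumb
# from the module docstring of setting `beta` to roughly `alpha/100`.
# ---------------------------------------------------------------------------
if __name__ == '__main__':
    np.random.seed(0)
    n = 8
    agent = GTD(n)
    alpha, beta = 0.01, 0.0001
    gm, gm_p = 0.9, 0.9    # constant discount factor
    lm, lm_p = 0.8, 0.8    # constant bootstrapping parameter
    rho = 1.0              # on-policy data, so the importance sampling ratio is 1
    x = np.random.binomial(1, 0.5, n).astype(float)
    for t in range(1000):
        xp = np.random.binomial(1, 0.5, n).astype(float)
        r = 1.0
        delta = agent.update(x, r, xp, alpha, beta, gm, gm_p, lm, lm_p, rho)
        x = xp
    print("Estimated value of the final state:", agent.get_value(x))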