
Commit 8f25966

Typo
1 parent 0d37728 commit 8f25966

File tree

1 file changed (+1, −1 lines)


rl/intro.qmd

```diff
@@ -27,7 +27,7 @@ The objective is to identify a **control policy** $π = (π_1, π_2, \dots)$, wh
 $$ A_t = \pi_t(S_{1:t}, A_{1:t-1}).$$
 
 The performance of a policy is still given by the expected discounted reward under that policy, which of course depends on the model parameters. So, we denote the performance of policy $π$ under model $θ$ by
-$$ J(π,θ) = \EXP^{\pi}_{θ} \biggl[ \sum_{t=1}^{∞} R_t \biggr] $$
+$$ J(π,θ) = \EXP^{\pi}_{θ} \biggl[ \sum_{t=1}^{∞} γ^{t-1} R_t \biggr] $$
 where the notation $\EXP^{π}_{θ}$ denotes the fact that the expectation depends on the policy $π$ and model parameters $θ$.
 
 ## The learning objective
```
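The substance of the fix is the missing discount factor $γ^{t-1}$ inside the sum defining $J(π,θ)$. A minimal sketch of what that objective computes on a single sampled trajectory (the reward sequence and $γ$ below are illustrative assumptions, not values from the source):

```python
# Sketch: the discounted objective J(π, θ) from the corrected equation,
# evaluated on one finite sampled reward trajectory R_1, R_2, ..., R_T.

def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**(t-1) * R_t for a finite trajectory.

    enumerate() starts t at 0, which matches the gamma**(t-1) weighting
    for 1-indexed rewards in the equation.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical sampled rewards and discount factor, for illustration only.
rewards = [1.0, 0.0, 2.0]
gamma = 0.9

# Without the fix (gamma omitted, i.e. gamma = 1) the sum is 3.0;
# with discounting it is 1.0 + 0.0*0.9 + 2.0*0.81 = 2.62.
print(discounted_return(rewards, gamma))
```

Averaging `discounted_return` over many trajectories sampled under policy $π$ in model $θ$ gives a Monte Carlo estimate of the expectation $\EXP^{π}_{θ}$ in the equation; without the $γ^{t-1}$ term, the infinite sum in the original equation need not even converge.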
