Agent | → Actions → | Environment |
---|---|---|
← state ← | ||
← reward ← |
Suppose by the ??? play actions $a$ had been choosen $k_a$ times, producing rewards $r_1, r_2, \dots r_{k,a}$ $$Q_t(a) = \frac{r_1, r_2, \dots r_{k,a}}{k_a}$$ $$\lim_{k \rightarrow \infty} Q_t(a) = Q^*(a)$$
$$ Q_k = \frac{r_1, r_2, \ldots, r_k}{k}$$
$$Q_{k+1} = Q_k+\frac{1}{k+1}\[r_{k+1} - Q_k\]$$ Common form: NewEstimate == OldEstimate + StepSize[Target - OldEstimate] ==== Agent-Estimation??? ==== Learn a policy: Policy at step t, $\pi_t$ is a mapping from states to action probabilities $\pi_t(s,a)$=probability that $a_k=a$ when $S_k = S$
Return: $$r_{t+1}, r_{t+2}, \ldots$$
We want to maximize the expected reward, $E\{R_t\}$ for each t
$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$
Discounted reward $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1} \text{where} 0 \le \gamma \le 1$$
$$\text{shortsited} 0 \leftarrow \gamma \rightarrow 1 \text{farsited}$$
$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots s_0, a_0, r_0 \} = Pr \{s_{t+1} = s', r_{t+1} = 1 \mid s_t, a_t \}