  * Softmax grades action probabilities by the estimated values $Q_t$
  * Boltzmann distribution: $\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$ with temperature $\tau$ (sketch below)
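
A minimal Python sketch of this softmax (Boltzmann) action selection; the list of estimated values and the temperature parameter ''tau'' are illustrative names, not from the lecture:
<code python>
import math
import random

def softmax_action(q_values, tau=1.0):
    """Pick an action with probability proportional to exp(Q_t(a)/tau)
    (Boltzmann distribution). High tau -> nearly uniform choice,
    low tau -> nearly greedy choice."""
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs)[0]

# Example: action 2 has the highest estimated value and is chosen most often.
print(softmax_action([0.1, 0.5, 1.2], tau=0.5))
</code>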

===== Action-value estimation =====
$$ Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k} $$

==== Incremental implementation ====
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate ← OldEstimate + StepSize [Target - OldEstimate]
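
A short Python sketch of the incremental update; it checks that the running estimate matches the plain sample average $Q_k$ from above (the reward list is made up for illustration):
<code python>
def incremental_mean(rewards):
    """Q_{k+1} = Q_k + 1/(k+1) * (r_{k+1} - Q_k); no memory of past rewards needed."""
    q = 0.0
    for k, r in enumerate(rewards):
        q += (r - q) / (k + 1)
    return q

rewards = [1.0, 0.0, 2.0, 1.0]
assert abs(incremental_mean(rewards) - sum(rewards) / len(rewards)) < 1e-12
</code>
Replacing the step size $\frac{1}{k+1}$ with a constant gives more weight to recent rewards, the usual choice for non-stationary problems.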

==== Policy and return ====
Learn a policy:
The policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s, a)$ is the probability that $a_t = a$ when $s_t = s$.

Return:
$$r_{t+1}, r_{t+2}, r_{t+3}, \ldots$$

We want to maximize the expected return $E\{R_t\}$ for each step $t$.

$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$

$$\text{shortsighted } 0 \leftarrow \gamma \rightarrow 1 \text{ farsighted}$$
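
A small Python sketch computing the discounted return for every step of a finished episode via the backward recursion $R_t = r_{t+1} + \gamma R_{t+1}$, which follows directly from the sum above; names are illustrative:
<code python>
def returns(rewards, gamma=0.9):
    """rewards[i] is r_{i+1}; compute R_t for t = 0..T-1 using
    R_t = r_{t+1} + gamma * R_{t+1}, with R_T = 0."""
    R = 0.0
    out = []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

print(returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
</code>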

==== Markov Property ====
$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0 \} = Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
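
A tiny sketch of what the Markov property means in code: the transition distribution is keyed only by the current state and action, never by the earlier history. The toy table ''P'' is invented for illustration:
<code python>
import random

# Hypothetical tabular dynamics: Pr{s', r | s, a} as a lookup table.
# Each entry lists (next_state, reward, probability).
P = {
    ("s0", "a0"): [("s0", 0.0, 0.8), ("s1", 1.0, 0.2)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 5.0, 1.0)],
}

def step(state, action):
    """Sample (s_{t+1}, r_{t+1}); only (s_t, a_t) matter -- the Markov property."""
    outcomes = P[(state, action)]
    choices = [(s, r) for s, r, _ in outcomes]
    weights = [p for _, _, p in outcomes]
    return random.choices(choices, weights=weights)[0]

print(step("s0", "a0"))
</code>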