  * Softmax: grade action probabilities by the estimated values $Q_t$
  * Boltzmann distribution: $$\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum^n_{b=1} e^{Q_t(b)/\tau}}$$
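
A minimal Python sketch of Boltzmann action selection under these definitions; the function name boltzmann_probs, the temperature value, and the example estimates are illustrative, not from the lecture:

<code python>
import numpy as np

def boltzmann_probs(q_values, tau=1.0):
    """Boltzmann (softmax) action probabilities from the value estimates Q_t."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # shift for numerical stability; probabilities unchanged
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# Example: three actions; low tau -> nearly greedy, high tau -> nearly uniform.
probs = boltzmann_probs([1.0, 2.0, 0.5], tau=0.5)
action = np.random.choice(len(probs), p=probs)
</code>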

===== Estimating action values =====
$$Q_k = \frac{r_1 + r_2 + \ldots + r_k}{k}$$

==== Incremental implementation ====
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
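
A short Python sketch of this update rule; the helper name update_estimate and the example rewards are made up for illustration:

<code python>
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# With step size 1/(k+1) this reproduces the sample average Q_{k+1} exactly.
q_estimate = 0.0
for k, reward in enumerate([1.0, 0.0, 1.0, 1.0]):
    q_estimate = update_estimate(q_estimate, reward, 1.0 / (k + 1))

print(q_estimate)  # 0.75, the average of the four rewards
</code>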

==== Policy and return ====
Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities, $\pi_t(s,a) = $ probability that $a_t = a$ when $s_t = s$.
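
A toy representation of such a mapping; the state and action names below are made up for illustration:

<code python>
# A policy pi_t(s, a) as a nested mapping from states to action probabilities.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

prob = policy["s0"]["right"]   # probability that a_t = "right" when s_t = "s0"
</code>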

Return: the reward sequence received after step $t$ is
$$r_{t+1}, r_{t+2}, \ldots$$

We want to maximize the expected return $E\{R_t\}$ for each $t$, where $T$ is the final time step:
$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$

$$\text{shortsighted } 0 \leftarrow \gamma \rightarrow 1 \text{ farsighted}$$
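
A small Python sketch that evaluates this sum for a finite reward sequence; the function name discounted_return and the example numbers are illustrative:

<code python>
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma**k * r_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# gamma near 0 -> shortsighted (only the next reward matters),
# gamma near 1 -> farsighted (later rewards count almost fully).
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
</code>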

==== Markov Property ====
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0 \} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
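
As an illustration of what the Markov property buys: the environment's dynamics can be stored in a table indexed only by the current state and action. All names and numbers in this sketch are made up:

<code python>
# A Markovian environment is fully specified by Pr{s', r | s, a}.
# Nested dict: (state, action) -> list of (next_state, reward, probability).
dynamics = {
    ("s0", "right"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s0", "left"):  [("s0", 0.0, 1.0)],
}

# Predicting the next state and reward needs only (s_t, a_t), not the earlier history.
outcomes = dynamics[("s0", "right")]
</code>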