  - Delayed reward
  - We need to explore and exploit
| + | |||
| + | * **Policy** what to do | ||
| + | * **Reward** what is good | ||
| + | * **Value** what is good because it predicts reward | ||
| + | * **Model** what follows what | ||
| + | |||
==== Evaluating feedback ====
  * Evaluative feedback (how good the chosen action was) vs. instructive feedback (what the correct action would have been)
  * Example: n-armed bandit
    * evaluative feedback
    * after each play $a_t$ we get a reward $r_t$ where $E\{r_t \mid a_t\} = Q^*(a_t)$
    * goal: maximize the total reward over 1000 plays
  * Exploration/Exploitation dilemma
    * $Q_t(a) \approx Q^*(a)$: the action value estimate
    * greedy action: $a_t^* = \arg\max_a Q_t(a)$
    * $a_t = a_t^*$ => exploitation
    * $a_t \ne a_t^*$ => exploration
== Action Value Methods ==
Suppose that by the $t$-th play action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$. Then
$$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$
$$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$
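
A minimal sketch of the sample-average method under these definitions (Python/NumPy; the class name and interface are illustrative, not from the lecture):

<code python>
import numpy as np

class SampleAverageValues:
    """Keeps all observed rewards and returns Q_t(a) as their mean."""

    def __init__(self, n_actions):
        self.rewards = [[] for _ in range(n_actions)]

    def update(self, action, reward):
        self.rewards[action].append(reward)

    def value(self, action):
        # Q_t(a) = (r_1 + ... + r_{k_a}) / k_a, or 0 before the first play
        r = self.rewards[action]
        return float(np.mean(r)) if r else 0.0
</code>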
| + | |||
| + | == \epsilon-feeding action selection == | ||
| + | * feeding: $a_t = a_t^* = avg_a max Q_t(a)$ | ||
| + | * $\epsilon$-feeding = $a_t^*$???? | ||
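
A minimal sketch of $\epsilon$-greedy selection over the current estimates (Python/NumPy; the function name is illustrative):

<code python>
import numpy as np

def epsilon_greedy(Q, epsilon, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit: a_t = a_t^* = argmax_a Q_t(a)
</code>

With $\epsilon = 0$ this reduces to pure greedy exploitation.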
== The 10-armed testbed ==
  * $n = 10$ possible actions
  * each $Q^*(a)$ is chosen randomly from $N(0,1)$
  * 1000 plays, averaged over 2000 experiments
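
A sketch of a single testbed run under these assumptions (Python/NumPy; one experiment rather than the 2000 averaged in the lecture, names are illustrative):

<code python>
import numpy as np

def testbed_run(n=10, plays=1000, epsilon=0.1, seed=0):
    """One n-armed bandit experiment with epsilon-greedy sample averages."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=n)    # true values Q*(a) drawn from N(0,1)
    Q = np.zeros(n)                          # estimates Q_t(a)
    k = np.zeros(n)                          # play counts k_a
    rewards = np.empty(plays)
    for t in range(plays):
        a = rng.integers(n) if rng.random() < epsilon else int(np.argmax(Q))
        r = rng.normal(q_star[a], 1.0)       # reward with mean Q*(a)
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]            # incremental sample average
        rewards[t] = r
    return rewards.mean()
</code>

Averaging the result over many such runs (2000 in the lecture) gives the usual learning curves.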
| + | |||
| + | == Softmax action selection == | ||
| + | * Softmax grade action probabilities by estimated values $Q_t$ | ||
| + | * Bolzmann-distribution: | ||
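
A sketch of Boltzmann/softmax selection over the estimates (Python/NumPy; illustrative names):

<code python>
import numpy as np

def softmax_action(Q, tau=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q_t(a)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                 # shift for numerical stability
    p = np.exp(prefs)
    p /= p.sum()
    return int(rng.choice(len(Q), p=p))
</code>

Large $\tau$ makes all actions nearly equally likely; $\tau \to 0$ approaches greedy selection.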
| + | |||
| + | ===== Something else ===== | ||
| + | $$ Q_k = \frac{r_1, r_2, \ldots, r_k}{k}$$ | ||
| + | |||
| + | ==== Incremental implementation ==== | ||
| + | $$Q_{k+1} = Q_k+\frac{1}{k+1}\[r_{k+1} - Q_k\]$$ | ||
| + | |||
| + | Common form: NewEstimate == OldEstimate + StepSize[Target - OldEstimate] | ||
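
The point of the incremental form is that only the current estimate and the play count need to be stored, not the full reward history. A sketch (Python; illustrative names):

<code python>
class IncrementalAverage:
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""

    def __init__(self):
        self.k = 0      # number of rewards seen so far
        self.Q = 0.0    # current estimate Q_k

    def update(self, reward):
        self.k += 1
        self.Q += (reward - self.Q) / self.k   # step size 1/(k+1) w.r.t. the old k
        return self.Q
</code>

Starting from $k = 0$, the first update sets $Q_1 = r_1$ exactly.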
| + | |||
| + | ==== Agent-Estimation??? | ||
| + | Learn a policy: | ||
| + | Policy at step t, $\pi_t$ is a mapping from states to action probabilities $\pi_t(s, | ||
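
One way to picture such a policy is a table of action probabilities per state; a sketch with made-up states and probabilities (Python/NumPy):

<code python>
import numpy as np

# pi_t(s, a): probability of taking action a in state s (illustrative numbers)
policy = {
    "s0": np.array([0.9, 0.1]),   # two actions: mostly action 0
    "s1": np.array([0.2, 0.8]),   # mostly action 1
}

def sample_action(state, rng=None):
    rng = rng or np.random.default_rng()
    p = policy[state]
    return int(rng.choice(len(p), p=p))
</code>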
| + | |||
| + | Return: | ||
| + | $$r_{t+1}, r_{t+2}, | ||
| + | |||
| + | We want to maximize the expected reward, $E\{R_t\}$ for each t | ||
| + | |||
| + | $$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$ | ||
| + | |||
| + | Discounted reward | ||
| + | $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1} \text{where} 0 \le \gamma \le 1$$ | ||
| + | |||
| + | $$\text{shortsited} 0 \leftarrow \gamma \rightarrow 1 \text{farsited}$$ | ||
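
A short sketch of the discounted return computed from a finite reward sequence (Python; the infinite sum is truncated at the end of the list):

<code python>
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite list of future rewards."""
    R = 0.0
    for r in reversed(rewards):      # fold from the back: R = r + gamma * R
        R = r + gamma * R
    return R

# gamma = 1 recovers the undiscounted sum r_{t+1} + ... + r_T,
# gamma = 0 keeps only the immediate reward r_{t+1} (shortsighted).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # mathematically 1 + 0.9*0 + 0.81*2 = 2.62
</code>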
| + | |||
| + | ==== Markov Property ==== | ||
| + | $$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots s_0, a_0, r_0 \} = Pr \{s_{t+1} = s', r_{t+1} = 1 \mid s_t, a_t \} | ||
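
Under the Markov property a tabular model of the environment only needs $(s_t, a_t)$ as its key, never the earlier history; a sketch with made-up states, rewards, and probabilities (Python/NumPy):

<code python>
import numpy as np

# P{s_{t+1} = s', r_{t+1} = r | s_t, a_t}: keyed by (state, action) only,
# never by the earlier history (illustrative numbers).
model = {
    ("s0", 0): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],   # (next state, reward, prob)
    ("s0", 1): [("s1", 0.0, 1.0)],
}

def step(state, action, rng=None):
    rng = rng or np.random.default_rng()
    outcomes = model[(state, action)]
    idx = rng.choice(len(outcomes), p=[p for (_, _, p) in outcomes])
    s_next, r, _ = outcomes[idx]
    return s_next, r
</code>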
