^ :::      <- reward <-   ^ ::: |
  
  - Learner is not told what to do
  - Trial and error search
  - Delayed reward
  - We need to explore and exploit

  * **Policy** what to do
  * **Reward** what is good
  * **Value** what is good because it predicts reward
  * **Model** what follows what

==== Evaluating feedback ====
  * Evaluating actions vs. instructing (giving the correct action)
  * Example: n-armed bandit
    * evaluative feedback
  * After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
  * Maximize the total reward over e.g. 1000 plays
  * Exploration/Exploitation
    * $Q_t(a) \approx Q^*(a)$: the action value estimate
    * greedy action: $a_t^* = \arg\max_a Q_t(a)$
  * $a_t = a_t^*$ => exploitation
  * $a_t \ne a_t^*$ => exploration

== Action Value Methods ==
Suppose by the $t$-th play action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$
$$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$
$$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$
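
A minimal sketch of this sample-average estimate in Python (the function name and the example rewards are illustrative, not from the lecture):

<code python>
def sample_average(rewards):
    """Sample-average estimate Q_t(a) for one action, given the
    rewards r_1, ..., r_{k_a} observed so far for that action."""
    if not rewards:                  # action never played -> default estimate
        return 0.0
    return sum(rewards) / len(rewards)

# action a was played k_a = 4 times with these rewards:
print(sample_average([1.0, 0.0, 2.0, 1.0]))  # -> 1.0
</code>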

== $\epsilon$-greedy action selection ==
  * greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
  * $\epsilon$-greedy: choose $a_t^*$ with probability $1-\epsilon$, otherwise a random action with probability $\epsilon$
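
A minimal sketch of $\epsilon$-greedy selection over a table of estimates $Q_t(a)$ (function and variable names are illustrative):

<code python>
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a random action (exploration),
    otherwise the greedy action a_t^* = argmax_a Q[a] (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

Q = [0.2, 1.5, 0.7]
print(epsilon_greedy(Q, epsilon=0.1))  # usually 1 (the greedy arm)
</code>
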
== In the 10-Armed test bed ==
  * $n=10$ possible actions
  * Each $Q^*(a)$ is chosen randomly from $N(0,1)$
  * 1000 plays, averaged over 2000 experiments
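
A sketch of a single run of such a test bed, assuming (as in the standard formulation) that each observed reward is drawn from $N(Q^*(a), 1)$; variable names are illustrative:

<code python>
import random

n, plays, epsilon = 10, 1000, 0.1
q_star = [random.gauss(0, 1) for _ in range(n)]  # true values Q*(a) ~ N(0,1)
Q = [0.0] * n                                    # sample-average estimates
k = [0] * n                                      # plays per action
total_reward = 0.0

for t in range(plays):
    if random.random() < epsilon:                # epsilon-greedy selection
        a = random.randrange(n)
    else:
        a = max(range(n), key=lambda i: Q[i])
    r = random.gauss(q_star[a], 1)               # assumed reward model N(Q*(a), 1)
    k[a] += 1
    Q[a] += (r - Q[a]) / k[a]                    # incremental sample average
    total_reward += r

print(total_reward / plays)                      # average reward of this run
</code>

Averaging the per-play rewards of 2000 such independent runs gives the comparison curves for different values of $\epsilon$.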

== Softmax action selection ==
  * Softmax grades action probabilities by the estimated values $Q_t(a)$
  * Boltzmann distribution: $$\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum^n_{b=1} e^{Q_t(b)/\tau}}$$
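
A sketch of softmax selection with temperature $\tau$ (large $\tau$ makes the choice nearly uniform, small $\tau$ nearly greedy); helper names are illustrative:

<code python>
import math
import random

def softmax_probs(Q, tau):
    """Boltzmann probabilities pi_t(a) = exp(Q[a]/tau) / sum_b exp(Q[b]/tau)."""
    weights = [math.exp(q / tau) for q in Q]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_action(Q, tau):
    """Sample an action according to the Boltzmann distribution."""
    return random.choices(range(len(Q)), weights=softmax_probs(Q, tau), k=1)[0]

print(softmax_probs([0.2, 1.5, 0.7], tau=0.5))
print(softmax_action([0.2, 1.5, 0.7], tau=0.5))
</code>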

===== Something else =====
$$Q_k = \frac{r_1 + r_2 + \ldots + r_k}{k}$$

==== Incremental implementation ====
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]
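
A minimal sketch of this update; with step size $\frac{1}{k+1}$ it reproduces the sample average without storing the past rewards (names are illustrative):

<code python>
def incremental_update(old_estimate, target, step_size):
    """NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)"""
    return old_estimate + step_size * (target - old_estimate)

# incremental sample average of the rewards 1, 0, 2, 1
Q, k = 0.0, 0
for r in [1.0, 0.0, 2.0, 1.0]:
    Q = incremental_update(Q, r, 1.0 / (k + 1))
    k += 1
print(Q)  # -> 1.0, the same value as the plain sample average
</code>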

==== Agent-Environment interaction ====
Learn a policy:
The policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s,a) =$ probability that $a_t = a$ when $s_t = s$
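
A toy illustration of such a stochastic policy as a lookup table; the states, actions and probabilities are made up:

<code python>
import random

# pi_t(s, a): probability of taking action a in state s (made-up numbers)
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))  # "left" most of the time
</code>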

Return: the agent receives the reward sequence
$$r_{t+1}, r_{t+2}, \ldots$$

We want to maximize the expected return $E\{R_t\}$ for each step $t$

$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1} \quad \text{where } 0 \le \gamma \le 1$$

$$\text{shortsighted } 0 \leftarrow \gamma \rightarrow 1 \text{ farsighted}$$
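
A small sketch that evaluates the discounted return $R_t$ for a finite reward sequence (a finite tail stands in for the infinite sum; the numbers are illustrative):

<code python>
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for the given reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 1.0]          # r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, 1.0))  # undiscounted: 4.0
print(discounted_return(rewards, 0.9))  # discounted: ~3.35
</code>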

==== Markov Property ====
$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$