Maschinelles Lernen und Data Mining

Reinforcement Learning

Agent → action → Environment
Environment → state, reward → Agent
(a minimal loop sketch in Python follows the lists below)
  1. Learner is not told what to do
  2. Trial and error search
  3. Delayed reward
  4. We need to balance exploration and exploitation
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what
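
A minimal sketch of this interaction loop in Python; the class names and the act/step/learn interface are illustrative assumptions, not part of the lecture notes.

<code python>
import random

class Environment:
    """Toy environment; step(action) returns the next state and a reward."""
    def step(self, action):
        next_state = random.random()              # placeholder state
        reward = 1.0 if action == 1 else 0.0      # reward: what is good
        return next_state, reward

class Agent:
    """Toy agent whose policy maps the observed state to an action."""
    def act(self, state):
        return random.choice([0, 1])              # policy: what to do
    def learn(self, state, action, reward):
        pass                                      # would update value estimates from the delayed reward

env, agent = Environment(), Agent()
state = 0.0
for t in range(10):
    action = agent.act(state)                     # Agent → action → Environment
    state, reward = env.step(action)              # Environment → state, reward → Agent
    agent.learn(state, action, reward)
</code>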

Evaluative feedback

  • Evaluative feedback vs. instructive feedback
  • Example: n-armed bandit
    • evaluative feedback
  • After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
  • Objective: maximize the reward over e.g. 1000 plays
  • Exploration/exploitation dilemma
    • Action value estimates: $Q_t(a) \approx Q^*(a)$
    • Greedy action: $a_t^* = \arg\max_a Q_t(a)$
  • $a_t = a_t^*$ ⇒ exploitation
  • $a_t \ne a_t^*$ ⇒ exploration
Action Value Methods

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$. $$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$ $$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$
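
A small sketch of the sample-average estimate above; the class and attribute names are my own. The incremental update $Q \leftarrow Q + (r - Q)/k_a$ is mathematically equivalent to averaging all $k_a$ stored rewards, so only the running mean and the play count need to be kept per action.

<code python>
class SampleAverage:
    """Sample-average action-value estimates Q_t(a) for an n-armed bandit."""
    def __init__(self, n_actions, initial_value=0.0):
        self.Q = [initial_value] * n_actions   # current estimates Q_t(a)
        self.k = [0] * n_actions               # how often each action was chosen

    def update(self, action, reward):
        self.k[action] += 1
        # incremental mean: equivalent to (r_1 + ... + r_{k_a}) / k_a
        self.Q[action] += (reward - self.Q[action]) / self.k[action]
</code>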

$\epsilon$-greedy action selection
  • Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
  • $\epsilon$-greedy: $a_t = a_t^*$ with probability $1-\epsilon$, a random action with probability $\epsilon$ (see the sketch below)
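
A minimal sketch of $\epsilon$-greedy selection as defined above. It operates on a plain list of estimates $Q_t(a)$, e.g. the Q attribute of the SampleAverage sketch above; the function name is my own.

<code python>
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise the greedy action a* = argmax_a Q(a) (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    best = max(Q)
    # break ties among equally good greedy actions at random
    return random.choice([a for a, q in enumerate(Q) if q == best])
</code>
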
The 10-armed testbed
  • $n = 10$ possible actions
  • Each $Q^*(a)$ is chosen randomly from $N(0,1)$
  • 1000 plays, averaged over 2000 experiments (simulation sketch below)
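
A sketch of the testbed as described, reusing the SampleAverage and epsilon_greedy sketches from above. The reward model $r_t \sim N(Q^*(a_t), 1)$ is an assumption (a common choice for this testbed); the notes only specify how $Q^*(a)$ is drawn.

<code python>
import random

def run_testbed(n_actions=10, plays=1000, runs=2000, epsilon=0.1):
    """Average reward per play of epsilon-greedy, averaged over many bandit problems."""
    avg_reward = [0.0] * plays
    for _ in range(runs):
        q_star = [random.gauss(0, 1) for _ in range(n_actions)]  # true values Q*(a) ~ N(0,1)
        est = SampleAverage(n_actions)
        for t in range(plays):
            a = epsilon_greedy(est.Q, epsilon)
            r = random.gauss(q_star[a], 1)     # assumed reward noise around Q*(a)
            est.update(a, r)
            avg_reward[t] += r / runs
    return avg_reward
</code>
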
Softmax action selection
  • Softmax grades action probabilities by the estimated values $Q_t(a)$ (see the sketch below)
  • Boltzmann distribution: $$\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum^n_{b=1} e^{Q_t(b)/\tau}}$$
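
A sketch of softmax (Boltzmann) action selection with temperature $\tau$; subtracting max(Q) before exponentiating is only a numerical-stability detail and does not change the probabilities.

<code python>
import math
import random

def softmax_action(Q, tau=0.5):
    """Choose action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    m = max(Q)                                           # shift for numerical stability
    weights = [math.exp((q - m) / tau) for q in Q]
    probs = [w / sum(weights) for w in weights]
    return random.choices(range(len(Q)), weights=probs)[0]
</code>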