
Machine Learning and Data Mining

Reinforcement Learning

Agent → action → Environment
Agent ← state ← Environment
Agent ← reward ← Environment
  1. Learner is not told what to do
  2. Trial and error search
  3. Delayed reward
  4. We need to trade off exploration and exploitation
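
A minimal sketch of the interaction loop in the diagram above, assuming toy `Environment` and `Agent` classes invented here for illustration (not any particular RL library):

```python
import random

class Environment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Toy dynamics: the action shifts the state; the reward is noisy
        # and is highest when the state stays near zero.
        self.state += action
        reward = -abs(self.state) + random.gauss(0, 0.1)
        return self.state, reward

class Agent:
    def select_action(self, state):
        # Pure trial-and-error for now: no learning, just a random action.
        return random.choice([-1, 0, 1])

env = Environment()
agent = Agent()
state = env.state
for t in range(10):
    action = agent.select_action(state)   # Agent → action → Environment
    state, reward = env.step(action)      # Environment → state, reward → Agent
    print(t, action, state, round(reward, 2))
```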

Evaluative feedback

Action Value Methods

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$. Then $$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$ $$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$
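
A small sketch of this sample-average estimate; the action names and reward lists are made up for illustration:

```python
# Made-up reward histories: action a1 was chosen k_a = 3 times, a2 twice.
rewards_per_action = {
    "a1": [1.0, 0.0, 1.0],
    "a2": [0.5, 0.5],
}

def q_estimate(rewards):
    # Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a, the sample average.
    return sum(rewards) / len(rewards)

for a, rs in rewards_per_action.items():
    print(a, q_estimate(rs))   # a1 -> 0.666..., a2 -> 0.5
```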

$\epsilon$-greedy action selection
In the 10-armed testbed
Softmax action selection
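
A sketch of both selection rules named above, assuming a list `Q` of current action-value estimates and arbitrary values for $\epsilon$ and the softmax temperature $\tau$:

```python
import math
import random

Q = [1.0, 1.5, 0.2]   # assumed action-value estimates for three actions

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon pick a random action (explore),
    # otherwise pick the currently best action (exploit).
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax(Q, tau=1.0):
    # Gibbs/Boltzmann selection: P(a) is proportional to exp(Q(a) / tau).
    prefs = [math.exp(q / tau) for q in Q]
    total = sum(prefs)
    return random.choices(range(len(Q)), weights=[p / total for p in prefs])[0]

print(epsilon_greedy(Q), softmax(Q))
```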

Something else

$$ Q_k = \frac{r_1 + r_2 + \dots + r_k}{k}$$

Incremental implementation

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate] (a numeric sketch of this update follows the policy definition below)

==== The Agent-Environment Interface ====

Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities, $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$
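
A numeric sketch of the incremental update above, using a made-up reward sequence; the final estimate matches the plain sample average $Q_k$:

```python
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]   # made-up reward sequence r_1, ..., r_5

Q = 0.0   # initial estimate (overwritten by the first update, whose step size is 1)
for k, r in enumerate(rewards):
    # NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]
    Q = Q + (1.0 / (k + 1)) * (r - Q)
    print(k + 1, Q)

# The incremental rule reproduces the plain sample average.
assert abs(Q - sum(rewards) / len(rewards)) < 1e-12
```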

Return: suppose the sequence of rewards received after step $t$ is $$r_{t+1}, r_{t+2}, \ldots$$

We want to maximize the expected return, $E\{R_t\}$, for each step $t$

$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$

$$\text{shortsighted } 0 \leftarrow \gamma \rightarrow 1 \text{ farsighted}$$
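
A short sketch of the discounted return for a made-up finite reward sequence, showing how a small $\gamma$ is shortsighted while a $\gamma$ near 1 is farsighted:

```python
rewards = [1.0, 1.0, 1.0, 1.0, 10.0]   # made-up sequence r_{t+1}, r_{t+2}, ...

def discounted_return(rewards, gamma):
    # R_t = sum over k of gamma^k * r_{t+k+1}, truncated to the rewards we have.
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))   # ≈ 1.11: late rewards barely count
print(discounted_return(rewards, 0.9))   # ≈ 10.0: the large final reward dominates
```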

Markov Property

$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0 \} = Pr \{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
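
One way to picture the Markov property: the one-step dynamics can be stored as a table conditioned only on the current state and action. The states, actions, rewards, and probabilities below are invented for illustration:

```python
# P[(s, a)] maps each possible (next state, reward) pair to its probability.
P = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 1.0): 1.0},
}

# Nothing earlier than (s_t, a_t) is needed: each conditional distribution
# depends only on the current state and action, and must sum to 1.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12
    print(s, a, dist)
```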