====== Machine Learning and Data Mining ======

===== Reinforcement Learning =====

^ Agent | -> actions -> ^ Environment |
^ ::: | <- state <- ^ ::: |
^ ::: | <- reward <- ^ ::: |

  - Learner is not told what to do
  - Trial-and-error search
  - Delayed reward
  - We need to both explore and exploit

  * **Policy**: what to do
  * **Reward**: what is good
  * **Value**: what is good because it predicts reward
  * **Model**: what follows what

==== Evaluative feedback ====

  * Evaluative feedback (rates the action taken) vs. instructive feedback (gives the correct action)
  * Example: the n-armed bandit
    * purely evaluative feedback
    * after each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
    * goal: maximize the total reward over, e.g., 1000 plays
  * Exploration/exploitation dilemma:
    * maintain action-value estimates $Q_t(a) \approx Q^*(a)$
    * the greedy action at time $t$ is $a_t^* = \arg\max_a Q_t(a)$
    * $a_t = a_t^*$ => exploitation
    * $a_t \ne a_t^*$ => exploration

== Action-value methods ==

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$. The sample-average estimate is
$$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$
$$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$

== $\epsilon$-greedy action selection ==

  * Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
  * $\epsilon$-greedy: $a_t = a_t^*$ with probability $1-\epsilon$, a random action with probability $\epsilon$

== The 10-armed testbed ==

  * $n = 10$ possible actions
  * Each $Q^*(a)$ is chosen randomly from a normal distribution $N(0,1)$
  * 1000 plays, averaged over 2000 experiments (see the bandit sketch at the end of these notes)

== Softmax action selection ==

  * Softmax grades the action probabilities by the estimated values $Q_t(a)$
  * Boltzmann distribution with temperature $\tau$ (see the softmax sketch at the end of these notes):
$$\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum^n_{b=1} e^{Q_t(b)/\tau}}$$

===== Something else =====

Sample average of the first $k$ rewards:
$$Q_k = \frac{r_1 + r_2 + \ldots + r_k}{k}$$

==== Incremental implementation ====

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]

==== Agent-Environment Interface ====

Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities; $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$.

Return: the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, \ldots$ We want to maximize the expected return $E\{R_t\}$ for each $t$:
$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return (see the return sketch at the end of these notes):
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^\infty \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$
$$\text{shortsighted } 0 \leftarrow \gamma \rightarrow 1 \text{ farsighted}$$

==== Markov Property ====

$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
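
The bandit material above can be condensed into a short simulation. The following is a minimal Python sketch, assuming rewards on play $a_t$ are drawn from $N(Q^*(a_t), 1)$ as in the testbed description; names such as ''run_bandit'', ''q_star'' and ''plays'' are illustrative, not from the lecture. It combines $\epsilon$-greedy action selection with the incremental sample-average update and averages over many independent bandit tasks.

<code python>
import random

# Sketch: 10-armed testbed with epsilon-greedy selection and the
# incremental sample-average update Q_{k+1} = Q_k + 1/(k+1) [r_{k+1} - Q_k].

def run_bandit(n=10, plays=1000, epsilon=0.1):
    # True action values Q*(a), each drawn from N(0, 1).
    q_star = [random.gauss(0.0, 1.0) for _ in range(n)]
    Q = [0.0] * n      # action-value estimates Q_t(a)
    k = [0] * n        # how often each action has been chosen (k_a)
    total_reward = 0.0

    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)                  # explore: random action
        else:
            a = max(range(n), key=lambda i: Q[i])    # exploit: greedy action a_t*
        # Reward for this play is drawn from N(Q*(a_t), 1).
        r = random.gauss(q_star[a], 1.0)
        # Incremental sample-average update; no need to store past rewards.
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]
        total_reward += r
    return total_reward / plays

# Average reward per play, averaged over 2000 independent bandit tasks.
print(sum(run_bandit() for _ in range(2000)) / 2000)
</code>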
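
The softmax rule can be written out directly from the Boltzmann formula. The helper names ''softmax_probs'' and ''softmax_action'' below are hypothetical, and subtracting the maximum before exponentiating is only a numerical-stability device; it does not change the probabilities.

<code python>
import math
import random

def softmax_probs(Q, tau):
    """pi_t(a) = exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau)."""
    scaled = [q / tau for q in Q]
    m = max(scaled)                     # shift to avoid overflow in exp()
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_action(Q, tau=0.5):
    """Sample an action index according to the Boltzmann probabilities."""
    return random.choices(range(len(Q)), weights=softmax_probs(Q, tau), k=1)[0]

# High temperature -> nearly uniform; low temperature -> close to greedy.
Q_est = [0.2, 1.0, -0.5]
print(softmax_probs(Q_est, tau=10.0))
print(softmax_probs(Q_est, tau=0.1))
print(softmax_action(Q_est))
</code>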
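
The discounted return satisfies the recursion $R_t = r_{t+1} + \gamma R_{t+1}$, which the short sketch below exploits for a finite reward sequence; the function name and the example rewards are illustrative only.

<code python>
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k r_{t+k+1} for rewards = [r_{t+1}, ..., r_T]."""
    R = 0.0
    # Iterate backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

rewards = [1.0, 0.0, 0.0, 5.0]
print(discounted_return(rewards, gamma=1.0))   # farsighted (undiscounted): 6.0
print(discounted_return(rewards, gamma=0.5))   # shortsighted: 1 + 0.5**3 * 5 = 1.625
</code>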