Maschinelles Lernen und Data Mining

Reinforcement Learning

Agent → action → Environment
Environment → state, reward → Agent
(a minimal loop sketch in Python follows the lists below)
  1. Learner is not told what to do
  2. Trial and error search
  3. Delayed reward
  4. We need to balance exploration and exploitation
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what
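
A minimal sketch of this interaction loop in Python; the class names and the act/step/learn interface are illustrative assumptions, not part of the lecture notes.

<code python>
import random

class Environment:
    """Toy environment; step(action) returns the next state and a reward."""
    def step(self, action):
        next_state = random.random()              # placeholder state
        reward = 1.0 if action == 1 else 0.0      # reward: what is good
        return next_state, reward

class Agent:
    """Toy agent whose policy maps the observed state to an action."""
    def act(self, state):
        return random.choice([0, 1])              # policy: what to do
    def learn(self, state, action, reward):
        pass                                      # would update value estimates from the delayed reward

env, agent = Environment(), Agent()
state = 0.0
for t in range(10):
    action = agent.act(state)                     # Agent → action → Environment
    state, reward = env.step(action)              # Environment → state, reward → Agent
    agent.learn(state, action, reward)
</code>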

Evaluative feedback

  • Evaluative feedback vs. instructive feedback
  • Example: n-armed bandit
    • evaluative feedback
  • After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
  • Objective: maximize the reward over e.g. 1000 plays
  • Exploration/exploitation dilemma
    • Action value estimates: $Q_t(a) \approx Q^*(a)$
    • Greedy action: $a_t^* = \arg\max_a Q_t(a)$
  • $a_t = a_t^*$ ⇒ exploitation
  • $a_t \ne a_t^*$ ⇒ exploration
Action Value Methods

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$. $$Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$$ $$\lim_{k_a \rightarrow \infty} Q_t(a) = Q^*(a)$$
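
A small sketch of the sample-average estimate above; the class and attribute names are my own. The incremental update $Q \leftarrow Q + (r - Q)/k_a$ is mathematically equivalent to averaging all $k_a$ stored rewards, so only the running mean and the play count need to be kept per action.

<code python>
class SampleAverage:
    """Sample-average action-value estimates Q_t(a) for an n-armed bandit."""
    def __init__(self, n_actions, initial_value=0.0):
        self.Q = [initial_value] * n_actions   # current estimates Q_t(a)
        self.k = [0] * n_actions               # how often each action was chosen

    def update(self, action, reward):
        self.k[action] += 1
        # incremental mean: equivalent to (r_1 + ... + r_{k_a}) / k_a
        self.Q[action] += (reward - self.Q[action]) / self.k[action]
</code>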

$\epsilon$-greedy action selection
  • Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
  • $\epsilon$-greedy: $a_t = a_t^*$ with probability $1-\epsilon$, a random action with probability $\epsilon$ (see the sketch below)
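
A minimal sketch of $\epsilon$-greedy selection as defined above. It operates on a plain list of estimates $Q_t(a)$, e.g. the Q attribute of the SampleAverage sketch above; the function name is my own.

<code python>
import random

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise the greedy action a* = argmax_a Q(a) (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    best = max(Q)
    # break ties among equally good greedy actions at random
    return random.choice([a for a, q in enumerate(Q) if q == best])
</code>
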
The 10-armed testbed
  • $n = 10$ possible actions
  • Each $Q^*(a)$ is chosen randomly from $N(0,1)$
  • 1000 plays, averaged over 2000 experiments (simulation sketch below)
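
A sketch of the testbed as described, reusing the SampleAverage and epsilon_greedy sketches from above. The reward model $r_t \sim N(Q^*(a_t), 1)$ is an assumption (a common choice for this testbed); the notes only specify how $Q^*(a)$ is drawn.

<code python>
import random

def run_testbed(n_actions=10, plays=1000, runs=2000, epsilon=0.1):
    """Average reward per play of epsilon-greedy, averaged over many bandit problems."""
    avg_reward = [0.0] * plays
    for _ in range(runs):
        q_star = [random.gauss(0, 1) for _ in range(n_actions)]  # true values Q*(a) ~ N(0,1)
        est = SampleAverage(n_actions)
        for t in range(plays):
            a = epsilon_greedy(est.Q, epsilon)
            r = random.gauss(q_star[a], 1)     # assumed reward noise around Q*(a)
            est.update(a, r)
            avg_reward[t] += r / runs
    return avg_reward
</code>
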
Softmax action selection
  • Softmax grades action probabilities by the estimated values $Q_t(a)$ (see the sketch below)
  • Boltzmann distribution: $$\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum^n_{b=1} e^{Q_t(b)/\tau}}$$
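
A sketch of softmax (Boltzmann) action selection with temperature $\tau$; subtracting max(Q) before exponentiating is only a numerical-stability detail and does not change the probabilities.

<code python>
import math
import random

def softmax_action(Q, tau=0.5):
    """Choose action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    m = max(Q)                                           # shift for numerical stability
    weights = [math.exp((q - m) / tau) for q in Q]
    probs = [w / sum(weights) for w in weights]
    return random.choices(range(len(Q)), weights=probs)[0]
</code>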