

Maschinelles Lernen und Data Mining

Reinforcement Learning

Agent-environment loop: the agent selects actions and sends them to the environment; the environment returns the next state and a reward to the agent.
  1. Learner is not told what to do
  2. Trial and error search
  3. Delayed reward
  4. We need to balance exploration and exploitation

Evaluative feedback

Action Value Methods

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. Then

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \qquad \lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$

$\epsilon$-greedy action selection
The 10-armed testbed
Softmax action selection
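
A minimal sketch (not from the lecture, only an illustration) of both selection rules on an assumed 10-armed testbed: the true action values are drawn from a unit Gaussian, observed rewards add unit-variance noise, and $Q$ is estimated by sample averages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 10-armed testbed: true action values q*(a) ~ N(0, 1),
# observed rewards are q*(a) plus unit-variance Gaussian noise.
n_arms = 10
q_star = rng.normal(0.0, 1.0, n_arms)

def reward(a):
    return q_star[a] + rng.normal(0.0, 1.0)

# Sample-average estimates: Q_t(a) = (r_1 + ... + r_{k_a}) / k_a
Q = np.zeros(n_arms)   # current estimates
k = np.zeros(n_arms)   # how often each action has been chosen

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(n_arms))
    return int(np.argmax(Q))

def softmax_action(Q, tau=0.5):
    """Softmax (Gibbs/Boltzmann) selection: P(a) proportional to exp(Q(a) / tau)."""
    prefs = np.exp((Q - Q.max()) / tau)          # subtract max for numerical stability
    return int(rng.choice(n_arms, p=prefs / prefs.sum()))

for t in range(1000):
    a = epsilon_greedy(Q)          # or softmax_action(Q)
    r = reward(a)
    k[a] += 1
    Q[a] += (r - Q[a]) / k[a]      # incremental sample average (see below)
```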

Something else

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

Incremental implementation

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]

==== Agent-Environment Interface ====

Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities, $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$.
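
A short sketch of the incremental update in the common form above; the reward sequence is hypothetical and only illustrates that the running update reproduces the plain sample average without storing all rewards.

```python
def incremental_update(old_estimate, target, step_size):
    """General form: NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: step size 1/(k+1) gives Q_{k+1} = Q_k + (r_{k+1} - Q_k)/(k+1),
# which equals the mean of the first k+1 rewards.
Q, k = 0.0, 0
for r in [1.0, 0.0, 2.0, 1.0]:        # hypothetical reward sequence
    Q = incremental_update(Q, r, 1.0 / (k + 1))
    k += 1
print(Q)   # 1.0 == (1 + 0 + 2 + 1) / 4
```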

Return: based on the sequence of rewards received after step $t$: $r_{t+1}, r_{t+2}, \ldots$

We want to maximize the expected return, $E\{R_t\}$, for each step $t$.

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$

Discounted return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$

$\gamma$ close to 0: shortsighted; $\gamma$ close to 1: farsighted.
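
A small sketch computing the discounted return for a finite, hypothetical reward list, illustrating the shortsighted/farsighted effect of $\gamma$.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for a finite reward list."""
    R = 0.0
    for r in reversed(rewards):        # accumulate from the back: R = r + gamma * R
        R = r + gamma * R
    return R

rewards = [1.0, 0.0, 0.0, 5.0]                  # hypothetical rewards r_{t+1}, ..., r_T
print(discounted_return(rewards, gamma=1.0))    # undiscounted: 6.0
print(discounted_return(rewards, gamma=0.5))    # 1 + 0 + 0 + 0.125*5 = 1.625
print(discounted_return(rewards, gamma=0.0))    # shortsighted: only r_{t+1} = 1.0
```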

Markov Property

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
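
A toy example (assumed, not from the notes) of a Markov transition model: the distribution of $(s_{t+1}, r_{t+1})$ is stored per $(s, a)$ pair, so sampling the next step never needs the earlier history.

```python
import random

# Toy MDP (hypothetical): transition probabilities depend only on the current
# state and action, i.e. P(s', r | s, a) -- the Markov property.
# Keys: (state, action) -> list of (probability, next_state, reward).
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "left"):  [(1.0, "s0", 0.0)],
    ("s1", "right"): [(1.0, "s1", 2.0)],
}

def step(state, action):
    """Sample (s_{t+1}, r_{t+1}) using only (s_t, a_t); earlier history is irrelevant."""
    outcomes = P[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs, k=1)[0]
    return s_next, r

s = "s0"
for _ in range(5):
    s, r = step(s, "right")
    print(s, r)
```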