Machine Learning and Data Mining
Reinforcement Learning
[Figure: agent-environment loop. The agent sends actions to the environment; the environment returns a state and a reward to the agent.]
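The loop in the figure can be sketched in a few lines of Python; the `agent`/`env` interfaces (`act`, `observe`, `reset`, `step`) are assumptions for illustration, not part of the notes:

```python
# Minimal sketch of the agent-environment interaction loop
# (the agent/env interfaces are assumed, not from the notes).
def run_episode(agent, env, max_steps=1000):
    state = env.reset()                          # environment provides an initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # agent -> action -> environment
        state, reward, done = env.step(action)   # environment -> next state, reward -> agent
        agent.observe(state, reward)             # the learner is only told how good the action was
        total_reward += reward
        if done:
            break
    return total_reward
```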
- Learner is not told what to do
- Trial and error search
- Delayed reward
- We need to both explore and exploit
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
Evaluative feedback
- Evaluating actions taken vs. instructing by giving correct actions
- Example: n-armed bandit
- Purely evaluative feedback
- After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
- Objective: maximize the total reward over, e.g., 1000 plays
- Exploration/exploitation dilemma:
- Suppose we form action-value estimates $Q_t(a) \approx Q^*(a)$
- The greedy action at time $t$ is $a^*_t = \arg\max_a Q_t(a)$
- $a_t = a^*_t$ ⇒ exploitation
- $a_t \neq a^*_t$ ⇒ exploration
Action Value Methods
Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. The sample-average estimate is
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \qquad \lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$
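A minimal sketch of this sample-average estimate, assuming a bandit with 10 actions and made-up helper names:

```python
# Sample-average estimate: Q_t(a) = average of the rewards received so far for action a.
rewards_per_action = {a: [] for a in range(10)}    # hypothetical: 10 actions

def record(action, reward):
    rewards_per_action[action].append(reward)

def Q(action):
    r = rewards_per_action[action]
    return sum(r) / len(r) if r else 0.0           # default estimate before any play
```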
ε-greedy action selection
- Greedy: $a_t = a^*_t = \arg\max_a Q_t(a)$
- ε-greedy: $a_t = a^*_t$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$
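A minimal sketch of ε-greedy selection over an array of estimates $Q_t(a)$; the array `Q` and the default ε value are assumptions:

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """Choose argmax_a Q_t(a) with probability 1 - epsilon, a random action otherwise."""
    if np.random.random() < epsilon:
        return np.random.randint(len(Q))   # exploration: uniformly random action
    return int(np.argmax(Q))               # exploitation: greedy action
```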
The 10-armed testbed
- n=10 possible actions
- Each $Q^*(a)$ is chosen randomly from $N(0,1)$
- 1000 plays, averaged over 2000 experiments
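A sketch of the testbed under these settings; the notes only state that each $Q^*(a)$ is drawn from $N(0,1)$, so the reward noise $N(Q^*(a), 1)$ and the ε-greedy agent with ε = 0.1 are assumptions here:

```python
import numpy as np

n_actions, n_plays, n_experiments = 10, 1000, 2000
epsilon = 0.1                                       # assumed exploration rate
avg_reward = np.zeros(n_plays)

for _ in range(n_experiments):
    q_star = np.random.normal(0.0, 1.0, n_actions)  # each Q*(a) drawn from N(0, 1)
    Q = np.zeros(n_actions)                         # action-value estimates
    N = np.zeros(n_actions)                         # number of plays per action
    for t in range(n_plays):
        if np.random.random() < epsilon:
            a = np.random.randint(n_actions)        # explore
        else:
            a = int(np.argmax(Q))                   # exploit
        r = np.random.normal(q_star[a], 1.0)        # assumed reward noise N(Q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                   # incremental sample-average update (see below)
        avg_reward[t] += r

avg_reward /= n_experiments                         # average over the 2000 experiments
```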
Softmax action selection
- Softmax grades action probabilities by the estimated values $Q_t$
- Boltzmann distribution: $\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$, where $\tau$ is the temperature
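A minimal sketch of Boltzmann (softmax) selection with temperature τ; the max-shift is a standard numerical-stability trick, not something from the notes:

```python
import numpy as np

def softmax_action(Q, tau=0.5):
    """Sample an action with probability proportional to exp(Q_t(a) / tau)."""
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()                           # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(Q), p=probs))
```

Small τ makes the selection nearly greedy; large τ makes all actions nearly equiprobable.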
Recall the sample-average estimate: the average of the first $k$ rewards for an action is
$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$
Incremental implementation
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$$
Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
Agent-Environment Interface
Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities, $\pi_t(s,a) = $ probability that $a_t = a$ when $s_t = s$.
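As a concrete reading of $\pi_t(s,a)$, a minimal sketch that stores action probabilities per state and samples from them; the states and actions are made up for illustration:

```python
import random

# pi_t(s, a): probability of choosing action a in state s (each row sums to 1).
# States "s0", "s1" and actions "left", "right" are made up for illustration.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, state):
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]
```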
Returns: suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
We want to maximize the expected return, $E\{R_t\}$, for each step $t$.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$ (for a task that ends at a final time step $T$)
Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $0 \le \gamma \le 1$
$\gamma$ close to $0$: shortsighted; $\gamma$ close to $1$: farsighted
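A small numeric check of the discounted-return formula; the reward sequence and the two values of γ are made up:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 10.0]           # made-up rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards, 0.1))    # shortsighted: ~1.01, the late reward barely counts
print(discounted_return(rewards, 0.9))    # farsighted: ~8.29, the late reward dominates
```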
Markov Property
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
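A minimal sketch of what the Markov property means operationally: the next state and reward are sampled from a distribution that conditions only on the current $(s_t, a_t)$, never on the earlier history. The transition table below is made up:

```python
import random

# P(s', r | s, a): the next-state/reward distribution depends only on the current (s, a).
dynamics = {
    ("s0", "go"): [(("s1", 1.0), 0.7), (("s0", 0.0), 0.3)],   # ((s', r), probability)
}

def step(state, action):
    outcomes = dynamics[(state, action)]
    pairs = [o for o, _ in outcomes]
    probs = [p for _, p in outcomes]
    next_state, reward = random.choices(pairs, weights=probs)[0]
    return next_state, reward
```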