
Machine Learning and Data Mining

Reinforcement Learning

Agent → actions → Environment
Environment → state, reward → Agent

(A minimal code sketch of this interaction loop follows the list below.)
  1. The learner is not told what to do
  2. Trial-and-error search
  3. Delayed reward
  4. We need to balance exploration and exploitation
  • Policy: what to do
  • Reward: what is good
  • Value: what is good because it predicts reward
  • Model: what follows what
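
A minimal sketch of the agent-environment loop above. The `Environment` class, its `reset`/`step` methods, and the two-state dynamics are invented here purely for illustration; they are not from the lecture:

<code python>
import random

class Environment:
    """Toy two-state environment (hypothetical, for illustration only)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Made-up dynamics: action 1 in state 0 yields reward 1, everything else 0.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = random.choice([0, 1])      # environment moves to a new state
        return self.state, reward

env = Environment()
state = env.reset()
for t in range(5):
    action = random.choice([0, 1])              # agent chooses an action
    state, reward = env.step(action)            # environment returns next state and reward
    print(f"t={t}  action={action}  reward={reward}  next state={state}")
</code>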

Evaluative feedback

  • Evaluative feedback (evaluates the actions taken) vs. instructive feedback (gives the correct action)
  • Example: n-armed bandit
    • purely evaluative feedback
  • After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
  • Objective: maximize the total reward over e.g. 1000 plays
  • Exploration/exploitation dilemma
    • $Q_t(a) \approx Q^*(a)$: action value estimate
    • $a_t^* = \arg\max_a Q_t(a)$: greedy action
  • $a_t = a_t^*$ ⇒ exploitation
  • $a_t \ne a_t^*$ ⇒ exploration
Action Value Methods

Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. Then
$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \qquad \lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$
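
A small sketch of this sample-average estimate; the true value `q_star` and the unit-variance reward noise are assumptions made only for this example:

<code python>
import numpy as np

rng = np.random.default_rng(0)
q_star = 1.5                                  # assumed true action value Q*(a)

rewards = q_star + rng.normal(size=10_000)    # k_a noisy rewards for action a
q_t = rewards.mean()                          # Q_t(a) = (r_1 + ... + r_{k_a}) / k_a

print(q_t)                                    # approaches Q*(a) = 1.5 as k_a grows
</code>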

$\epsilon$-greedy action selection
  • Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
  • $\epsilon$-greedy: with probability $1-\epsilon$ choose $a_t = a_t^*$, with probability $\epsilon$ choose a random action
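
A minimal sketch of $\epsilon$-greedy selection over the current estimates; function and variable names are illustrative:

<code python>
import numpy as np

def epsilon_greedy(q_estimates, epsilon, rng):
    """With probability epsilon explore uniformly, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore: random action
    return int(np.argmax(q_estimates))               # exploit: a_t* = argmax_a Q_t(a)

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.2, 1.0, -0.3]), epsilon=0.1, rng=rng))
</code>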
The 10-armed testbed
  • n = 10 possible actions
  • Each $Q^*(a)$ is drawn randomly from $\mathcal{N}(0,1)$
  • 1000 plays, averaged over 2000 experiments
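
A sketch of a single run on such a testbed (one experiment rather than the average over 2000 runs); $\epsilon = 0.1$ and the reward noise $\mathcal{N}(Q^*(a), 1)$ are assumptions made for this example:

<code python>
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_plays, epsilon = 10, 1000, 0.1

q_star = rng.normal(size=n_arms)             # true values Q*(a) ~ N(0, 1)
q_est = np.zeros(n_arms)                     # estimates Q_t(a)
counts = np.zeros(n_arms, dtype=int)

total_reward = 0.0
for t in range(n_plays):
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))        # explore
    else:
        a = int(np.argmax(q_est))            # exploit
    r = rng.normal(q_star[a], 1.0)           # reward with mean Q*(a)
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
    total_reward += r

print("average reward:", total_reward / n_plays)
print("estimated best arm:", int(np.argmax(q_est)), " true best arm:", int(np.argmax(q_star)))
</code>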
Softmax action selection
  • Softmax grades action probabilities by the estimated values $Q_t(a)$
  • Boltzmann distribution: $\pi_t(a) = \dfrac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$
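
A sketch of the Boltzmann/softmax probabilities; the temperature $\tau = 0.5$ is just an example value:

<code python>
import numpy as np

def softmax_probs(q_estimates, tau):
    """pi_t(a) = exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau); larger tau -> more uniform."""
    prefs = np.exp((q_estimates - np.max(q_estimates)) / tau)   # shift for numerical stability
    return prefs / prefs.sum()

rng = np.random.default_rng(0)
q = np.array([0.2, 1.0, -0.3])
p = softmax_probs(q, tau=0.5)
a = rng.choice(len(q), p=p)      # sample an action from pi_t
print(p, a)
</code>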

Something else

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$

Incremental implementation

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate ← OldEstimate + StepSize · [Target − OldEstimate]
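
A minimal sketch of this incremental update; the reward sequence is made-up example data:

<code python>
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]       # made-up rewards r_1, ..., r_5

q = 0.0                                    # Q_0
for k, r in enumerate(rewards, start=1):
    q = q + (1.0 / k) * (r - q)            # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)

print(q, sum(rewards) / len(rewards))      # the recursion reproduces the sample average: 1.4  1.4
</code>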

Agent-Environment Interaction

Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$

Return: the sequence of rewards after step $t$: $r_{t+1}, r_{t+2}, \ldots$

We want to maximize the expected return, $E\{R_t\}$, for each step $t$

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$

Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}$, where $0 \le \gamma \le 1$

$\gamma$ close to 0: shortsighted; $\gamma$ close to 1: farsighted
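
A sketch of a discounted return computed from a finite reward sequence; the rewards and $\gamma$ are made-up example values:

<code python>
rewards = [1.0, 0.0, 2.0, 1.0]                 # made-up rewards r_{t+1}, r_{t+2}, ...
gamma = 0.9                                    # discount factor

# R_t = sum_k gamma^k * r_{t+k+1}
R_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(R_t)                                     # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
</code>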

Markov Property

$$\Pr\{s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = \Pr\{s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t\}$$
