The agent-environment interaction loop:

    Agent → action → Environment
    Agent ← state, reward ← Environment

At each step the agent sends an action to the environment; the environment responds with a new state and a reward.
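A minimal sketch of this loop. The `Environment`, `Agent`, state names, and dynamics below are all invented for illustration; only the shape of the loop comes from the diagram above.

```python
import random

class Environment:
    """Toy environment (invented dynamics, stands in for the right-hand box)."""
    def step(self, action):
        state = random.choice(["s0", "s1"])      # next state
        reward = 1.0 if action == "a1" else 0.0  # reward for the last action
        return state, reward

class Agent:
    """Toy agent (stands in for the left-hand box); here it acts at random."""
    def act(self, state):
        return random.choice(["a0", "a1"])

env, agent = Environment(), Agent()
state = "s0"
for t in range(5):
    action = agent.act(state)         # Agent → action → Environment
    state, reward = env.step(action)  # Agent ← state, reward ← Environment
    print(t, action, state, reward)
```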
Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. Its value is then estimated by the sample average

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}, \qquad \lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$

Writing $Q_k$ for the estimate after $k$ rewards:

$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$
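A short sketch of this sample-average estimate for a single action, with made-up reward samples; it also checks that the incremental rewrite derived next produces the same number.

```python
import random

# r_1 ... r_k for one action, drawn from an invented reward distribution
rewards = [random.gauss(1.0, 0.5) for _ in range(1000)]

# Sample-average estimate: Q_k = (r_1 + ... + r_k) / k
q_batch = sum(rewards) / len(rewards)

# Same estimate maintained incrementally (the update rule derived below)
q = 0.0
for k, r in enumerate(rewards, start=1):
    q += (r - q) / k   # Q_k = Q_{k-1} + (1/k) [r_k - Q_{k-1}]

print(q_batch, q)      # identical; both approach Q*(a) = 1.0 as k grows
```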
Rather than storing all rewards, the same average can be maintained incrementally:

$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

==== Agent: Policy ====

Learn a policy: the policy at step $t$, $\pi_t$, is a mapping from states to action-selection probabilities:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\}$$
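One concrete way to build such a mapping (an illustration, not prescribed by these notes) is an ε-greedy policy, which turns the action-value estimates above into probabilities $\pi_t(s, a)$ for a given state:

```python
def epsilon_greedy(q_values, epsilon=0.1):
    """Map action-value estimates for one state to action probabilities pi(s, .)."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n        # explore: spread epsilon uniformly
    probs[best] += 1.0 - epsilon     # exploit: remaining mass on the greedy action
    return probs

print(epsilon_greedy([0.2, 1.3, 0.7]))  # most of the probability on action 1
```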
Return: the reward sequence received after step $t$ is $r_{t+1}, r_{t+2}, \ldots$

We want to maximize the expected return $E\{R_t\}$ for each step $t$, where for an episodic task ending at final time step $T$:

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$
Discounted return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma \le 1$$

As $\gamma \to 0$ the agent is short-sighted (only the immediate reward matters); as $\gamma \to 1$ it becomes far-sighted.
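A small sketch computing $R_t$ for a made-up reward sequence, showing how $\gamma$ moves the agent between the short-sighted and far-sighted extremes ($\gamma = 1$ recovers the undiscounted episodic return above):

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]           # a delayed payoff (invented numbers)
print(discounted_return(rewards, 0.0))    # 0.0  : short-sighted, only r_{t+1} counts
print(discounted_return(rewards, 0.9))    # 7.29 : the delayed reward still matters
print(discounted_return(rewards, 1.0))    # 10.0 : far-sighted, undiscounted sum
```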
The Markov property: the next state and reward depend only on the current state and action, not on the earlier history:

$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
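In code, the Markov property means a transition function needs only the current $(s_t, a_t)$, never the history. A sketch with an invented transition table (all states, rewards, and probabilities are made up):

```python
import random

# P[(s, a)] -> list of (next_state, reward, probability); depends only on (s_t, a_t)
P = {
    ("s0", "a0"): [("s0", 0.0, 0.8), ("s1", 1.0, 0.2)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", 2.0, 1.0)],
}

def step(state, action):
    """Sample (s', r) given only (s_t, a_t); s_0, a_0, r_0, ... are irrelevant."""
    outcomes = P[(state, action)]
    weights = [p for _, _, p in outcomes]
    s_next, reward, _ = random.choices(outcomes, weights=weights)[0]
    return s_next, reward

print(step("s0", "a0"))
```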