  * Softmax: grade action probabilities by the estimated values $Q_t$
  * Boltzmann distribution: $\pi(a_t) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$
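
A minimal sketch of Boltzmann (softmax) action selection, assuming the estimated values $Q_t$ are kept in a NumPy array and the temperature $\tau$ is chosen by hand (function and variable names here are illustrative, not from these notes):

<code python>
import numpy as np

def softmax_action(q_values, tau=1.0, rng=None):
    """Sample an action with Boltzmann probabilities pi(a) proportional to exp(Q(a)/tau)."""
    rng = rng or np.random.default_rng()
    # Subtract the maximum before exponentiating for numerical stability;
    # this leaves the resulting distribution unchanged.
    prefs = (np.asarray(q_values) - np.max(q_values)) / tau
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return rng.choice(len(q_values), p=probs)

# Example: three actions with estimated values Q_t. A small tau makes the
# choice greedier, a large tau makes it closer to uniform exploration.
print(softmax_action([1.0, 2.0, 0.5], tau=0.5))
</code>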

===== Something else =====
Sample average of the rewards received so far: $Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$

==== Incremental implementation ====
$Q_{k+1} = Q_k + \frac{1}{k+1}\left[ r_{k+1} - Q_k \right]$

Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
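
A small Python sketch of the incremental update above; the running estimate ends up equal to recomputing the mean of all rewards seen so far (reward values are made up for the example):

<code python>
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rewards = [1.0, 0.0, 2.0, 1.0]
q = 0.0
for k, r in enumerate(rewards):
    # Step size 1/(k+1) reproduces the plain sample average Q_k.
    q = update_estimate(q, r, 1.0 / (k + 1))

print(q, sum(rewards) / len(rewards))  # both print 1.0
</code>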

==== Agent-Estimation??? ====
Learn a policy:
The policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$.

Return:
$r_{t+1}, r_{t+2}, \ldots$

We want to maximize the expected return $E\{R_t\}$ for each step $t$.

$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$

Discounted reward:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \quad \text{where } 0 \le \gamma \le 1$

shortsighted $\;0 \leftarrow \gamma \rightarrow 1\;$ farsighted
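
A short sketch computing the discounted return $R_t$ from a finite reward sequence, to show how $\gamma$ trades off immediate against future reward (the reward list is invented for the example):

<code python>
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma^k * r_{t+k+1} for the rewards following step t."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))  # 1.0   (shortsighted: only r_{t+1} counts)
print(discounted_return(rewards, gamma=0.9))  # 3.439 (farsighted: later rewards still matter)
</code>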

==== Markov Property ====
$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$
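
A toy illustration of the Markov property: in the (made-up) two-state MDP below, the distribution over next state and reward is stored per $(s, a)$ pair only, so it cannot depend on anything earlier in the history:

<code python>
import random

# P[(state, action)] = list of (probability, next_state, reward).
# The table is keyed only by the current state and action, which is
# exactly the Markov property: the history before s_t is irrelevant.
P = {
    ("s0", "a0"): [(0.8, "s0", 0.0), (0.2, "s1", 1.0)],
    ("s0", "a1"): [(1.0, "s1", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 2.0)],
    ("s1", "a1"): [(0.5, "s0", 0.0), (0.5, "s1", 1.0)],
}

def step(state, action, rng=random):
    """Sample (next_state, reward) given only the current state and action."""
    probs, outcomes = zip(*[(p, (s_next, r)) for p, s_next, r in P[(state, action)]])
    return rng.choices(outcomes, weights=probs, k=1)[0]

print(step("s0", "a0"))
</code>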