Differences

This shows you the differences between two versions of the page.

uni:8:ml:start [2015-06-11 09:22] skrupellos
uni:8:ml:start [2020-11-18 18:11] (current) – external edit 127.0.0.1
Line 5: Line 5:
 ^ :::      <- reward <-   ^ ::: | ^ :::      <- reward <-   ^ ::: |
  
-- Learner is not told what to do +  - Learner is not told what to do 
-- Trial and error search +  - Trial and error search 
-- Delayed reward +  - Delayed reward 
-- We need to explore and exploit +  - We need to explore and exploit 
 + 
 +  * **Policy**: what to do 
 +  * **Reward**: what is good 
 +  * **Value**: what is good because it predicts reward 
 +  * **Model**: what follows what 
 + 
 +==== Evaluating feedback ==== 
 +  * Evaluating actions vs. instructing 
 +  * Example: n-armed bandit 
 +    *  
 +    * evaluate feedback 
 +  * After each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$ 
 +  * Optimize reward over 1000 plays 
 +  * Exploration/exploitation 
 +    * ??? $Q_t(a) \approx Q^*(a)$ action value estimate 
 +    * ??? 
 +  * $a_t = a_t^*$ (the greedy action) => exploitation 
 +  * $a_t \neq a_t^*$ => exploration 
 +== Action Value Methods == 
 +Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. 
 +$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$ 
 +$$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$ 
 + 
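 +== $\epsilon$-greedy action selection == 
 +  * Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$ 
 +  * $\epsilon$-greedy: $a_t = a_t^*$ with probability $1 - \epsilon$, a random action with probability $\epsilon$ 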
 +== In the 10-Armed test bed == 
 +  * $n = 10$ possible actions 
 +  * Each $Q^*(a)$ is chosen randomly from $N(0,1)$ 
 +  * 1000 plays, averaged over 2000 experiments 
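
The test bed above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function names `run_bandit` and `average_reward`, the default `epsilon=0.1`, and the reward noise $N(Q^*(a), 1)$ are assumptions that the notes do not state.

```python
import random


def run_bandit(n_arms=10, plays=1000, epsilon=0.1, rng=None):
    """Play one n-armed bandit task with epsilon-greedy selection.

    Returns the list of rewards received on each play.
    """
    rng = rng or random.Random()
    # True action values Q*(a), drawn from N(0, 1) as in the test bed.
    q_star = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
    q_est = [0.0] * n_arms   # Q_t(a): sample-average estimates
    counts = [0] * n_arms    # k_a: how often each action was chosen
    rewards = []
    for _ in range(plays):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                       # explore
        else:
            a = max(range(n_arms), key=lambda i: q_est[i])  # exploit (greedy)
        # Assumption: rewards are N(Q*(a), 1); the notes do not specify this.
        r = rng.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]              # running sample average
        rewards.append(r)
    return rewards


def average_reward(experiments=2000, plays=1000, epsilon=0.1, n_arms=10):
    """Average the per-play reward over many independent bandit tasks."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    totals = [0.0] * plays
    for _ in range(experiments):
        task = run_bandit(n_arms=n_arms, plays=plays, epsilon=epsilon, rng=rng)
        for t, r in enumerate(task):
            totals[t] += r
    return [total / experiments for total in totals]
```

Calling `average_reward()` with the defaults reproduces the setup listed above (10 actions, 1000 plays, 2000 experiments); setting `epsilon=0.0` gives the purely greedy learner for comparison.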
 + 
 +== Softmax action selection == 
 +  * Softmax methods grade action probabilities by the estimated values $Q_t$ 
 +  * Boltzmann distribution: $\pi(a_t = a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$, where $\tau$ is the temperature 
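
A small sketch of Boltzmann action selection may help to see the effect of $\tau$. The function names `softmax_probabilities` and `softmax_action` are hypothetical, and subtracting the maximum estimate before exponentiating is a numerical-stability choice added here, not something taken from the notes.

```python
import math
import random


def softmax_probabilities(q_values, tau=1.0):
    """Boltzmann probabilities for each action, given estimates and temperature tau."""
    # Subtracting the maximum leaves the probabilities unchanged
    # but avoids overflow in exp() when tau is small.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, tau=1.0, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a large `tau` the probabilities become nearly uniform (more exploration); as `tau` approaches 0 the selection approaches the greedy choice.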
 + 
 +===== Something else ===== 
 +$$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$$ 
 + 
 +==== Incremental implementation ==== 
 +$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$ 
 + 
 +Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate] 
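
The incremental rule can be checked with a short sketch. The names `update` and `running_average` are made up for this illustration; the step size $1/(k+1)$ is the one from the update rule above.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)


def running_average(rewards):
    """Return Q_1, Q_2, ... computed incrementally, without storing past rewards."""
    q = 0.0  # Q_0; its value does not matter because the first step size is 1
    estimates = []
    for k, r in enumerate(rewards):        # k rewards seen so far, r is r_{k+1}
        q = update(q, r, 1.0 / (k + 1))
        estimates.append(q)
    return estimates
```

Each estimate equals the plain average of the rewards seen so far, but only the current estimate and the count are kept in memory.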
 + 
 +==== Agent-Estimation??? ==== 
 +Learn a policy: 
 +Policy at step $t$, $\pi_t$, is a mapping from states to action probabilities: $\pi_t(s,a)$ = probability that $a_t = a$ when $s_t = s$ 
 + 
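 +Return: 
 +$r_{t+1}, r_{t+2}, \ldots$ 
 + 
 +We want to maximize the expected return, $E\{R_t\}$, for each $t$ 
 + 
 +$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$$ 
 + 
 +Discounted return: 
 +$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \quad \text{where } 0 \le \gamma \le 1$$ 
 + 
 +shortsighted $0 \leftarrow \gamma \rightarrow 1$ farsighted 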
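
As a small worked example of the discounted sum, the sketch below evaluates it for a finite list of rewards. The function name `discounted_return` is hypothetical, and truncating the infinite sum to the rewards supplied is an assumption of the sketch.

```python
def discounted_return(rewards, gamma):
    """R_t for rewards = [r_{t+1}, r_{t+2}, ...] and discount rate gamma."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    ret = 0.0
    # Work backwards: R = r_{t+1} + gamma * (return from the next step on).
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

For the rewards `[1, 1, 1]`, `gamma=0` counts only the first reward (shortsighted), while `gamma=1` sums all of them (farsighted).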
 + 
 +==== Markov Property ==== 
 +$$Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0 \} = Pr \{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$$
uni/8/ml/start.1434007336.txt.gz · Last modified: 2020-11-18 18:10 (external edit)