  * Delayed reward
  * We need to explore and exploit

  * **Policy**: what to do
  * **Reward**: what is good
  * **Value**: what is good because it predicts reward
  * **Model**: what follows what

==== Evaluative feedback ====
  * Evaluative feedback (how good was the action taken) vs. instructive feedback (what the correct action would have been)
  * Example: n-armed bandit (a code sketch follows this list)
    * evaluative feedback
    * after each play $$a_t$$ we get a reward $$r_t$$ where $$E\{r_t \mid a_t\} = Q^*(a_t)$$
    * objective: maximize the total reward over e.g. 1000 plays
  * Exploration/Exploitation dilemma
    * suppose $$Q_t(a) \approx Q^*(a)$$ are the action value estimates
    * greedy action: $$a_t^* = \arg\max_a Q_t(a)$$
    * $$a_t = a_t^*$$ => exploitation
    * $$a_t \neq a_t^*$$ => exploration
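A minimal Python sketch of this bandit setup, assuming unit-variance Gaussian reward noise; the class name ''NArmedBandit'' and all parameters are illustrative, not from the lecture:

<code python>
import random

class NArmedBandit:
    """Minimal n-armed bandit: each action a has a true value Q*(a),
    and playing a returns a noisy reward whose mean is Q*(a)."""

    def __init__(self, n=10):
        self.n = n
        # true action values Q*(a), hidden from the learner
        self.q_star = [random.gauss(0.0, 1.0) for _ in range(n)]

    def play(self, a):
        # reward r_t with E{r_t | a_t = a} = Q*(a)
        return random.gauss(self.q_star[a], 1.0)
</code>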
== Action Value Methods ==
Suppose that by the t-th play, action a has been chosen $$k_a$$ times, producing rewards $$r_1, r_2, \ldots, r_{k_a}$$. Then the sample-average estimate is
$$Q_t(a) = \frac{r_1 + r_2 + \ldots + r_{k_a}}{k_a}$$
$$\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$$
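A small sketch of this sample-average estimate, assuming the rewards received for each action are simply stored in a list; the names ''sample_average'' and ''history'' are made up for illustration:

<code python>
def sample_average(rewards):
    """Q_t(a): the average of the k_a rewards observed for action a so far."""
    return sum(rewards) / len(rewards) if rewards else 0.0

# rewards observed so far per action; k_a = len(history[a])
history = {0: [1.2, 0.7], 1: [0.1]}
q_estimates = {a: sample_average(rs) for a, rs in history.items()}
print(q_estimates)  # {0: 0.95, 1: 0.1}
</code>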

== ε-greedy action selection ==
  * greedy: $$a_t = a_t^* = \arg\max_a Q_t(a)$$
  * ε-greedy: $$a_t = a_t^*$$ with probability $$1 - \epsilon$$, a random action with probability $$\epsilon$$ (sketched below)
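A possible ε-greedy selection rule in Python; the function name and the default ''epsilon=0.1'' are assumptions:

<code python>
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the greedy action a* = argmax_a Q_t(a) (exploitation)."""
    n = len(q_estimates)
    if random.random() < epsilon:
        return random.randrange(n)                      # explore
    return max(range(n), key=lambda a: q_estimates[a])  # exploit
</code>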
== The 10-armed testbed ==
  * n = 10 possible actions
  * each $$Q^*(a)$$ is chosen randomly from $$N(0,1)$$
  * 1000 plays, averaged over 2000 experiments (sketched below)
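A sketch of this experiment under the assumptions above (sample-average estimates, ε-greedy selection with ε = 0.1, pure Python); all names and the incremental averaging shortcut are illustrative:

<code python>
import random

def run_testbed(n=10, plays=1000, epsilon=0.1):
    """One bandit run: Q*(a) ~ N(0,1), rewards ~ N(Q*(a), 1)."""
    q_star = [random.gauss(0.0, 1.0) for _ in range(n)]
    q_est = [0.0] * n   # sample-average estimates Q_t(a)
    counts = [0] * n    # k_a: how often each action was chosen
    rewards = []
    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)
        else:
            a = max(range(n), key=lambda i: q_est[i])
        r = random.gauss(q_star[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        rewards.append(r)
    return rewards

# average the reward curve over 2000 independent experiments, as in the notes
runs = 2000
avg_reward = [0.0] * 1000
for _ in range(runs):
    for t, r in enumerate(run_testbed()):
        avg_reward[t] += r / runs
</code>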

== Softmax action selection ==
  * Softmax grades action probabilities by the estimated values $$Q_t(a)$$
  * Boltzmann distribution: choose action a on play t with probability $$\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$$ where $$\tau$$ is the temperature (sketched below)
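A sketch of Boltzmann (softmax) selection using the formula above; the temperature parameter name ''tau'' and its default are assumptions:

<code python>
import math
import random

def softmax_action(q_estimates, tau=1.0):
    """Pick action a with probability exp(Q_t(a)/tau) / sum_b exp(Q_t(b)/tau).
    High tau -> nearly uniform (more exploration), low tau -> nearly greedy."""
    weights = [math.exp(q / tau) for q in q_estimates]
    return random.choices(range(len(q_estimates)), weights=weights)[0]
</code>

random.choices normalizes the weights itself, so the explicit division by the sum is not needed in code.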

===== Something else =====
The sample average after k rewards:
$$Q_k = \frac{r_1 + r_2 + \ldots + r_k}{k}$$

==== Incremental implementation ====
$$Q_{k+1} = Q_k + \frac{1}{k+1}\left[r_{k+1} - Q_k\right]$$

Common form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
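A sketch of this update rule; the function name and example rewards are illustrative:

<code python>
def incremental_update(q_k, r_next, k):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
    with StepSize = 1/(k+1) when q_k is the average of the first k rewards."""
    return q_k + (r_next - q_k) / (k + 1)

# equivalent to recomputing the full sample average after every reward
q = 0.0
for k, r in enumerate([1.0, 0.5, 2.0]):
    q = incremental_update(q, r, k)
print(q)  # 1.1666... = (1.0 + 0.5 + 2.0) / 3
</code>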

==== The Agent-Environment interface ====
Learn a policy:
The policy at step t, $$\pi_t$$, is a mapping from states to action probabilities: $$\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\}$$
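A tiny sketch of such a policy as a lookup table; the states ''s0''/''s1'' and actions ''left''/''right'' are made-up placeholders:

<code python>
import random

# pi[s][a] = probability of taking action a in state s
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, s):
    """Draw a_t according to pi_t(s, .) for the current state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]
</code>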

Return: the reward sequence after step t
$$r_{t+1}, r_{t+2}, r_{t+3}, \ldots$$

We want to maximize the expected return $$E\{R_t\}$$ for each step t.

$$R_t = r_{t+1} + r_{t+2} + \ldots + r_T$$

Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \quad \text{where } 0 \le \gamma \le 1$$

$$\gamma$$ close to 0: shortsighted, $$\gamma$$ close to 1: farsighted
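A short sketch computing a (truncated) discounted return from a finite list of future rewards; the function name and example values are illustrative:

<code python>
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1} for the rewards r_{t+1}, r_{t+2}, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# gamma = 0.5 and rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))
</code>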

==== Markov Property ====
$$\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \ldots, s_0, a_0, r_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$$