Machine Learning and Data Mining
Reinforcement Learning
Agent → actions → Environment
Agent ← state ← Environment
Agent ← reward ← Environment
- Learner is not told what to do
- Trial and error search
- Delayed reward
- We need to balance exploration and exploitation
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
Evaluative Feedback
- Evaluating actions taken vs. instructing by giving the correct action
- Example: n-armed bandit
- purely evaluative feedback
- after each play $a_t$ we get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
- objective: maximize the expected total reward over 1000 plays
- Exploration/exploitation dilemma
- $Q_t(a) \approx Q^*(a)$: action value estimate at play $t$
- greedy action: $a^*_t = \arg\max_a Q_t(a)$
- $a_t = a^*_t$ ⇒ exploitation
- $a_t \neq a^*_t$ ⇒ exploration
Action Value Methods
Suppose that by the $t$-th play, action $a$ has been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$. Then $Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$, and $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$.
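A minimal Python sketch of this sample-average estimate (class and variable names are my own, not from the notes); the incremental update is algebraically equivalent to averaging all $k_a$ rewards:

```python
import numpy as np

class SampleAverageEstimator:
    """Keeps Q_t(a) as the running mean of the rewards observed for action a."""

    def __init__(self, n_actions):
        self.Q = np.zeros(n_actions)  # estimates Q_t(a)
        self.k = np.zeros(n_actions)  # play counts k_a

    def update(self, a, r):
        # Incremental sample average: Q <- Q + (r - Q) / k_a
        self.k[a] += 1
        self.Q[a] += (r - self.Q[a]) / self.k[a]
```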
$\epsilon$-greedy action selection
- greedy: $a_t = a^*_t = \arg\max_a Q_t(a)$
- $\epsilon$-greedy: $a_t = a^*_t$ with probability $1-\epsilon$, otherwise a random action with probability $\epsilon$ (see the sketch below)
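A minimal sketch of $\epsilon$-greedy selection, assuming the current estimates are kept in a NumPy array (function name and signature are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action (exploration),
    otherwise pick the greedy action argmax_a Q(a) (exploitation)."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # explore
    return int(np.argmax(Q))              # exploit
```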
The 10-armed testbed
- n=10 possible actions
- Each $Q^*(a)$ is chosen randomly from $N(0,1)$
- 1000 plays, averaged over 2000 experiments (see the sketch below)
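A sketch of one way to run the testbed with $\epsilon$-greedy selection. It assumes, as in the standard setup, that each observed reward is drawn from $N(Q^*(a), 1)$; the function name, $\epsilon = 0.1$, and the seed are illustrative choices, not from the notes:

```python
import numpy as np

def run_testbed(n_actions=10, n_plays=1000, n_experiments=2000,
                epsilon=0.1, seed=0):
    """Average reward per play of epsilon-greedy on the n-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(n_plays)
    for _ in range(n_experiments):
        q_star = rng.normal(0.0, 1.0, n_actions)  # true values Q*(a) ~ N(0,1)
        Q = np.zeros(n_actions)                    # estimates Q_t(a)
        k = np.zeros(n_actions)                    # play counts k_a
        for t in range(n_plays):
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))   # explore
            else:
                a = int(np.argmax(Q))              # exploit
            r = rng.normal(q_star[a], 1.0)         # assumed reward noise N(Q*(a), 1)
            k[a] += 1
            Q[a] += (r - Q[a]) / k[a]              # sample-average update
            avg_reward[t] += r
    return avg_reward / n_experiments
```

The returned curve approximates the expected reward at each play, which is what the usual testbed learning curves plot.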
Softmax action selection
- Softmax grades action probabilities by the estimated values $Q_t(a)$
- Boltzmann distribution: $\pi_t(a) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$ (see the sketch below)
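A minimal sketch of softmax/Boltzmann selection (subtracting the maximum is only a numerical-stability trick and cancels in the ratio):

```python
import numpy as np

def softmax_action(Q, tau, rng=None):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(Q, dtype=float) / tau
    prefs -= prefs.max()              # numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(Q), p=probs))
```

High temperatures $\tau$ make the choice nearly uniform; as $\tau \to 0$ it approaches greedy selection.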