Reinforcement Learning, 2023 Edition, Lecture 1. Dimitri P. Bertsekas

On-Line Play algorithm.
Online tree search.
Search all the moves and determine their final values; choose the move based on those final values.
Act according to the outcomes.
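The idea above, enumerate the moves to some depth, score the resulting positions, and pick the move leading to the best final value, can be sketched as a depth-limited exhaustive search. The toy game and evaluation function below are made up for illustration; they are not AlphaZero's.

```python
# Depth-limited exhaustive search: enumerate move sequences, score the
# leaf positions, and return the best value plus the move that leads to it.
# Toy game: a position is an integer; the evaluation prefers positions near 10.
def moves(state):
    return [state + 1, state * 2]      # hypothetical successor positions

def value(state):                      # hypothetical leaf evaluation
    return -abs(state - 10)

def search(state, depth):
    if depth == 0:
        return value(state), None      # leaf: final value, no move
    best = None
    for nxt in moves(state):
        v, _ = search(nxt, depth - 1)
        if best is None or v > best[0]:
            best = (v, nxt)            # keep the move with the best final value
    return best

v, best_move = search(3, depth=3)
print(best_move)
```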
Off-Line Training in AlphaZero: Approximate Policy Iteration (PI)
A value neural net is obtained through training.
A policy neural net is obtained through training.
The on-line player plays better than the off-line-trained player.
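AlphaZero's off-line training approximates the classical policy iteration cycle: evaluate the current policy, then improve it greedily. A minimal exact (tabular) sketch of that cycle, on a made-up 3-state problem rather than with neural nets:

```python
# Tabular policy iteration on a toy 3-state deterministic problem, to
# illustrate the evaluate/improve cycle that approximate PI mimics.
# The states, transitions, and costs here are made up for illustration.
states = [0, 1, 2]
controls = [0, 1]
gamma = 0.9                       # discount factor

def f(x, u):                      # toy transition function
    return (x + u) % 3

def g(x, u):                      # toy stage cost
    return (x - u) ** 2

policy = {x: 0 for x in states}   # start from an arbitrary policy
for _ in range(20):
    # Policy evaluation: iterate J(x) = g(x, mu(x)) + gamma * J(f(x, mu(x)))
    J = {x: 0.0 for x in states}
    for _ in range(200):
        J = {x: g(x, policy[x]) + gamma * J[f(x, policy[x])] for x in states}
    # Policy improvement: act greedily with respect to J
    new_policy = {x: min(controls, key=lambda u: g(x, u) + gamma * J[f(x, u)])
                  for x in states}
    if new_policy == policy:      # converged to an optimal policy
        break
    policy = new_policy
print(policy)
```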
Central role of Newton's method?
What is the mathematical connection?
(Skipped this part.)
Reference page.
Terminology.
RL uses Max/Value
DP uses Min/Cost
- Reward of a stage = (opposite of) cost of a stage
- State value = (opposite of) state cost
- Value (or state-value) function = (opposite of) cost function
Controlled system terminology
- Agent = Decision maker or controller
- Action = Decision or control
- Environment = Dynamic system
Methods terminology
- Learning = Solving a DP-related problem using simulation
- Self-learning (or self-play in the context of games) = Solving a DP problem using simulation-based policy iteration.
- Planning vs. learning distinction = solving a DP problem with model-based vs. model-free simulation
Notations.
Two types: transition probabilities, or a discrete-time system equation.
Finite Horizon Deterministic Optimal Control Model
The system ends at the terminal state x_N (at stage N).
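Filling in the standard formulation (in the notation used throughout, with system function f_k and stage cost g_k), the model is a system equation plus an additive cost:

```latex
x_{k+1} = f_k(x_k, u_k), \qquad k = 0, 1, \ldots, N-1,
\qquad
J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k)
```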

A Special Case: Finite Number of States and Controls.

The main point: this can also be viewed as a shortest path problem...
Principle of Optimality:
THE TAIL OF AN OPTIMAL SEQUENCE IS OPTIMAL FOR THE TAIL SUBPROBLEM.
If there existed a better solution for the tail subproblem, we could substitute it for the current tail and improve the overall solution, a contradiction. Hence the principle of optimality holds.
From One Tail Subproblem to the Next.
I think this part tells us that we can solve the problem backward, moving from one tail subproblem to the next...
DP Algorithm: Solves all tail subproblems efficiently by using the Principle of Optimality.
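A minimal sketch of this backward DP recursion for a finite-state, finite-control problem. The states, controls, system equation, and costs below are toy examples, not from the lecture:

```python
# Backward DP for a finite-horizon deterministic problem with a finite
# number of states and controls. All data here is made up for illustration.
N = 3
states = [0, 1, 2]
controls = [0, 1]

def f(k, x, u):          # system equation x_{k+1} = f_k(x_k, u_k) (toy)
    return (x + u) % 3

def g(k, x, u):          # stage cost g_k(x_k, u_k) (toy)
    return (x - u) ** 2

def g_N(x):              # terminal cost g_N(x_N) (toy)
    return x

# J[k][x] = optimal cost of the tail subproblem starting at stage k, state x
J = [{x: None for x in states} for _ in range(N + 1)]
policy = [dict() for _ in range(N)]

for x in states:                         # stage N: terminal cost
    J[N][x] = g_N(x)

for k in range(N - 1, -1, -1):           # go backward: N-1, ..., 0
    for x in states:
        best_u = min(controls, key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
        policy[k][x] = best_u
        J[k][x] = g(k, x, best_u) + J[k + 1][f(k, x, best_u)]

print(J[0])  # optimal cost-to-go from each initial state
```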

(Two examples were given here; skipped.)
General Discrete Optimization.

Connect DP to Reinforcement Learning.

Use approximations \tilde{J}_k instead of J^\star_k (off-line training).
Generate all the approximations.
Then, going forward, find \tilde{u}_k (on-line play).
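A toy sketch of the on-line play step via one-step lookahead; J_tilde below is an arbitrary made-up function standing in for an off-line-trained value approximation, and the system and costs are the same kind of toy data as above:

```python
# One-step lookahead (on-line play) using an off-line approximation J_tilde.
N = 3
controls = [0, 1]

def f(k, x, u):              # toy system equation
    return (x + u) % 3

def g(k, x, u):              # toy stage cost
    return (x - u) ** 2

def J_tilde(k, x):           # stands in for a trained value approximation
    return 0.5 * x

def lookahead_control(k, x):
    # u_k minimizes g_k(x, u) + J_tilde_{k+1}(f_k(x, u))
    return min(controls, key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))

# Simulate forward from x_0 = 2, choosing controls on-line
x = 2
trajectory = [x]
for k in range(N):
    u = lookahead_control(k, x)
    x = f(k, x, u)
    trajectory.append(x)
print(trajectory)
```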
Extensions:
Stochastic finite horizon problems: x_{k+1} is random
Infinite horizon problems: instead of ending at stage N...
Stochastic partial state information problems:
we do not know the state perfectly
MINIMAX/game problems
(Course requirements; skipped.)