Reinforcement Learning (Advanced Dynamic Programming)
Instructor: Dr. Rajesh Ganesan
Eng Bldg. Room 2217
Phone: (703) 993-1693
Fax: (703) 993-1521
Email: rganesan at gmu dot edu
https://www.ornl.gov/news/powered-math-generative-ai-requires-new-knowledge-safe-use
https://spinningup.openai.com/en/latest/index.html
https://www.technossus.com/the-mathematics-behind-generative-ai-decoding-the-algorithms-and-models/
Week 1
Text Book: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118029176 by Warren Powell
Other References:
Examples for DP/Approximate DP: read the chapter by Paul Werbos in Jennie Si et al., pages 3-44
Weeks 2 - 3
Chapters 1-3 DP
DP refresher Notes Classnotes 1 Classnotes 2
DP example question
Excel Example value iteration for MDP
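As a refresher, synchronous value iteration with a known TPM fits in a few lines. The sketch below is in Python (the course's examples use Excel and MATLAB), and the 2-state, 2-action machine-replacement-flavored MDP, its TPM, and its rewards are illustrative assumptions, not the course's data:

```python
import numpy as np

P = np.array([  # P[a, s, s']: transition probability matrix per action
    [[0.9, 0.1], [0.0, 1.0]],   # action 0: keep the machine
    [[1.0, 0.0], [1.0, 0.0]],   # action 1: replace the machine
])
R = np.array([  # R[a, s]: expected immediate reward
    [5.0, -10.0],   # keep: good state earns 5, failed state costs 10
    [-2.0, -2.0],   # replace: fixed cost
])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality update, done synchronously for ALL states
    Q = R + gamma * (P @ V)          # Q[a, s]
    V_new = Q.max(axis=0)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)            # greedy policy from the converged values
```

Because every state is backed up in each sweep (a synchronous update), the TPM must be fully known; the RL/ADP dialects below relax exactly this requirement.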
Weeks 4 - 6
RL/ADP motivation Notes
Chapter 4 - RL Models with value iteration Classnotes 3 Classnotes 4
Pre-Decision State Models
RL/ADP Dialect 1 - Asynchronous update with TPM; uses lowercase ("little") v and pre-decision states (Fig 4.2, page 120)
RL/ADP Dialect 2 - Q-learning around the pre-decision state (no TPM; instead, sample realizations come from a simulator that uses a uniform probability to generate the next state). Uses lowercase ("little") q, the Robbins-Monro stochastic approximation scheme, a learning parameter alpha, and pre-decision states
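A minimal Python sketch of Dialect 2 (the course implements these in MATLAB): tabular Q-learning with a Robbins-Monro stepsize, epsilon-greedy actions, and a hand-made simulator in place of the TPM. The 2-state machine-replacement-style problem and all numbers are illustrative assumptions, not the course's model:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, eps, n_iters = 0.9, 0.1, 50_000

def simulate(s, a):
    """Uncertainty model standing in for the unknown TPM."""
    if a == 1:                        # replace: machine is new again
        return 0, -2.0
    if s == 0:                        # keep a working machine: 10% failure chance
        return (1 if rng.random() < 0.1 else 0), 5.0
    return 1, -10.0                   # keep a failed machine: stays failed

Q = np.zeros((2, 2))                  # Q[s, a]
s = 0
for n in range(1, n_iters + 1):
    # epsilon-greedy action choice (exploration vs. exploitation)
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s_next, r = simulate(s, a)
    q = r + gamma * Q[s_next].max()   # sampled "little q"
    alpha = 100.0 / (100.0 + n)       # Robbins-Monro stepsize
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * q
    s = s_next
```

The update Q(S,a) = (1-alpha)Q(S,a) + alpha*q is the Robbins-Monro smoothing; note that no transition probabilities appear anywhere, only sampled next states.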
https://www.v7labs.com/blog/deep-reinforcement-learning-guide
Deep Q Learning
**********
SARSA - evaluates a policy
RTDP - start from an optimistic value for each state (otherwise the same as ADP Dialect 1). Instead of setting all V(S) = 0, initialize all V(S) to optimistic values.
*********
RL/ADP Dialect 3 - Asynchronous update (with TPM; a simulator uses the TPM to generate the next state). Uses lowercase ("little") v, which uses the TPM in its calculation; Robbins-Monro stochastic approximation scheme; learning parameter alpha; pre-decision states (Fig 4.4, page 128)
RL/ADP Dialect 3.1 - Asynchronous update (no TPM; instead, sample realizations come from a simulator that uses a uniform probability to generate the next state, in the hope that the simulator gets close to the TPM). Uses lowercase ("little") v, which uses a transition probability (here, from a uniform distribution) in its calculation; Robbins-Monro stochastic approximation scheme; learning parameter alpha; pre-decision states (Fig 4.4, page 128)
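A Python sketch of Dialect 3.1 on an illustrative 2-state toy problem of my own (not the course's model): the uniform distribution appears twice, once inside the "little v" expectation and once in the simulator that draws the next state.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 2, 2, 0.9
C = np.array([[5.0, -2.0], [-10.0, -2.0]])   # C[s, a]: contribution
p_uniform = np.full(n_states, 1.0 / n_states)

V = np.zeros(n_states)
s = 0
for n in range(1, 20_001):
    alpha = 50.0 / (50.0 + n)                # Robbins-Monro stepsize
    # "little v": Bellman backup using the ASSUMED uniform probabilities
    v = max(C[s, a] + gamma * (p_uniform @ V) for a in range(n_actions))
    V[s] = (1 - alpha) * V[s] + alpha * v    # asynchronous smoothing update
    s = int(rng.integers(n_states))          # simulator: uniform next state
```

With a richer uncertainty model, p_uniform would be replaced by a distribution that may depend on the state and action; the structure of the update is unchanged.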
Post-Decision State Models
RL/ADP Dialect 4 - Asynchronous update (no TPM; instead, sample realizations come from a simulator that uses a uniform probability to generate the next state). Uses lowercase ("little") v, which does not use any transition probability in its calculation; Robbins-Monro stochastic approximation scheme; learning parameter alpha; post-decision states (Fig 4.7, page 141)
RL/ADP Dialect 5 - Q-learning around the post-decision state (no TPM; instead, sample realizations come from a simulator that uses a uniform probability to generate the next state). Uses lowercase ("little") q, the Robbins-Monro stochastic approximation scheme, a learning parameter alpha, and post-decision states
Summary of models and coding in MATLAB
Pre-decision state
DP - Value iteration, discounted cost criterion; uses TPM; synchronous update, so all possible next states are evaluated in each iteration
RL/ADP Dialect 1 - Figure 4.2, page 120; value iteration_ADP; uses TPM; V = v; async update; TPM used for finding the next state
RL/ADP Dialect 2 - Q-learning; value iteration_ADP_Q; no TPM; uses pre-decision state; Q(S,a) = (1-alpha)Q(S,a) + alpha*q; async update; needs an uncertainty model (simulator) for finding the next state
RL/ADP Dialect 3 - Figure 4.4, page 128; value iteration_ADP2; uses TPM; V = (1-alpha)V + alpha*v; async update; TPM used for finding the next state, and its probabilities used in the v calculation
RL/ADP Dialect 3.1 - Figure 4.4, page 128; value iteration_ADP2_noTPM; no TPM; V = (1-alpha)V + alpha*v; async update; simulator with a uniform distribution used for finding the next state; uniform probabilities used in the v calculation
Post-decision state: probabilities not used in the v calculation
RL/ADP Dialect 4 - Figure 4.7, page 141; value iteration_ADP3; no TPM; uses post-decision state; V = (1-alpha)V + alpha*v; async update; needs an uncertainty model for finding the next state
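The post-decision-state idea can be sketched in Python in the style of Fig 4.7 (the course uses MATLAB). The point is that "little v" contains no probabilities: the expectation is carried implicitly by the value of the post-decision state, and a simulator supplies the exogenous transition. The toy machine problem and all numbers below are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
C = np.array([[5.0, -2.0], [-10.0, -2.0]])   # C[s, a]: keep / replace

def post_state(s, a):
    """Deterministic effect of the decision (replace -> new machine)."""
    return 0 if a == 1 else s

def noise(s_post):
    """Exogenous information: a working machine may fail."""
    if s_post == 0:
        return 1 if rng.random() < 0.1 else 0
    return 1                                 # failed machine stays failed

V = np.zeros(2)                              # values of POST-decision states
s, prev_post = 0, 0
for n in range(1, 20_001):
    alpha = 50.0 / (50.0 + n)                # Robbins-Monro stepsize
    # v-hat: no probabilities appear anywhere in this maximization
    vals = [C[s, a] + gamma * V[post_state(s, a)] for a in range(2)]
    v_hat, a = max(vals), int(np.argmax(vals))
    V[prev_post] = (1 - alpha) * V[prev_post] + alpha * v_hat
    prev_post = post_state(s, a)
    s = noise(prev_post)                     # simulator draws the next state
```

One instructive side effect of this pure-greedy sketch: the greedy policy never visits post-decision state 1, so its value stays at the initial 0, which previews the exploration-exploitation discussion in Chapter 12.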
Implementation of RL Algorithms
Week 6
Algorithm steps for RL Models
1. Initialization
2. Chapter 7 - Stopping criterion using mean squared error - stochastic gradient
Fig 4.7, page 141, with MSE (page 255); value iteration_ADP3x.m for the machine replacement problem. Classnotes 4
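A hedged sketch of such an MSE-based stopping rule in Python: stop the iteration once the mean squared change in V between sweeps falls below a tolerance. The thresholds and the toy update are illustrative, not the book's exact page-255 procedure:

```python
import numpy as np

def run_until_converged(update, V0, tol=1e-6, max_iters=10_000):
    """update(V) -> new V; stop when mean((V_new - V)**2) < tol."""
    V = np.asarray(V0, dtype=float)
    for n in range(1, max_iters + 1):
        V_new = update(V)
        mse = np.mean((V_new - V) ** 2)      # mean squared change per sweep
        V = V_new
        if mse < tol:
            return V, n
    return V, max_iters

# Toy contraction update V <- c + gamma*V, whose fixed point is c/(1-gamma)
c, gamma = np.array([5.0, -2.0]), 0.9
V, n_iters = run_until_converged(lambda V: c + gamma * V, np.zeros(2))
```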
Week 7
3. Chapter 11 - alpha decay over the number of iterations
MATLAB - alpha decay
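Typical alpha-decay (stepsize) rules of the kind covered in Chapter 11 are easy to tabulate; the constants below are illustrative choices, not prescribed values:

```python
import numpy as np

n = np.arange(1, 10_001)

alpha_harmonic = 10.0 / (10.0 + n - 1)   # generalized harmonic: a/(a + n - 1)
alpha_poly = 1.0 / n ** 0.7              # polynomial decay, exponent in (0.5, 1]

# McClain's rule: harmonic-style decay that levels off at a target
# stepsize instead of going to zero.
target = 0.05
alpha_mcclain = np.empty_like(alpha_poly)
a = 1.0
for i in range(len(n)):
    alpha_mcclain[i] = a
    a = a / (1.0 + a - target)
```

The generalized harmonic and polynomial rules drive alpha to zero, as Robbins-Monro convergence requires; McClain's rule instead levels off at a target stepsize, which is useful when the problem is nonstationary.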
Chapter 5 - Defining: 4. the state, 5. the action, 6. the uncertainty model and state transition
********
Importance of a good uncertainty simulator
Inventory control example: see the handout and the Excel file for inventory control
Pre-decision state - probability used in the v calculation, but without the true TPM; instead, use a uniform distribution in the uncertainty model to generate the next state (MATLAB, using Fig 4.7, page 141) with uniform demand probability. No knowledge of the uncertainty is available.
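A minimal uncertainty simulator for the inventory example might look as follows in Python; the uniform demand on {0,...,4}, the capacity, and the cost coefficients are illustrative assumptions, not the handout's numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
capacity = 10

def transition(stock, order):
    """Pre-decision stock + order decision -> (next stock, reward)."""
    post = min(stock + order, capacity)      # post-decision inventory level
    demand = int(rng.integers(0, 5))         # uncertainty model: uniform on 0..4
    sales = min(post, demand)
    leftover = post - sales
    # sales margin minus ordering and holding costs (illustrative coefficients)
    reward = 8.0 * sales - 2.0 * order - 1.0 * leftover
    return leftover, reward

stock = 5
stock, reward = transition(stock, order=3)   # one simulated period
```

If the simulated demand distribution is far from the real one, the learnt values and policy inherit that error, which is why a good uncertainty simulator matters.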
********
Week 8
7. Chapter 12 - Exploration vs. exploitation (learning); 8. pick a model (Q(S,a)-learning, or V(S)-learning with pre- or post-decision states)
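The exploration-exploitation balance in step 7 is commonly handled with an epsilon-greedy rule whose exploration rate decays over iterations; the schedule below is an illustrative Python sketch, not the book's prescription:

```python
import numpy as np

def epsilon_greedy(Q_row, n, rng, eps0=1.0, eps_min=0.05, decay=1e-3):
    """Pick an action from one row of a Q table at iteration n."""
    eps = max(eps_min, eps0 * np.exp(-decay * n))   # decaying exploration rate
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))        # explore: random action
    return int(np.argmax(Q_row))                    # exploit: greedy action

rng = np.random.default_rng(0)
a = epsilon_greedy(np.array([0.0, 1.0]), n=10_000, rng=rng)
```

Keeping eps_min strictly positive guarantees every state-action pair keeps being sampled, which the Robbins-Monro scheme needs for convergence.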
9. Contribution function - multi-objective
Week 9
10. Chapter 6 - Policy representation - lookahead policy (decision tree, stochastic programming), policy function approximation, value function approximation; obtaining the policy in the learnt (implementation) stage
Week 10
11. Chapter 8 - Value Function Approximation
MATLAB code with 4 schemes
12. Chapter 9 - Value of a policy: the learnt (testing) stage, where the action a learnt for each state S is implemented to obtain the objective function from the mean of the stochastic gradient with beta (discount parameter) = 1. Alternatively, collect C(S,a) for each S visited in 10,000 Markov jumps, record the frequency of visits to each S in those jumps, compute Prob(visit to S) = frequency/10,000, and find the expected cost/reward per unit time = sum over all S of Prob(visit to S) * C(S,a).
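The frequency-based evaluation described in step 12 can be implemented directly; the 2-state chain, its transition matrix (used only as the simulator here), its contributions, and the learnt policy are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_jumps = 10_000

policy = np.array([0, 1])                    # learnt action per state (assumed)
C = np.array([[5.0, -2.0], [-10.0, -2.0]])   # C[s, a]
P = np.array([                               # P[a, s, s'] for the simulator
    [[0.9, 0.1], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
])

visits = np.zeros(2)
s = 0
for _ in range(n_jumps):
    visits[s] += 1
    a = policy[s]
    s = int(rng.choice(2, p=P[a, s]))        # one Markov jump

prob_visit = visits / n_jumps                # Prob(visit to S)
value_per_step = np.sum(prob_visit * C[np.arange(2), policy])
```

prob_visit estimates the stationary distribution under the learnt policy, so value_per_step approximates the long-run average reward per transition.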
Temporal difference for finite-T problems. Relation between temporal difference and the stochastic gradient.
Week 11
Algorithm steps - summary
VFA with diffusion wavelets (not in the book; we can talk offline if you are interested)
VFA DW Theory
Diffusion wavelets DW code for best basis - multiple levels - MATLAB
Excel to show value determination
ADP with scaling and wavelet functions code - MATLAB
VFA Steps implementation
Week 12
RL Models with policy iteration
https://www.geeksforgeeks.org/machine-learning/types-of-reinforcement-learning/
Pure Policy gradient
https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf
TRPO/PPO
https://spinningup.openai.com/en/latest/algorithms/trpo.html#background
https://spinningup.openai.com/en/latest/algorithms/ppo.html
How to evaluate the gradient equation in a policy gradient reinforcement learning algorithm
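For a concrete answer on a tiny case: with a softmax policy, grad log pi(a) = e_a - pi, and the sampled policy gradient is the reward times that score function (plain REINFORCE, no baseline). The 2-armed bandit and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.zeros(2)                          # policy parameters
means = np.array([1.0, 2.0])                 # true mean rewards (arm 1 better)
lr = 0.05

for _ in range(5000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                           # softmax policy
    a = int(rng.choice(2, p=pi))
    r = means[a] + rng.normal(0, 0.5)        # sampled reward
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0                    # score function: e_a - pi
    theta += lr * r * grad_log_pi            # sampled gradient ascent step
```

Subtracting a baseline (e.g. the running mean reward) from r leaves the gradient unbiased but reduces its variance, which is one motivation for the actor-critic methods below.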
Hybrid approaches
Actor-critic methods
Assume a policy and implement it - actor
Policy evaluation (value determination for all states; fewer states) - critic
Policy improvement (policy determination) and implementation - actor
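The three roles above can be sketched in one loop: the critic does TD(0) policy evaluation, and the actor improves a softmax policy using the TD error as its signal. The toy 2-state problem and all constants are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
gamma, lr_v, lr_pi = 0.9, 0.1, 0.05
C = np.array([[5.0, -2.0], [-10.0, -2.0]])   # C[s, a]: keep / replace
P = np.array([
    [[0.9, 0.1], [0.0, 1.0]],                # keep
    [[1.0, 0.0], [1.0, 0.0]],                # replace
])

V = np.zeros(2)                              # critic: state values
theta = np.zeros((2, 2))                     # actor: softmax params per state
s = 0
for _ in range(20_000):
    pi = np.exp(theta[s] - theta[s].max())
    pi /= pi.sum()                           # actor's current policy
    a = int(rng.choice(2, p=pi))
    r = C[s, a]
    s_next = int(rng.choice(2, p=P[a, s]))   # simulator draws next state
    td_error = r + gamma * V[s_next] - V[s]  # critic's evaluation signal
    V[s] += lr_v * td_error                  # critic: policy evaluation
    grad = -pi.copy()
    grad[a] += 1.0                           # score function: e_a - pi
    theta[s] += lr_pi * td_error * grad      # actor: policy improvement
    s = s_next
```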
Stochastic DP - Infinite horizon notes (from OR 674)
policy iteration for discounted criteria
policy iteration excel sheet Machine replacement problem (from OR 674)
Chapter 10 - Actor-critic (page 420 of 647) for large numbers of states
https://www.geeksforgeeks.org/machine-learning/actor-critic-algorithm-in-reinforcement-learning/
https://medium.com/@hassaanidrees7/reinforcement-learning-vs-0744a860ffa7
SAC - soft actor critic https://spinningup.openai.com/en/latest/algorithms/sac.html
DPG - Deterministic policy gradient
DQN+DPG = DDPG
Q learning- DQN- DPG- DDPG
https://cse.buffalo.edu/~avereshc/rl_fall19/lecture_21_Actor_Critic_DPG_DDPG.pdf
(DDPG in MATLAB) https://www.mathworks.com/help/reinforcement-learning/ug/ddpg-agents.html#mw_086ee5c6-c185-4597-aefc-376207c6c24c
https://2020blogfor.github.io/posts/2020/04/rlddpg/
MIT book https://direct.mit.edu/books/oa-monograph-pdf/2111075/book_9780262374026.pdf
Chapters 13, 14, and 15
Week 14
Project Presentation - Dec 4th Thursday 7:20 PM
*******************************************************************************
Project - individual or groups of 2. Email me your group, title, and a short description by Nov 30th.
Option 1: Pick one application from the following books and prepare a 5-10 minute overview to present in class. Please prepare slides. Provide a 1-2 page write-up by Dec 11 via email only.
For the slides and report, use this structure: background describing the problem, objective, state and its variables, action variables, uncertainty and how the state transitions to another state, reward/penalty (contribution function), transition probability, references, etc.
https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470544785 - Handbook of ADP, Jennie Si et al. (eds.) - See Part III for applications. You may also use Part II.
https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118029176 - Text book: ADP by Warren Powell - Chapter 14
http://incompleteideas.net/book/bookdraft2017nov5.pdf - RL by Sutton and Barto - Chapter 16 - computer games
Option 2: Coding. I can give you a problem, or you can select one.
Please prepare slides. Provide a 1-2 page write-up by Dec 11 via email only.
For the slides and report, use the same structure as in Option 1.
********************************************************************************