Reinforcement Learning (Advanced Dynamic Programming)

 

Instructor: Dr. Rajesh Ganesan

    

Eng Bldg. Room 2217

Phone: (703) 993-1693                                                    

Fax: (703) 993-1521                                                                                 

Email: rganesan at gmu dot edu

https://www.heinz.cmu.edu/media/2023/July/generative-ai-is-a-math-problem-left-unchecked-it-could-be-a-real-problem

https://www.ornl.gov/news/powered-math-generative-ai-requires-new-knowledge-safe-use

https://spinningup.openai.com/en/latest/index.html

https://www.technossus.com/the-mathematics-behind-generative-ai-decoding-the-algorithms-and-models/

https://vinodsblog.com/2023/01/08/decoding-the-math-behind-powerful-generative-ai/#:~:text=At%20its%20heart%2C%20Generative%20AI,increasingly%20realistic%20and%20beautiful%20creations.

https://en.wikipedia.org/wiki/Generative_artificial_intelligence#:~:text=Transformers%20became%20the%20foundation%20for,traditional%20recurrent%20and%20convolutional%20models.

Week 1

Syllabus

Textbook: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118029176 by Warren Powell

Other References:

 

Big picture

Examples for DP/Approx DP     Read the chapter by Paul Werbos in Jennie Si et al., pages 3-44

Weeks  2 - 3

Chapters 1-3 DP

DP refresher   Notes        Classnotes 1        Classnotes 2

DP example question  

Excel Example value iteration for MDP
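The Excel sheet walks through this computation; for reference, here is a minimal Python sketch of synchronous value iteration on a made-up two-state, two-action MDP (the transition matrix, costs, and discount factor below are illustrative, not taken from the course example):

```python
import numpy as np

# Toy MDP: P[a, s, s'] is the transition probability matrix (TPM), C[a, s] the one-step cost
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # action 0
              [[0.5, 0.5], [0.9, 0.1]]])  # action 1
C = np.array([[1.0, 4.0], [2.0, 0.5]])    # C[a, s]
gamma = 0.9                               # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Synchronous update: every state's value is refreshed in each sweep
    Q = C + gamma * np.einsum('ast,t->as', P, V)   # Q[a, s]
    V_new = Q.min(axis=0)                          # minimizing cost
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmin(axis=0)
print(V, policy)
```

The same loop, with the update applied one state at a time, is the asynchronous variant used in the RL/ADP dialects below.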

 

 

Weeks 4 - 6

 

RL/ADP motivation Notes

 

Chapter 4 - RL Models with value iteration   Classnotes 3   Classnotes 4

 

Pre-Decision State Models

RL/ADP Dialect 1 - Asynchronous update with TPM; uses lowercase (little) v; uses pre-decision states (Fig 4.2, page 120)

RL/ADP Dialect 2 - Q-learning around the pre-decision state (no TPM; uses sample realizations from a simulator, which generates the next state with uniform probability). Uses lowercase (little) q, the Robbins-Monro stochastic approximation scheme, and the learning parameter alpha; uses pre-decision states
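A minimal Python sketch of Dialect 2 (the course codes these models in MATLAB; the states, cost function, and parameters below are made up for illustration):

```python
import random

n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.1
random.seed(0)

# Hypothetical one-step cost; in the course examples this comes from the problem
def cost(s, a):
    return 1.0 + 0.5 * s - 0.3 * a

Q = [[0.0] * n_actions for _ in range(n_states)]
s = 0
for _ in range(5000):
    a = min(range(n_actions), key=lambda x: Q[s][x])   # greedy action
    s_next = random.randrange(n_states)                # simulator: uniform next state
    q = cost(s, a) + gamma * min(Q[s_next])            # sample estimate, "little q"
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * q        # Robbins-Monro smoothing
    s = s_next
print(Q)
```

Note that no TPM appears anywhere: the uncertainty enters only through the simulated next state.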

 

https://www.v7labs.com/blog/deep-reinforcement-learning-guide

Deep Q Learning

 

**********

SARSA - evaluate a policy

RTDP - start from an optimistic value for each state (otherwise the same as ADP Dialect 1). Instead of setting all V(S) = 0, initialize all V(S) to optimistic values.

*********

 

 

RL/ADP Dialect 3 - Asynchronous update (with TPM; a simulator uses the TPM to generate the next state). Uses lowercase (little) v, which uses the TPM in its calculation, the Robbins-Monro stochastic approximation scheme, and the learning parameter alpha; uses pre-decision states (Fig 4.4, page 128)

 

RL/ADP Dialect 3.1 - Asynchronous update (no TPM; uses sample realizations from a simulator, which generates the next state with uniform probability, hopefully approximating the true TPM). Uses lowercase (little) v, which uses the transition probability (here, from a uniform distribution) in its calculation, the Robbins-Monro stochastic approximation scheme, and the learning parameter alpha; uses pre-decision states (Fig 4.4, page 128)
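Dialects 3 and 3.1 differ only in where the transition probabilities come from. A Python sketch of Dialect 3 with an illustrative randomly generated TPM (substituting uniform probabilities for P would give Dialect 3.1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.1

# Illustrative TPM P[a, s, s'] and costs C[a, s], made up for the sketch
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
C = rng.uniform(1.0, 5.0, size=(n_actions, n_states))

V = np.zeros(n_states)
s = 0
for _ in range(20000):
    # "little v": a full Bellman backup at the current state, using the TPM
    v = np.min(C[:, s] + gamma * P[:, s, :] @ V)
    V[s] = (1 - alpha) * V[s] + alpha * v        # asynchronous smoothed update
    a = np.argmin(C[:, s] + gamma * P[:, s, :] @ V)
    s = rng.choice(n_states, p=P[a, s])          # TPM generates the next state
print(V)
```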

 

Post-Decision State Models

 

RL/ADP Dialect 4 - Asynchronous update (no TPM; uses sample realizations from a simulator, which generates the next state with uniform probability). Uses lowercase (little) v, which does not use any transition probability in its calculation, the Robbins-Monro stochastic approximation scheme, and the learning parameter alpha; uses post-decision states (Fig 4.7, page 141)

RL/ADP Dialect 5 - Q-learning around the post-decision state (no TPM; uses sample realizations from a simulator, which generates the next state with uniform probability). Uses lowercase (little) q, the Robbins-Monro stochastic approximation scheme, and the learning parameter alpha; uses post-decision states
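A Python sketch of the post-decision idea (Dialect 4) on a toy machine-replacement problem; all costs and the degradation model are assumed for illustration. The key point is that no transition probability appears in the v-hat backup:

```python
import random
random.seed(0)

# Toy machine replacement: state = condition 0..M (higher is worse);
# action 0 = keep, action 1 = replace (condition resets to 0). Costs are assumed.
M, gamma, alpha = 5, 0.9, 0.1

def C(s, a):
    return 10.0 if a == 1 else 1.0 + 2.0 * s    # replacement cost vs. operating cost

def post(s, a):
    return 0 if a == 1 else s                   # post-decision state (before degradation)

Vpost = [0.0] * (M + 1)
s, sa_prev = 0, 0
for _ in range(30000):
    # v-hat is computed from Vpost alone: no transition probabilities appear here
    vhat, a = min((C(s, act) + gamma * Vpost[post(s, act)], act) for act in (0, 1))
    # Robbins-Monro smoothing applied to the PREVIOUS post-decision state
    Vpost[sa_prev] = (1 - alpha) * Vpost[sa_prev] + alpha * vhat
    sa_prev = post(s, a)
    s = min(sa_prev + random.choice((0, 1)), M)  # simulator: random degradation
print([round(v, 1) for v in Vpost])
```

The expectation over the random degradation is learned implicitly, because Vpost(sa) is smoothed over many sampled transitions out of sa.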

 

 

Summary of models and coding in MATLAB

 

Pre-decision state

DP - Value iteration, discounted cost criterion; uses TPM; synchronous update, so all possible next states are evaluated in each iteration

RL/ADP Dialect 1 - Figure 4.2, page 120, value iteration_ADP; uses TPM; V = v; asynchronous update; TPM used for finding the next state

RL/ADP Dialect 2 - Q-learning, value iteration_ADP_Q; no TPM; uses pre-decision state; Q(S,a) = (1-alpha)Q(S,a) + alpha q; asynchronous update; needs an uncertainty model (simulator) for finding the next state

RL/ADP Dialect 3 - Figure 4.4, page 128, value iteration_ADP2; uses TPM; V = (1-alpha)V + alpha v; asynchronous update; TPM used for finding the next state and in the v calculation

RL/ADP Dialect 3.1 - Figure 4.4, page 128, value iteration_ADP2_noTPM; no TPM; V = (1-alpha)V + alpha v; asynchronous update; simulator with uniform distribution used for finding the next state and in the v calculation

 

Post-decision state: probability not used in the v calculation

RL/ADP Dialect 4 - Figure 4.7, page 141, value iteration_ADP 3; no TPM; uses post-decision state; V = (1-alpha)V + alpha v; asynchronous update; needs an uncertainty model for finding the next state

 

Implementation of RL Algorithms

Week 6

Algorithm steps for RL Models

 

1.  Initialization

 

2. Chapter 7 - Stopping criteria with mean squared error - stochastic gradient

Fig 4.7, page 141, with MSE (page 255): value iteration_ADP3x.m for the machine replacement problem.   Classnotes 4
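A sketch of the stopping idea (illustrative numbers; the course implements it in value iteration_ADP3x.m): treat the smoothed update as a stochastic-gradient step and stop when the mean squared change in the estimate over a recent window falls below a tolerance.

```python
import random
random.seed(1)

alpha, tol, window = 0.1, 1e-5, 200
V, sq_changes = 0.0, []

for n in range(1, 100001):
    v_sample = 5.0 + random.gauss(0.0, 0.01)    # noisy sampled value (stand-in for a backup)
    V_new = (1 - alpha) * V + alpha * v_sample  # stochastic-gradient (smoothing) step
    sq_changes.append((V_new - V) ** 2)
    V = V_new
    # Stop when the mean squared update over the last `window` iterations is small
    if n >= window and sum(sq_changes[-window:]) / window < tol:
        break
print(n, round(V, 3))
```

The window average matters: a single small update can occur by chance long before the estimate has settled.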

 

Week 7

 

3. Chapter 11 - alpha decay over number of iterations

Matlab - alpha decay
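Two standard decay schedules, as a sketch (Powell's Chapter 11 discusses stepsize rules in depth; the tuning constant a below is an assumed value):

```python
# Stepsize (alpha) decay over iteration count n; the harmonic family satisfies
# the Robbins-Monro conditions: sum(alpha_n) diverges, sum(alpha_n^2) converges.
def harmonic(n, a=10.0):
    return a / (a + n - 1)          # generalized harmonic stepsize

def polynomial(n, beta=0.7):
    return 1.0 / n ** beta          # polynomial decay, 0.5 < beta <= 1

print([round(harmonic(n), 3) for n in (1, 10, 100, 1000)])
print([round(polynomial(n), 3) for n in (1, 10, 100, 1000)])
```

Larger a keeps alpha big for longer, which helps when early value estimates are far from the truth.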

 

Chapter 5 - Defining 4. state, 5. action, 6. uncertainty model and state transition

 

********

Importance of a good uncertainty simulator

Inventory control Example,  see hand out,     excel for inventory control  

Pre-decision state - probability used in the v calculation
with TPM: a Poisson distribution is used in the uncertainty model to generate the next state, and in the v calculation. Knowledge of the uncertainty is available. Matlab using Figure 4.4, page 128
without the true TPM: a uniform distribution is used instead in the uncertainty model to generate the next state, and in the v calculation. No knowledge of the uncertainty is available. Matlab using Figure 4.4, page 128


Post-decision state - probability not used in the v calculation
with TPM: the Poisson distribution is used in the uncertainty model to generate the next state (Matlab using Fig 4.7, page 141), with the Poisson (lambda = 1) demand probability from the question. Knowledge of the uncertainty is available.

without the true TPM: a uniform distribution is used instead in the uncertainty model to generate the next state (Matlab using Fig 4.7, page 141), with uniform demand probability. No knowledge of the uncertainty is available.
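A small Python check of why the simulator matters here (lambda = 1 comes from the example; the truncated support and the comparison itself are illustrative): the uniform stand-in badly misrepresents the small demands that Poisson(1) favors.

```python
import math

lam, support = 1.0, range(6)   # demand truncated to 0..5 for the sketch

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

true_p = [poisson_pmf(k, lam) for k in support]     # Poisson(1) demand probabilities
uniform_p = [1.0 / len(support)] * len(support)     # uniform stand-in simulator

print([round(p, 3) for p in true_p])
print([round(p, 3) for p in uniform_p])
```

Since the learned values are averages over simulated demands, a simulator with the wrong demand distribution converges to the values (and policy) of the wrong problem.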

********

 

Week 8

 

7. Chapter 12 - Exploration, Exploitation (learning), 8. pick a model (Q(S,a) - learning, V(S)-learning (pre or post decision state))

9. Contribution function - multi-objective

 

Week 9

10. Chapter 6 - policy representation - lookahead policy (decision tree, stochastic programming), policy function approximation, value function approximation, obtaining the policy in the learnt (implementation) stage

 

Week 10

 

11. Chapter 8 - Value Function Approximation

matlab code with 4 schemes
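The 4 schemes are in the MATLAB code; as one generic illustration (the basis choice and the target values below are assumed), here is a linear value function approximation fit by batch least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear VFA: V(s) is approximated by phi(s)^T theta.
# One assumed scheme: polynomial features of a scalar state.
states = np.linspace(0.0, 1.0, 50)
V_samples = 3.0 * states**2 - states + 0.5 + rng.normal(0.0, 0.05, 50)  # noisy observed values

Phi = np.column_stack([np.ones_like(states), states, states**2])  # phi(s) = [1, s, s^2]
theta, *_ = np.linalg.lstsq(Phi, V_samples, rcond=None)           # batch least squares
V_hat = Phi @ theta                                               # fitted values
print(np.round(theta, 2))
```

In the ADP setting, the same regression is typically done recursively as new sampled values v-hat arrive, rather than in one batch.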

 

12. Chapter 9 - Value of a policy: in the learnt (testing) stage, the action a learnt for each state S is implemented to get the objective function from the mean of the stochastic gradient with beta (discount parameter) = 1. Alternatively, collect C(S,a) for each S visited in 10000 Markov jumps. Collect the frequency of visits to each S in those 10000 jumps. Find the probability of visiting S (Prob(visit to S) = frequency/10000). Find the expected cost/reward per unit time = sum(Prob(visit to S) * C(S,a)) over all S.
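The frequency-based estimate described above can be sketched as follows (the chain, the costs, and the fixed learnt policy are made up for illustration):

```python
import random
random.seed(0)

n_states = 3
policy = [0, 1, 0]                                        # fixed learnt action per state
C = [[1.0, 2.0], [3.0, 1.5], [0.5, 4.0]]                  # C[s][a]
P = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]  # P[s][s'] under the policy

def step(s):
    r, acc = random.random(), 0.0
    for s2, p in enumerate(P[s]):
        acc += p
        if r < acc:
            return s2
    return n_states - 1

jumps, visits, s = 10000, [0] * n_states, 0
for _ in range(jumps):
    visits[s] += 1
    s = step(s)

# Expected cost per unit time = sum over s of Prob(visit s) * C(s, policy(s))
prob = [v / jumps for v in visits]
avg_cost = sum(prob[s] * C[s][policy[s]] for s in range(n_states))
print(prob, round(avg_cost, 3))
```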

 

Temporal difference for finite-horizon (finite T) problems. Relation between temporal difference and the stochastic gradient
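A sketch of that relation on an illustrative three-stage chain (gamma = 1 for the finite-horizon case): the TD(0) update V[t] += alpha * delta is a stochastic semi-gradient step on the squared Bellman error, with the target r + gamma*V[t+1] held fixed.

```python
import random
random.seed(0)

T, alpha, gamma = 3, 0.1, 1.0
V = [0.0] * (T + 1)          # V[T] = 0 at the terminal stage
reward = [1.0, 2.0, 0.5]     # illustrative per-stage rewards

for _ in range(2000):
    for t in range(T):
        r = reward[t] + random.gauss(0.0, 0.1)     # noisy observed reward
        delta = r + gamma * V[t + 1] - V[t]        # temporal difference
        # Semi-gradient step: minus alpha times d/dV[t] of (1/2)*delta^2,
        # treating the target r + gamma*V[t+1] as a constant
        V[t] += alpha * delta
print([round(v, 2) for v in V])
```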

Week 11

 

Algorithm steps - summary

 

VFA with diffusion wavelets (not in the book; we can talk offline if you are interested)

VFA DW Theory

Diffusion wavelets DW code for best basis - multiple levels - MATLAB

Excel to show value determination

ADP with scaling and wavelet functions code - MATLAB

VFA Steps implementation

 

Week 12

RL Models with policy iteration

https://www.geeksforgeeks.org/machine-learning/types-of-reinforcement-learning/

 

Pure Policy gradient

https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

TRPO/PPO

https://spinningup.openai.com/en/latest/algorithms/trpo.html#background

https://spinningup.openai.com/en/latest/algorithms/ppo.html

 

How to evaluate the gradient equation in a policy gradient reinforcement learning algorithm
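A minimal sketch of evaluating that gradient by sampling (two-armed bandit with a softmax policy; the arm rewards and the baseline value are assumed for illustration):

```python
import math, random
random.seed(0)

# Softmax policy: pi(a) proportional to exp(theta[a]).
# Policy-gradient identity: grad J = E[grad log pi(a) * (r - baseline)],
# evaluated here by sampling actions and rewards.
theta = [0.0, 0.0]
mean_reward = [1.0, 2.0]     # arm 1 is better (assumed values)
baseline, lr = 1.5, 0.05     # baseline reduces the variance of the estimate

def pi(theta):
    z = [math.exp(t) for t in theta]
    return [x / sum(z) for x in z]

for _ in range(2000):
    p = pi(theta)
    a = 0 if random.random() < p[0] else 1
    r = mean_reward[a] + random.gauss(0.0, 0.1)
    # Softmax score function: d(log pi(a)) / d(theta[k]) = 1{k == a} - p[k]
    for k in range(2):
        theta[k] += lr * ((1.0 if k == a else 0.0) - p[k]) * (r - baseline)
print([round(x, 3) for x in pi(theta)])
```

The probability mass shifts toward the better arm; the same score-function estimator underlies REINFORCE and, with a learned baseline, actor-critic methods.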

 

Hybrid approaches

 

Actor-critic methods

Assume a policy and implement it - actor

Policy evaluation (value determination of all states (a smaller number of states)) - critic

Policy improvement (policy determination) and implementation - actor

 

Stochastic DP - Infinite horizon notes (from OR 674)

policy iteration for the discounted criterion

policy iteration excel sheet   Machine replacement problem (from OR 674)
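The same evaluation/improvement loop as in the Excel sheet, sketched in Python on a made-up two-state MDP (the exact linear solve in the evaluation step is what the critic approximates when the state space is large):

```python
import numpy as np

# Toy discounted-cost MDP (illustrative numbers)
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.9, 0.1], [0.2, 0.8]]])   # P[a, s, s']
C = np.array([[2.0, 0.5], [1.0, 3.0]])     # C[a, s]
gamma, n_states = 0.9, 2

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = C_pi exactly
    P_pi = P[policy, np.arange(n_states)]
    C_pi = C[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, C_pi)
    # Policy improvement: greedy with respect to V
    Q = C + gamma * np.einsum('ast,t->as', P, V)
    new_policy = Q.argmin(axis=0)
    if np.array_equal(new_policy, policy):
        break                              # policy stable: optimal
    policy = new_policy
print(policy, V)
```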

 

Chapter 10 Actor-critic Page 420 of 647 for large number of states

 

https://www.geeksforgeeks.org/machine-learning/actor-critic-algorithm-in-reinforcement-learning/

https://medium.com/@soroush.avval/actor-critic-methods-in-reinforcement-learning-a-review-f403cd3784b3

https://medium.com/intro-to-artificial-intelligence/the-actor-critic-reinforcement-learning-algorithm-c8095a655

 

https://medium.com/@hassaanidrees7/reinforcement-learning-vs-0744a860ffa7

SAC - soft actor critic https://spinningup.openai.com/en/latest/algorithms/sac.html

 

DPG - Deterministic policy gradient

DQN+DPG = DDPG

Q learning- DQN- DPG- DDPG

 

https://cse.buffalo.edu/~avereshc/rl_fall19/lecture_21_Actor_Critic_DPG_DDPG.pdf

(DDPG in Matlab) https://www.mathworks.com/help/reinforcement-learning/ug/ddpg-agents.html#mw_086ee5c6-c185-4597-aefc-376207c6c24c

https://2020blogfor.github.io/posts/2020/04/rlddpg/

https://markus-x-buchholz.medium.com/deep-reinforcement-learning-deep-deterministic-policy-gradient-ddpg-algoritm-5a823da91b43#:~:text=In%20DDPG%2C%20the%20Actor%20is,maximizing%20reward%20through%20gradient%20ascent.

c14

 

Distributional RL

MIT book https://direct.mit.edu/books/oa-monograph-pdf/2111075/book_9780262374026.pdf

 

Multiagent RL

 

 

RL Algorithms List

 

Chapters 13, 14, and 15

 

ADP summary

 

 

 

 

Week 14

 

Project Presentation - Dec 4th Thursday 7:20 PM

*******************************************************************************

Project - Individual or in groups of 2.  Email me your group, title, and a short description by Nov 30th

 

Option 1: Pick one application from the following books and prepare a 5-10 minute overview to present in class. Please prepare slides. Provide a 1-2 page write-up by Dec 11, via email only.

 

For the slides and report, use this structure: background describing the problem, objective, state and its variables, action variables, uncertainty and how the state transitions to another state, reward/penalty - contribution function, transition probability, references, etc.

 

https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470544785    - Hand book of ADP Jennie Si et al. (eds)    - See part III for applications. You may also use part II.

https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118029176   - Textbook: ADP by Warren Powell   - Chapter 14

http://incompleteideas.net/book/bookdraft2017nov5.pdf   - RL by Sutton and Barto    - Chapter 16  -  computer games

Option 2: Coding. I can give you a problem, or you can select one.

Please prepare slides. Provide a 1-2 page write-up by Dec 11, via email only.

 

For the slides and report, use the same structure as in Option 1.

 

********************************************************************************