Reinforcement Learning

2nd year
Programme main editor:
Onsite in:
ECTS range:
5-7 ECTS


Francesco De Pellegrini
Laura Dioşan


Students are required to have taken an introductory machine learning course.

Good knowledge of probability and statistics is expected.

Basic knowledge of Markov chains is recommended, but it is not a prerequisite.

Pedagogical objectives:

This course provides an overview of reinforcement learning (RL) methods. Both theoretical and programming aspects are explored in depth so that students acquire solid expertise in each. By the end of the course, students should:

  • Understand the notion of stochastic approximations and their relation to RL;
  • Understand the basics of Markov decision theory;
  • Apply Dynamic Programming methods to solve the Bellman equations;
  • Master the basic techniques of Reinforcement Learning: Monte Carlo, Temporal Difference, and Policy Gradient;
  • Study a proof of convergence for RL algorithms;
  • Master more advanced techniques such as actor-critic methods and deep RL.

Evaluation modalities:

Final exam, lab and research project reports.

All students in the class will also conduct a research project in the field of reinforcement learning and write a short 5-page paper. Subjects, related to Constrained RL and Delayed RL, will be provided during the first class session.


This course introduces machine learning techniques based on stochastic approximations and MDP models, i.e., SARSA, Q-learning, and policy gradient. Two homework assignments focus on implementing these techniques, so that students master them through direct implementation. A project in teams of 2-3 students addresses more advanced techniques and problems in RL and, more generally, the application of Markov theory to modeling and optimization.
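As a flavour of the techniques named above, the tabular Q-learning update Q(s,a) ← Q(s,a) + α[r + γ max Q(s',·) − Q(s,a)] can be sketched on a toy chain MDP. The environment below (a 4-state chain with a rewarded terminal state) is an illustrative assumption, not a course assignment:

```python
# Minimal sketch of tabular Q-learning on a toy chain MDP.
# The chain, rewards, and hyperparameters are illustrative assumptions.
import random

N_STATES = 4          # states 0..3; state 3 is terminal with reward 1
ACTIONS = [-1, +1]    # move left or right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    """Deterministic chain dynamics: reward 1 on reaching state 3."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy
            a = rng.randrange(2) if rng.random() < EPS else max(range(2), key=lambda i: Q[s][i])
            s2, r, done = step(s, ACTIONS[a])
            target = r if done else r + GAMMA * max(Q[s2])
            Q[s][a] += ALPHA * (target - Q[s][a])   # Q-learning update
            s = s2
    return Q

Q = q_learning()
# Greedy policy in each non-terminal state: 1 = "move right", toward the reward
print([max(range(2), key=lambda i: Q[s][i]) for s in range(N_STATES - 1)])  # → [1, 1, 1]
```

The same loop with the target computed from the action actually taken next (rather than the max) would give SARSA, the on-policy counterpart studied alongside Q-learning.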


  • Course Overview. Introduction to Markov decision theory, stochastic approximations, and reinforcement learning;
  • Stochastic approximations: the Robbins-Monro algorithm;
  • Criteria for convergence;
  • Application to admission control problems;
  • Markov decision processes: definitions, average cost and discounted cost;
  • Bellman equations. Solutions based on Dynamic Programming;
  • Monte Carlo methods for Reinforcement Learning;
  • Temporal Difference methods: SARSA and Q-Learning;
  • Proof of convergence of Q-Learning;
  • Policy gradient: REINFORCE;
  • Actor-critic methods;
  • Multi-armed bandits;
  • Deep Reinforcement Learning.
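The stochastic-approximation topics at the top of the list can be previewed with the Robbins-Monro iteration x_{n+1} = x_n + a_n (Y_n − x_n), whose step sizes a_n = 1/n satisfy the classical convergence conditions (Σ a_n = ∞, Σ a_n² < ∞). The target distribution below is an illustrative choice:

```python
# Hedged sketch of the Robbins-Monro stochastic approximation scheme,
# here used to estimate E[Y] from noisy samples; distribution is illustrative.
import random

def robbins_monro(sample, n_iters=100_000, seed=1):
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, n_iters + 1):
        a_n = 1.0 / n                  # sum a_n diverges, sum a_n^2 converges
        x += a_n * (sample(rng) - x)   # stochastic approximation update
    return x

# Estimate the mean of a Uniform(0, 2) variable; the true mean is 1.0
est = robbins_monro(lambda rng: rng.uniform(0.0, 2.0))
print(est)
```

With a_n = 1/n this iteration reduces to the running sample mean; RL algorithms such as Q-learning are analysed in the course as stochastic approximations of exactly this form.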

Lab assignments:

  • Practice of stochastic approximation on a traffic admission problem;
  • Practice of Monte Carlo, Q-Learning and SARSA on a gridworld (discounted cost);
  • Practice of buffer management with admission control (average cost).
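For orientation before the gridworld lab, the Bellman optimality equation can be solved by value iteration on a small deterministic gridworld. The 3x3 grid, goal placement, and reward structure below are assumptions for illustration, not the lab's exact setting:

```python
# Illustrative value iteration (Dynamic Programming) on a 3x3 deterministic
# gridworld with reward 1 on entering the goal corner; setup is an assumption.
GAMMA = 0.9
N = 3
GOAL = (N - 1, N - 1)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def value_iteration(tol=1e-8):
    V = {(i, j): 0.0 for i in range(N) for j in range(N)}
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue  # terminal state keeps value 0
            best = -float("inf")
            for di, dj in MOVES:
                s2 = (min(max(s[0] + di, 0), N - 1), min(max(s[1] + dj, 0), N - 1))
                r = 1.0 if s2 == GOAL else 0.0
                best = max(best, r + GAMMA * V[s2])  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
# Optimal value decays geometrically with distance to the goal: gamma^(d-1)
print(round(V[(0, 0)], 4))  # → 0.729, i.e. 0.9^3 for the 4-step path
```

The Monte Carlo and temporal-difference labs then estimate these same values from sampled trajectories instead of from the known transition model.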

Required teaching material

Bibliography:
  • Artificial Intelligence: A Modern Approach, S. Russell and P. Norvig, Prentice Hall, 3rd edition, 2010.
  • Reinforcement Learning: An Introduction, R. S. Sutton and A. G. Barto, MIT Press, 1998.

Teaching volume:
28-42 hours
Supervised lab:
0-28 hours
0-3 hours


  • Laboratory-Based Course Structure
  • Open-Source Software Requirements