Reinforcement Learning
Theory and Implementation in a Custom Environment
You can find the thesis here and the code here.
Abstract
Reinforcement Learning (RL) is a subfield of Machine Learning that has repeatedly surpassed human performance across a variety of environments and datasets. Its applications range from mastering games such as Go and Chess to optimizing real-world operations in robotics, finance, and healthcare. The adaptability and efficiency of RL algorithms in dynamic, complex scenarios highlight their transformative potential across multiple domains.
In this thesis, we present some core concepts of Reinforcement Learning.
First, we introduce the mathematical foundation of RL through Markov Decision Processes (MDPs), a formal framework for modeling decision-making problems in which outcomes are partly random and partly under the control of a decision-maker, with state transitions influenced by the agent’s actions. Then, we give an overview of the two main branches of Reinforcement Learning: value-based methods, which estimate the value of states or state-action pairs, and policy-based methods, which directly optimize the policy that dictates the agent’s actions.
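To make the MDP description concrete, here is a minimal sketch of a two-state MDP in Python. The states, actions, probabilities, and rewards are invented for illustration and are not taken from the thesis or from Pneuma; the point is only to show an outcome that is partly chosen by the agent (the action) and partly random (the sampled transition).

```python
import random

# Hypothetical two-state MDP (all names and numbers are illustrative assumptions).
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # "go" succeeds 80% of the time
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go": [(1.0, "s0", 0.0)],
    },
}


def step(state, action, rng=random):
    """Sample (next_state, reward): the agent picks the action, chance picks the outcome."""
    outcomes = P[state][action]
    weights = [prob for prob, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _, reward in P[state][action])
```

In this picture, a value-based method would estimate quantities such as the long-run return of each `(state, action)` pair, while a policy-based method would adjust the probabilities with which the agent chooses `"stay"` or `"go"` directly.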
We focus on Proximal Policy Optimization (PPO), the de facto baseline algorithm in modern RL literature owing to its robustness and ease of implementation. We discuss its potential advantages, such as improved sample efficiency and stability, as well as its disadvantages, including sensitivity to hyperparameters and computational overhead. We emphasize the importance of fine-tuning PPO to achieve optimal performance.
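As a rough illustration of where PPO's stability comes from, the sketch below computes the standard clipped surrogate objective for a batch of samples in plain Python. It assumes log-probabilities and advantage estimates are already available; the function name and the default `clip_eps=0.2` are illustrative choices, not the settings used in the thesis.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the direction that helps.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping range is one of the hyperparameters the objective is sensitive to: a wider range allows larger policy updates per batch, a narrower one makes updates more conservative.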
We demonstrate the application of these concepts within Pneuma, a custom-made environment designed specifically for this thesis. Pneuma aims to become a research base for independent Multi-Agent Reinforcement Learning (MARL), where multiple agents learn and interact within the same environment. We outline the requirements such environments must meet to support MARL effectively and detail the modifications we made to the baseline PPO method, as presented by OpenAI, to facilitate agent convergence at the single-agent level.
Finally, we discuss potential future enhancements to the Pneuma environment that would increase its complexity and realism, with the aim of creating a more RPG-like setting well suited to training agents on complex, multi-objective, and multi-step tasks.