Papers
- Deep reinforcement learning from human preferences
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model - DPO
- Group Robust Preference Optimization in Reward-free RLHF - GRPO
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning
- Learning to summarize from human feedback
- Playing Atari with Deep Reinforcement Learning
- Scaling Laws for Reward Model Overoptimization
- Training language models to follow instructions with human feedback - InstructGPT
Surveys and Reviews
Implementation
- OpenRLHF docs - not very clean or readable according to Duarte; Patrick says Ray is a pain in the ass (authors present at RayCon so def. using Ray)
- Gymnasium Documentation and Gymnasium repo - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
- a maintained fork of OpenAI's Gym library
Resources
- Reinforcement Learning: An Overview - RL tutorial by Kevin Murphy
- Reinforcement Learning (Data Driven Science & Engineering - Chapter 11) by Steven L. Brunton and J. Nathan Kutz
- Foundations of Deep RL lecture series by Pieter Abbeel
- Policy Gradient Algorithms - post by Lilian Weng
- Reinforcement Learning Lecture Series by Steve Brunton - follows the book chapter above
- DeepMind x UCL | Introduction to Reinforcement Learning 2015 - 10-instalment lecture series by the one and only David Silver
- Reinforcement Learning: An Introduction (2015) by Richard S. Sutton and Andrew G. Barto
- OpenAI Spinning Up
- MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
- Deep Reinforcement Learning Course - HF 🤗 course
- backed by the Hugging Face Deep Reinforcement Learning Course repo
- chapter 1 "clipped" is from an older version: An Introduction to Deep Reinforcement Learning
- An Intuitive Explanation of Policy Gradient - Part 1: REINFORCE
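The REINFORCE update that post builds up to can be sketched on a toy two-armed bandit using only the standard library (the arm payoffs and hyperparameters here are made up for illustration):

```python
# Toy REINFORCE on a 2-armed bandit, stdlib only.
# Hypothetical setup: arm 1 always pays 1.0, arm 0 always pays 0.2.
import math
import random

random.seed(0)
theta = [0.0, 0.0]  # one logit per arm
alpha = 0.1         # learning rate

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    reward = 1.0 if a == 1 else 0.2
    # REINFORCE: theta_i += alpha * reward * d(log pi(a)) / d(theta_i)
    # For a softmax policy, d(log pi(a))/d(theta_i) = 1{i == a} - pi_i.
    for i in range(2):
        grad_logp = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * reward * grad_logp

print(softmax(theta))  # the policy should strongly favour arm 1
```

Because rewards are always positive, both arms get reinforced, but arm 1's larger reward wins out and the policy concentrates on it; a baseline (as discussed in the post) would reduce the variance of this drift.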