11. Reinforcement Learning
Learning Through Consequences
Reinforcement Learning (RL) is the paradigm in which an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards and penalties. It differs fundamentally from supervised and unsupervised learning: there is no dataset of correct answers. The agent must discover what works through trial and error, balancing exploration of new strategies against exploitation of known good ones. RL enabled AlphaGo to defeat top professional Go players, powers robotic manipulation systems, and was used to align ChatGPT with human preferences through Reinforcement Learning from Human Feedback (RLHF).
🎮 The RL Framework — Agent, Environment, Reward
Every RL problem is defined by the same components:
- Agent: The decision-maker that observes the environment and takes actions. In deep RL, the agent's policy is typically a neural network being trained.
- Environment: Everything the agent interacts with. Can be a game simulator, a physics engine, a real robot, or a language model feedback mechanism.
- State (s): The current observation of the environment. What the agent sees at time t. In a chess game: the current board position. In robotics: joint angles and velocities. In LLM fine-tuning: the conversation so far.
- Action (a): What the agent does. In a chess game: which piece to move where. In robotics: motor torques. In LLM fine-tuning: the next token to generate.
- Reward (r): A scalar signal indicating how good the last action was. The agent's objective is to maximize total cumulative reward over time, not just the immediate reward.
- Policy (π): The agent's strategy—a function mapping states to actions (or distributions over actions). This is what we're training.
- Value Function V(s): The expected cumulative reward from state s following policy π. Tells the agent "how good is it to be in this state?"
- Q-Function Q(s,a): The expected cumulative reward from taking action a in state s and following policy π afterwards. Tells the agent "how good is this specific action in this state?"
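The components above meet in a single interaction loop: the agent observes a state, its policy picks an action, and the environment returns a reward and the next state. The sketch below is only an illustration under stated assumptions: `CoinFlipEnv`, `always_heads_policy`, and `run_episode` are hypothetical names invented for this example, and the `reset`/`step` interface is a common convention rather than a specific library's API.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        return self.t  # the state here is simply the current time step

    def step(self, action):
        """Apply an action (0 = guess tails, 1 = guess heads)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def always_heads_policy(state):
    """A fixed policy pi(s): ignores the state and always guesses heads."""
    return 1


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # policy maps state -> action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnv(), always_heads_policy)
print(total)
```

Training an agent amounts to replacing the fixed policy with one that is adjusted from the rewards it collects.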
🔗 Markov Decision Processes (MDPs)
The formal mathematical framework for RL is the Markov Decision Process. An MDP is defined by a set of states (S), a set of actions (A), a transition function P(s'|s,a) (the probability of landing in state s' after taking action a in state s), and a reward function R(s,a). The Markov Property assumes that the next state depends only on the current state and action, not on the earlier history. This assumption is violated under partial observability, where the agent does not see the full state, but it remains a useful approximation in many practical RL problems.
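As a concrete illustration, a small MDP can be written down as plain data. The two-state example below is entirely hypothetical (the state names, action names, probabilities, and rewards are made up for this sketch); it only shows how S, A, P(s'|s,a), and R(s,a) fit together and how the Markov property appears in code: the next state is sampled from the current state and action alone.

```python
import random

# Hypothetical two-state MDP, invented for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# P[s][a] is a list of (probability, next_state) pairs, i.e. P(s'|s,a).
P = {
    "cool": {"slow": [(1.0, "cool")], "fast": [(0.5, "cool"), (0.5, "hot")]},
    "hot": {"slow": [(0.5, "cool"), (0.5, "hot")], "fast": [(1.0, "hot")]},
}

# R[s][a] is the immediate reward for taking action a in state s.
R = {
    "cool": {"slow": 1.0, "fast": 2.0},
    "hot": {"slow": 1.0, "fast": -10.0},
}


def sample_step(state, action, rng):
    """Sample (next_state, reward) using only the current state and action."""
    probs = [p for p, _ in P[state][action]]
    next_states = [s for _, s in P[state][action]]
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[state][action]


rng = random.Random(0)
print(sample_step("cool", "fast", rng))
```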
The RL Goal (Maximum Cumulative Reward): The agent maximizes the expected return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., where γ (gamma, the discount factor, between 0 and 1) sets how much the agent values future rewards relative to immediate ones. With γ close to 1, the agent plans far into the future; with γ close to 0, it focuses on immediate reward. For example, with γ = 0.9 a reward received 10 steps ahead is weighted by 0.9¹⁰ ≈ 0.35.
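The return can be computed for a finished episode by working backwards through the rewards, using the recursion G_t = r_t + γ·G_{t+1}. The helper below is a minimal sketch; the function name `discounted_return` and the sample reward list are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, 0.0))  # only the immediate reward: 1.0
print(discounted_return(rewards, 1.0))  # no discounting: 3.0
```

The three calls show the effect of γ described above: lower values shrink the contribution of later rewards.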
🧮 Q-Learning — Learning Without a Model
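Q-Learning learns the Q-function (state-action value function) directly from experience without modeling the environment. It is a model-free, off-policy algorithm: it needs no transition model, and it learns the value of the greedy policy even while the agent follows a different, more exploratory behavior policy. It is one of the most widely studied RL algorithms and the foundation of Deep Q-Networks.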
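The Bellman Optimality Equation: Q*(s,a) = E[r + γ·max_{a'} Q*(s', a')]. The optimal Q-value for a state-action pair equals the expected immediate reward plus the discounted best Q-value achievable from the next state. Q-Learning moves its estimate toward this target after every transition: Q(s,a) ← Q(s,a) + α·[r + γ·max_{a'} Q(s', a') − Q(s,a)], where α is the learning rate. In the tabular case, the estimates converge to Q* provided every state-action pair keeps being visited and α is decayed appropriately.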
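A single tabular Q-Learning step can be sketched as follows. It nudges Q(s,a) toward the target r + γ·max_{a'} Q(s', a') by a step size α (the learning rate, a standard hyperparameter that the text above does not assign a value to). The function name, the dictionary-based table, and the default values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one Q-Learning update in place and return the TD error."""
    # No bootstrapping from the next state if the episode has ended.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # Q-table; unseen (state, action) pairs start at 0.0
actions = [0, 1]

q_learning_update(Q, "s0", 1, 1.0, "s1", actions, alpha=0.5, gamma=0.9)
print(Q[("s0", 1)])  # 0.0 + 0.5 * (1.0 + 0.9 * 0.0 - 0.0) = 0.5
```

Because the target uses the maximum over next actions rather than the action the agent actually takes next, the update is off-policy.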
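Exploration vs. Exploitation Dilemma: The agent faces a fundamental tension between exploiting known good actions (to maximize current reward) and exploring new actions (which might yield better rewards). The ε-greedy policy addresses it simply: with probability ε the agent picks a random action, and otherwise it picks the action with the highest current Q-value. A common schedule starts with a high ε (mostly exploring) and decays it over training (increasingly exploiting). Balancing the two is one of the core challenges in RL.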
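An ε-greedy action selector with a decaying ε can be sketched in a few lines. The linear decay schedule and its default values (`eps_start`, `eps_end`, `decay_steps`) are assumptions chosen for illustration; other schedules, such as exponential decay, are also common.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)


rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions
for step in (0, 5_000, 10_000):
    eps = decayed_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q_values, eps, rng))
```

With ε = 0 the selector is purely greedy, and with ε = 1 it is uniformly random.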
🧠 Deep Q-Networks (DQN)
For large state spaces (such as Atari game screens, which DQN preprocesses into 84×84-pixel frames), a Q-table is intractable. DQN replaces the table with a neural network that takes the state as input and outputs an approximate Q(s,a) for every action in a single forward pass.
DQN Key Innovations:
- Experience Replay: Store transitions (s,a,r,s') in a replay buffer. Sample random mini-batches for training. Breaks temporal correlations between consecutive transitions—stabilizes training.
- Target Network: Maintain a second network with the same architecture, whose weights are copied from the main network only periodically. It is used to compute the Q-learning target. This prevents the "chasing a moving target" instability, where the current Q estimates and the targets would otherwise change at the same time.
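The two mechanisms can be sketched without a deep learning framework. In the example below, the "networks" are stand-in dictionaries mapping a state to a list of Q-values; this is an assumption made to keep the sketch self-contained, and a real DQN would use neural network parameters instead. `ReplayBuffer`, `dqn_targets`, and `sync_target` are hypothetical names for this illustration.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up consecutive-step correlations.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def dqn_targets(batch, target_q_fn, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a') for each transition."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q_fn(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


def sync_target(online_params, target_params):
    """Copy the online parameters into the target parameters in place."""
    target_params.clear()
    target_params.update(copy.deepcopy(online_params))


# Stand-in "network parameters": per-state lists of Q-values.
online_params = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
target_params = copy.deepcopy(online_params)

buffer = ReplayBuffer(capacity=100)
buffer.push("s0", 0, 1.0, "s1", False)
buffer.push("s0", 1, 2.0, "s1", True)

batch = buffer.sample(2)
print(dqn_targets(batch, lambda s: target_params[s], gamma=0.5))
```

In a training loop, the online parameters would be updated toward these targets on every step, while `sync_target` would be called only every fixed number of steps so that the targets stay stable in between.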
🤖 RL in the Real World
- Game Playing: DeepMind's AlphaGo (2016) defeated world champion Lee Sedol 4–1 using RL combined with Monte Carlo Tree Search. AlphaZero (2017) applied a single self-play algorithm, trained separately and from scratch on Chess, Shogi, and Go, and reached superhuman play in each game within 24 hours of training, defeating a leading program in each.
- Robotics: OpenAI's Dactyl system trained a robotic hand to solve a Rubik's Cube using RL in simulation (with domain randomization) and transferred the policy to real hardware. Boston Dynamics has also reported incorporating RL into its locomotion controllers.
- LLM Alignment (RLHF): InstructGPT, ChatGPT, and Claude were fine-tuned using RL, with a reward model trained on human preference data supplying the reward signal. This is perhaps the most impactful current application of RL.
- Recommendation Systems: Platforms such as YouTube, TikTok, and Spotify are commonly cited as applying RL to optimize long-term engagement rather than only immediate click-through rates.