Chapter 8: Policy Gradients --- Exercises & Labs¶
Companion to Chapter 8: Policy Gradient Methods
This document provides detailed solutions, hints, and experimental protocols for the exercises and labs in Chapter 8.
Part A: Mathematical Exercises¶
Exercise 8.1: Baseline Variance Reduction¶
Problem: Derive the state-dependent baseline \(b^*(s)\) that minimizes the variance of the policy gradient estimator at state \(s\), and explain how it relates to the commonly used value-function baseline \(V^\pi(s)\).
Solution:
Consider the gradient estimator at a single state \(s\) (we drop time subscripts for clarity):
where \(a \sim \pi_\theta(\cdot|s)\). We want to minimize the variance:
From [THM-8.2.2], we know \(\mathbb{E}_{a \sim \pi}[\hat{g}] = \nabla_\theta J(\theta)\) regardless of \(b(s)\) (constant in expectation). Thus minimizing variance is equivalent to minimizing:
Define \(g(a) = \nabla_\theta \log \pi(a|s)\) and \(Q(a) = Q^\pi(s,a)\) for brevity. We minimize:
Take the derivative w.r.t. \(b\):
Set to zero:
Solve for \(b\):
This is the unique minimizer of \(\text{Var}[\hat{g}]\); it is a weighted average of \(Q(a)\) with weights proportional to the squared score function. The familiar value-function baseline \(V^\pi(s) = \mathbb{E}_{a}[Q^\pi(s,a)]\) drops the weights. It is optimal under the additional assumption that \(\|g(a)\|\) is approximately constant in \(a\) (true for small action perturbations or isotropic Gaussians near the mean), and it remains a powerful heuristic because \(V^\pi\) is much easier to estimate than the weighted ratio above.
Optional check. Show that if \((\nabla_\theta \log \pi(a|s))^2 = c(s)\) is constant in \(a\), then \(b^*(s) = V^\pi(s)\).
Numerical Verification (Lab Extension):
Implement multiple baselines and measure gradient variance empirically:
# In zoosim/policies/reinforce.py, modify update() to test:
# 1. b = 0 (no baseline)
# 2. b = mean(returns) (constant)
# 3. b = V_network(s_t) (learned value function)
# Track: gradient_norms = []
# Plot variance across episodes
Exercise 8.2: Entropy and Boltzmann Policy¶
Problem: Show that maximizing entropy \(H(\pi)\) subject to expected reward \(\mathbb{E}_{a \sim \pi}[Q(a)] \geq \bar{R}\) yields Boltzmann policy \(\pi(a) \propto \exp(\beta Q(a))\).
Solution:
We solve the constrained optimization problem:
Form the Lagrangian:
Take the derivative w.r.t. \(\pi(a)\):
Set to zero (first-order condition):
Exponentiate:
where \(C = \exp(\mu - 1)\) is determined by the normalization constraint \(\sum_a \pi(a) = 1\):
The Lagrange multiplier \(\lambda\) is the inverse temperature (often denoted \(\beta\)). High \(\lambda\) (low temperature) -> deterministic policy (argmax). Low \(\lambda\) (high temperature) -> uniform policy (maximum entropy).
Connection to Maximum Entropy RL: SAC [haarnoja:soft_actor_critic:2018] maximizes \(J(\pi) = \mathbb{E}[R] + \alpha H(\pi)\), which is the unconstrained version of this problem. The optimal policy is softmax w.r.t. the "soft" Q-function \(Q_{\text{soft}}(s,a)\).
Exercise 8.3: REINFORCE Convergence¶
Problem: Show that [ALG-8.1] converges to a local optimum under standard stochastic approximation conditions.
Solution (Sketch):
We apply the Robbins-Monro theorem [robbins:stochastic_approximation:1951] for stochastic gradient ascent. The update is:
where \(\hat{g}_k = \sum_t \nabla_\theta \log \pi_{\theta_k}(a_t|s_t) (G_t - b)\) is the gradient estimate at episode \(k\).
Conditions (Robbins-Monro): 1. \(\mathbb{E}[\hat{g}_k \mid \theta_k] = \nabla_\theta J(\theta_k)\) (unbiased) --- Proven in [THM-8.2] 2. \(\text{Var}[\hat{g}_k \mid \theta_k] < \infty\) --- Requires bounded rewards and policy smoothness 3. Learning rate schedule: \(\sum_k \alpha_k = \infty\) and \(\sum_k \alpha_k^2 < \infty\) (e.g., \(\alpha_k = 1/k\))
Under these conditions, \(\theta_k\) converges to a stationary point \(\theta^*\) where \(\nabla_\theta J(\theta^*) = 0\) with probability 1.
Local vs. Global Optimum: Since \(J(\theta)\) is generally nonconvex (neural network policies have many local optima), we only guarantee convergence to a local optimum. However, for certain policy classes (e.g., LQR with linear policy), all local optima are global [fazel:global_convergence:2018].
Practical Note: Modern implementations use constant learning rate \(\alpha\) (e.g., \(3 \times 10^{-4}\)) with early stopping, not diminishing \(\alpha_k\). This trades asymptotic convergence for faster practical learning.
Reference: [sutton:policy_gradient:2000, Theorem 2] provides a complete proof for the softmax policy case.
Exercise 8.4: Policy Gradient for LQR¶
Problem: Implement policy gradient for the 1D discounted LQR system and verify convergence toward the analytic gain obtained from the discounted algebraic Riccati equation (DARE).
Analytical reference. For dynamics \(s_{t+1} = s_t + a_t\), stage cost \(c_t = s_t^2 + a_t^2\), and discount \(\gamma = 0.99\), the DARE
with \(A=B=Q=R=1\) yields the positive root \(P \approx 1.62\) and feedback gain
Script (requires scipy) to compute the reference:
import numpy as np
from scipy.linalg import solve_discrete_are
def discounted_lqr_gain(gamma: float = 0.99) -> float:
A = np.array([[1.0]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
A_bar = np.sqrt(gamma) * A
B_bar = np.sqrt(gamma) * B
P = solve_discrete_are(A_bar, B_bar, Q, R)
gain = -(gamma * B.T @ P @ A) / (R + gamma * B.T @ P @ B)
return gain.item()
print(f"K* = {discounted_lqr_gain():.4f}")
Policy gradient experiment.
"""1D discounted LQR via REINFORCE.
System: s_{t+1} = s_t + a_t
Reward: r_t = -(s_t^2 + a_t^2)
Policy: pi_theta(a|s) = N(theta * s, sigma^2)
Target: theta* ~= -0.615 (DARE)
"""
import numpy as np
import matplotlib.pyplot as plt
def lqr_policy_gradient(
num_episodes: int = 500,
horizon: int = 20,
learning_rate: float = 0.05,
sigma: float = 0.5,
gamma: float = 0.99,
seed: int = 42,
target_gain: float = -0.615,
):
rng = np.random.default_rng(seed)
theta = 0.0
theta_history = [theta]
for episode in range(num_episodes):
s = 1.0
states, actions, rewards = [], [], []
for _ in range(horizon):
mu = theta * s
a = mu + sigma * rng.normal()
r = -(s**2 + a**2)
s_next = s + a
states.append(s)
actions.append(a)
rewards.append(r)
s = s_next
returns = []
G = 0.0
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
grad = 0.0
for s_t, a_t, G_t in zip(states, actions, returns):
score = (a_t - theta * s_t) * s_t / (sigma**2)
grad += score * G_t
theta += learning_rate * grad / horizon
theta_history.append(theta)
if episode % 50 == 0:
avg_return = np.mean(returns)
err = abs(theta - target_gain)
print(f"Ep {episode:3d} | theta={theta:6.3f} | Return={avg_return:7.2f} | |theta-K*|={err:6.3f}")
return theta, theta_history
theta_star = -0.615 # from discounted_lqr_gain()
theta_final, history = lqr_policy_gradient(target_gain=theta_star)
print(f"\nFinal theta = {theta_final:.4f}")
print(f"Analytical theta* = {theta_star:.4f}")
print(f"|theta - theta*| = {abs(theta_final - theta_star):.4f}")
plt.figure(figsize=(10, 4))
plt.plot(history, label='theta (policy gradient)')
plt.axhline(theta_star, color='red', linestyle='--', label='theta* (DARE)')
plt.xlabel('Episode')
plt.ylabel('Policy Parameter theta')
plt.title('Policy Gradient vs. DARE Solution')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('lqr_policy_gradient.png', dpi=150)
print("Plot saved: lqr_policy_gradient.png")
Expected Output:
Ep 0 | theta=-0.022 | Return=-27.91 | |theta-K*|= 0.593
Ep 50 | theta=-0.392 | Return=-22.83 | |theta-K*|= 0.223
Ep 100 | theta=-0.541 | Return=-21.10 | |theta-K*|= 0.074
Ep 150 | theta=-0.594 | Return=-20.58 | |theta-K*|= 0.021
Ep 200 | theta=-0.611 | Return=-20.45 | |theta-K*|= 0.004
...
Final theta = -0.6148
Analytical theta* = -0.6150
|theta - theta*| = 0.0002
Analysis. - REINFORCE converges to the discounted Riccati gain within numerical tolerance. - The DARE baseline provides an exact target for debugging policy gradient implementations. - Because discounted LQR is convex in the policy parameter, every stationary point is global [@fazel:global_convergence:2018].
Extension: Extend to \(s \in \mathbb{R}^2\) with matrix gains \(K \in \mathbb{R}^{2 \times 2}\) and verify convergence to the DARE solution on each axis.
Status --- Advanced Labs
Lab 8.1 (REINFORCE with rich features) and Lab 8.2 (Deep end-to-end REINFORCE) are fully wired to existing scripts under scripts/ch08/. Lab 8.3 (Gaussian policy ablations) and Lab 8.4 (RLHF simulation) now ship with concrete demo scripts (scripts/ch08/policy_ablation.py and scripts/ch08/rlhf_demo.py), but remain advanced: readers are encouraged to treat them as starting points for their own ablation pipelines and RLHF variants rather than fixed production code.
Part B: Computational Labs¶
Lab 8.1: REINFORCE with Rich Features¶
Objective: Reproduce Chapter 8 results and analyze hyperparameter sensitivity.
Protocol¶
Step 1: Baseline Run
cd /path/to/rl_search_from_scratch
# Ensure environment is set up
source .venv/bin/activate
pip install -e .
# Run REINFORCE with rich features
python scripts/ch08/reinforce_demo.py \
--episodes 3000 \
--lr 3e-4 \
--gamma 0.99 \
--entropy 0.01 \
--a_max 5.0 \
--rich-features \
--seed 2025
Expected Output:
Chapter 8: REINFORCE on Zooplus Search
Episodes: 3000 | Seed: 2025
Features: Rich (Dim 17)
Action Limit: 5.0 | LR: 0.0003
Observation Dim: 17
Action Dim: 10 (Continuous)
Ep 100 | Return: 9.12 | Entropy: 2.47 | Loss: -31.52
Ep 200 | Return: 9.87 | Entropy: 2.31 | Loss: -34.21
...
Ep 2900 | Return: 11.43 | Entropy: 1.98 | Loss: -42.67
Ep 3000 | Return: 11.56 | Entropy: 1.95 | Loss: -43.12
Final 100-ep Average Return: 11.56
Improvement: 9.02 -> 11.56
Agent learned to improve policy.
Step 2: Ablation --- No Entropy Regularization
python scripts/ch08/reinforce_demo.py \
--episodes 3000 \
--entropy 0.0 \
--seed 2025
Observation: Policy entropy collapses to near-zero by episode 500, final return ~9.0 (worse than baseline). This demonstrates necessity of entropy regularization.
Step 3: Ablation --- Learning Rate Sweep
Run with \(\alpha \in \{1e-5, 1e-4, 3e-4, 1e-3, 3e-3\}\):
for lr in 1e-5 1e-4 3e-4 1e-3 3e-3; do
python scripts/ch08/reinforce_demo.py \
--episodes 3000 \
--lr $lr \
--seed 2025 \
> results_lr_${lr}.txt
done
Expected Findings: - \(\alpha = 1e-5\): Too slow, doesn't converge - \(\alpha = 1e-4\): Converges to ~10.8 - \(\alpha = 3e-4\): Optimal, converges to ~11.6 - \(\alpha = 1e-3\): Slightly unstable, ~11.2 - \(\alpha = 3e-3\): Diverges (gradient steps too large)
Step 4: Comparison to Q-Learning
Run Chapter 7 Q-learning for comparison:
python scripts/ch07/continuous_actions_demo.py \
--n-episodes 3000 \
--seed 2025
Expected: Q-learning achieves ~25.0 (2.2x better than REINFORCE).
Analysis Questions: 1. Why does Q-learning outperform REINFORCE by 2x? 2. Plot variance of returns: REINFORCE vs. Q-learning. Which is noisier? 3. Compute sample efficiency: episodes to reach return 10.0. Q-learning ~300, REINFORCE ~1500.
Deliverables¶
Plots: 1. Learning curves: return vs. episode (3 seeds, mean +/- std) 2. Entropy trajectory: \(H(\pi_\theta)\) vs. episode 3. Gradient norm: \(\|\nabla J\|\) vs. episode
Table:
| Hyperparameter | Value | Final Return | Convergence Episode |
|---|---|---|---|
| \(\alpha = 3e-4\) | 0.01 | 11.56 | ~2500 |
| \(\alpha = 3e-4\), no entropy | 0 | 9.02 | ~1200 (collapsed) |
| \(\alpha = 1e-3\) | 0.01 | 11.18 | ~2200 |
Lab 8.1b: The Variance Reduction Challenge¶
Objective: Empirically verify the benefit of a learned baseline in a stochastic environment.
This lab directly implements the "Failure -> Diagnosis -> Fix" narrative from Section 8.5: we run Vanilla REINFORCE (Act I), observe the variance problem, then run REINFORCE + Baseline (Act III) to see the fix in action.
Protocol¶
Step 1: Run Vanilla REINFORCE
python scripts/ch08/reinforce_demo.py \
--episodes 5000 \
--seed 2025 \
--lr 3e-4 \
--entropy 0.01 \
--a_max 5.0 \
--rich-features
What to observe: - Jagged loss curves with large spikes - "Forgetting" --- performance spikes up then suddenly drops - Oscillating returns between 3-17, never stabilizing - Final 100-episode average: ~7.99 (worse than random baseline 8.7!)
Step 2: Run REINFORCE with Learned Baseline
python scripts/ch08/reinforce_baseline_demo.py \
--episodes 5000 \
--seed 2025 \
--lr 3e-4 \
--value-lr 1e-3 \
--entropy 0.01 \
--a_max 5.0 \
--rich-features
What to observe: - Smoother entropy decay - Monotonic improvement in returns - Value loss decreases from 100s to near-zero (critic learning \(V^\pi(s)\)) - Final 100-episode average: 13.19 (65% improvement!)
Expected Results¶
| Method | Architecture | Final Return | Convergence | Stability |
|---|---|---|---|---|
| Vanilla REINFORCE | Single policy net | ~7.99 | Never (tested to 50k episodes) | High variance, oscillates |
| REINFORCE + Baseline | Policy + Value nets | ~13.19 | ~5,000 episodes | Stable, monotonic improvement |
| Improvement | - | +65% | 10x faster | - |
Analysis Tasks¶
Task 1: Overlay Learning Curves
Plot return vs. episode for both methods on the same axes. Annotate: - Episode 500: Baseline stabilizes (~12), Vanilla still oscillating (3-15) - Episode 5000: Baseline converges (13.19), Vanilla unstable (7.99)
Task 2: Entropy Trajectory
Verify both methods maintain exploration (entropy ~14-15 throughout). This confirms the improvement is pure variance reduction, not an exploration/exploitation artifact.
Task 3: Advantage Distribution
Create histograms comparing: - Vanilla: \((G_t - b)\) where \(b\) is scalar EMA baseline - Baseline: \((G_t - V_\phi(s_t))\) where \(V_\phi\) is learned value network
You should see: - Vanilla distribution: Bimodal (mode at -10 for no-purchase episodes, scattered +30 to +50 for purchases) - Baseline distribution: Symmetric around 0, tighter std (~7 vs ~15 for vanilla)
Task 4: Variance Quantification
Compute gradient variance empirically:
# During training, collect:
gradients = [] # List of gradient norms per episode
# Then compute:
gradient_variance = np.var(gradients)
Expected: Baseline reduces gradient variance by ~50% compared to vanilla.
Key Insight¶
The learned value baseline \(V_\phi(s)\) removes state-dependent variance that a scalar baseline cannot. This is [THM-8.5.2] in action:
- States with high expected return (e.g., price_sensitive user on budget query): \(V(s) \approx 15\)
- States with low expected return (e.g., random user on generic query): \(V(s) \approx 5\)
- Scalar baseline: \(b = 10\) (global average, wrong for both states!)
By adapting the baseline to each state's intrinsic value, we eliminate the noise from "environmental luck" and allow the policy to learn from genuine action quality.
Deliverables¶
Plots: 1. Learning curves (overlay): return vs. episode for vanilla and baseline 2. Entropy trajectories: verify exploration maintained in both 3. Value loss curve (baseline only): should decrease from \(10^2\) to \(<5\) 4. Advantage distribution histograms: \((G_t - b)\) vs \((G_t - V(s))\)
Table:
| Metric | Vanilla REINFORCE | REINFORCE + Baseline | Improvement |
|---|---|---|---|
| Final Return | 7.99 | 13.19 | +65% |
| Convergence Episodes | Never (50k+) | ~5,000 | 10x faster |
| Gradient Std Dev | ~15 | ~7 | 50% reduction |
| Stability | Oscillates | Monotonic | Qualitative |
Lab 8.2: Deep End-to-End REINFORCE (Failure Analysis)¶
Objective: Understand why end-to-end learning from raw features fails.
Protocol¶
Step 1: Reproduce Failure
python scripts/ch08/neural_reinforce_demo.py \
--episodes 5000 \
--lr 3e-4 \
--hidden-dim 128 \
--device auto \
--seed 2025
Expected Output:
Deep REINFORCE (End-to-End Learning)
Device: CPU (or CUDA/MPS if available)
Episodes: 5000 | Seed: 2025
Network: Raw Input -> [128, 128] -> Action
Observation Dim: 8 (Raw Context)
Action Dim: 10
Ep 200 | Return: 7.23 | Loss: -28.91 | 45.2s
Ep 400 | Return: 6.87 | Loss: -26.34 | 91.8s
...
Ep 4800 | Return: 5.91 | Loss: -22.15 | 1103.2s
Ep 5000 | Return: 5.93 | Loss: -22.08 | 1150.1s
Final Average Return: 5.93
---
Contextual Baselines (Approximate):
Random Agent: ~8.7
Rich Features (Ch6/8): ~10.5 - 11.5
Continuous Q (Ch7): ~25.0
---
NOTE: Training unstable or insufficient. Deep RL from scratch is hard!
Observation: Final return 5.93 is worse than random (8.7). The network failed to learn useful features.
Step 2: Increase Training Budget
python scripts/ch08/neural_reinforce_demo.py \
--episodes 10000 \
--seed 2025
Result: Performance improves to ~7.5 by episode 10k, but still below random. Simply training longer doesn't solve the problem.
Step 3: Add Supervised Pretraining
Modify neural_reinforce_demo.py to pretrain feature layers on supervised task:
# Pretrain feature extractor on category prediction
# Input: user segment + query type -> Output: clicked category (classification)
# Load logged data from Chapter 6
logged_data = load_logged_trajectories() # (s, a, r, clicked_cat)
# Train feature net
feature_net = GaussianPolicy.net # Extract feature layers
optimizer = Adam(feature_net.parameters(), lr=1e-3)
for epoch in range(10):
for batch in logged_data:
s, _, _, cat = batch
pred_cat = feature_net(s) # Add classification head
loss = cross_entropy(pred_cat, cat)
loss.backward()
optimizer.step()
# Then initialize REINFORCE policy with pretrained features
Expected: Pretraining improves final return to ~9.5 (better, but still below rich features).
Step 4: Comparison to Neural Linear Bandit (Chapter 6A)
Why did neural features work in Chapter 6A but fail here?
# Run Ch6A neural linear bandit
python scripts/ch06a/neural_linear_demo.py --episodes 3000 --seed 2025
# Expected: ~10.0 return (success!)
Analysis: Neural linear learns a value function (dense reward signal every step). REINFORCE learns from episodic returns (sparse signal). Value-based learning is more sample-efficient for feature learning.
Deliverables¶
Report: 1. Plot: Return vs. episode for raw REINFORCE, pretrained REINFORCE, rich-feature REINFORCE 2. Gradient flow analysis: Compute \(\|\nabla \theta_{\text{features}}\|\) (feature gradient norm) vs. \(\|\nabla \theta_{\text{policy head}}\|\). Show that feature gradients vanish. 3. Recommendation: Use Actor-Critic (Chapter 12) or pretrained representations for end-to-end learning.
Lab 8.3: Gaussian Policy Ablations¶
Objective: Compare action distribution parameterizations.
Turnkey script: scripts/ch08/policy_ablation.py implements the ablation harness used in this lab; you can run it directly with --policy {gaussian,beta,squashed,fixed_std} as shown below.
Implementations¶
Modify zoosim/policies/reinforce.py:
1. Beta Distribution Policy
from torch.distributions import Beta
class BetaPolicy(nn.Module):
"""Beta policy with support [0, 1], scaled to [-a_max, +a_max]."""
def __init__(self, obs_dim: int, action_dim: int, hidden_sizes: Tuple[int, ...]):
super().__init__()
self.net = build_mlp(obs_dim, hidden_sizes)
self.alpha_head = nn.Linear(hidden_sizes[-1], action_dim)
self.beta_head = nn.Linear(hidden_sizes[-1], action_dim)
def forward(self, obs: torch.Tensor) -> Distribution:
x = self.net(obs)
alpha = torch.exp(self.alpha_head(x)) + 1.0 # alpha > 1 for unimodal
beta = torch.exp(self.beta_head(x)) + 1.0 # beta > 1 for unimodal
return Beta(alpha, beta)
def sample_action(self, obs: torch.Tensor, a_max: float) -> torch.Tensor:
dist = self.forward(obs)
# Beta samples in [0, 1], scale to [-a_max, +a_max]
a_01 = dist.sample()
return 2 * a_max * (a_01 - 0.5)
2. Squashed Gaussian (Tanh)
class SquashedGaussianPolicy(nn.Module):
"""Gaussian with tanh squashing (SAC-style)."""
def forward(self, obs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
mu = self.mu_head(self.net(obs))
std = torch.exp(torch.clamp(self.log_std, -2, 1))
# Sample from unbounded Gaussian
z = mu + std * torch.randn_like(mu)
# Squash to [-1, 1] via tanh
action = torch.tanh(z)
# Compute log-prob with change-of-variables
# log pi(a) = log pi(z) - log |da/dz|
# |da/dz| = 1 - tanh^2(z)
log_prob = Normal(mu, std).log_prob(z).sum(dim=-1)
log_prob -= torch.log(1 - action.pow(2) + 1e-6).sum(dim=-1)
return action, log_prob
3. Fixed Std Gaussian
class FixedStdGaussianPolicy(nn.Module):
"""Learn mean only, fix std=1."""
def __init__(self, obs_dim, action_dim, hidden_sizes, fixed_std=1.0):
super().__init__()
self.net = build_mlp(obs_dim, hidden_sizes)
self.mu_head = nn.Linear(hidden_sizes[-1], action_dim)
self.fixed_std = fixed_std
def forward(self, obs):
mu = self.mu_head(self.net(obs))
std = torch.full_like(mu, self.fixed_std)
return Normal(mu, std)
Experiments¶
Run each variant for 3000 episodes, 3 seeds:
for policy in gaussian beta squashed fixed_std; do
for seed in 2025 2026 2027; do
python scripts/ch08/policy_ablation.py \
--policy $policy \
--episodes 3000 \
--seed $seed \
> results_${policy}_seed${seed}.txt
done
done
Compare:
| Policy | Mean Return | Std Dev | Entropy (final) | Training Stability |
|---|---|---|---|---|
| Gaussian | 11.56 | 0.42 | 1.95 | Stable |
| Beta | 10.87 | 0.68 | 1.72 | Mode collapse |
| Squashed | 11.34 | 0.51 | 1.88 | Stable |
| Fixed Std | 10.21 | 0.73 | 2.30 (const) | Underexplores late |
Findings: - Gaussian: Best performance, standard choice - Beta: Natural bounds but harder to optimize (mode collapse to boundary) - Squashed: Nearly as good as Gaussian, better if strict bounds needed (SAC) - Fixed Std: Can't adapt exploration, underperforms
Lab 8.4: RLHF Simulation (Advanced)¶
Objective: Simulate the InstructGPT/RLHF pipeline for search ranking.
We do not introduce new core theory here. Instead, we compose three ingredients you already know—(i) a supervised Gaussian policy (Chapter 8 + SFT), (ii) a learned scalar reward model from pairwise preferences (Bradley–Terry), and (iii) a REINFORCE-style policy gradient with a KL regularizer—to build an InstructGPT-style pipeline in the Zoosim search environment.
Big Picture — What's Reused vs New?
Reused from earlier chapters and labs - Gaussian policy \(\pi_\theta(a|s)\) with entropy regularization (Chapter 8, Lab 8.1). - Policy gradient / REINFORCE update with log-prob weights (Chapter 8, [ALG-8.1]). - KL divergence as a regularizer (previewed in PPO Section 8.7.2; used here to keep the RL policy close to the SFT policy).
New in this lab (structural additions) - Trajectory-level reward model \(r_\phi(\tau)\) trained from pairwise preferences via a Bradley–Terry objective. - Three-phase workflow: SFT -> reward modeling -> RL fine-tuning with KL penalty, mirroring the InstructGPT RLHF pipeline. - Simple "expert" template policy in Zoosim used to generate demonstration data and preference pairs (standing in for human raters).
Turnkey demo: scripts/ch08/rlhf_demo.py provides a compact, runnable version of the three-phase RLHF-style pipeline described here (SFT -> reward modeling -> KL-regularized REINFORCE on the learned reward).
Three-Phase Protocol¶
Phase 1: Supervised Fine-Tuning (SFT)
Train policy via imitation learning on expert demonstrations:
# Load expert trajectories (discrete templates from Ch6)
expert_data = load_expert_trajectories() # (s, a_expert, r)
# Train policy to match expert actions
policy = GaussianPolicy(obs_dim, action_dim, hidden_sizes)
optimizer = Adam(policy.parameters(), lr=1e-3)
for epoch in range(50):
for batch in expert_data:
s, a_expert = batch['state'], batch['action']
# Supervised loss: negative log-likelihood
dist = policy(s)
loss = -dist.log_prob(a_expert).mean()
loss.backward()
optimizer.step()
# This policy pi_SFT mimics expert behavior
Phase 2: Reward Modeling (Bradley-Terry)
Learn reward function from pairwise comparisons:
# Minimal setup
import torch
import torch.nn.functional as F
from torch.optim import Adam
# Generate trajectory pairs and get human preferences
# (In practice, this would be A/B test data or crowdsourced labels)
trajectories = generate_trajectory_pairs(policy_sft, env, num_pairs=1000)
# For each pair (tau_1, tau_2), ask: "Which ranking is better?"
# Simulate with ground-truth GMV: prefer higher GMV
preferences = []
for tau1, tau2 in trajectories:
if GMV(tau1) > GMV(tau2):
preferences.append((tau1, tau2, torch.tensor(1.0))) # tau1 preferred -> label 1
else:
preferences.append((tau1, tau2, torch.tensor(0.0))) # tau2 preferred -> label 0
# Train reward model: r_phi(tau) such that P(tau1 > tau2) = sigmoid(r_phi(tau1) - r_phi(tau2))
reward_model = RewardMLP(state_dim, hidden=128)
optimizer = Adam(reward_model.parameters(), lr=1e-3)
for epoch in range(20):
for tau1, tau2, label in preferences:
r1 = reward_model(tau1) # Aggregate trajectory features
r2 = reward_model(tau2)
# Bradley-Terry loss: logit = r1 - r2, target = 1 if tau1 preferred
logit = r1 - r2
loss = F.binary_cross_entropy_with_logits(logit, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
The target label equals 1 when \(\tau_1 \succ \tau_2\) so that \(\sigma(r_\phi(\tau_1) - r_\phi(\tau_2))\) approximates $\mathbb{P}(\tau_1 \text{ preferred})`, matching the classical Bradley-Terry model.
Phase 3: RL Fine-Tuning (PPO with KL Penalty)
Optimize learned reward + KL penalty from SFT policy:
# Objective: max E[r_phi(tau)] - beta * KL(pi_theta || pi_SFT)
# This prevents the policy from deviating too far from safe SFT behavior
policy_rl = copy.deepcopy(policy_sft) # Initialize from SFT
policy_sft.eval() # Freeze SFT policy as reference
for episode in range(2000):
# Generate trajectory under pi_theta
tau = rollout(policy_rl, env)
# Compute rewards from learned model
rewards_learned = [reward_model(s, a) for s, a in tau]
# Compute KL penalty
kl_penalty = 0
for s, a in tau:
dist_rl = policy_rl(s)
dist_sft = policy_sft(s)
kl = torch.distributions.kl_divergence(dist_rl, dist_sft).mean()
kl_penalty += kl
# Total objective
total_reward = sum(rewards_learned) - beta * kl_penalty
# REINFORCE update (or use PPO for better stability)
loss = compute_policy_gradient_loss(policy_rl, tau, total_reward)
loss.backward()
optimizer.step()
Evaluation¶
Metrics: 1. Reward Model Accuracy: Test set AUC for pairwise preferences 2. KL Divergence: \(D_{\text{KL}}(\pi_{\text{RL}} \| \pi_{\text{SFT}})\) (should stay below threshold) 3. Ground Truth GMV: Final policy's actual GMV (not learned reward)
Expected Results: - Phase 1 (SFT): GMV ~10.4 (matches expert templates) - Phase 2: Reward model achieves 78% pairwise accuracy - Phase 3 (RL): GMV improves to ~12.1 (beats SFT, stays safe via KL penalty)
Comparison to Standard RL: - Standard REINFORCE (Ch8): GMV ~11.6 but high variance - RLHF pipeline: GMV ~12.1, more stable (SFT initialization + KL constraint)
Connection to LLM RLHF¶
This is structurally identical to InstructGPT [ouyang:training_language:2022]: - SFT: Train on human demonstrations -> mimics expert search rankers - RM: Learn from pairwise preferences -> "which result set is better?" - PPO: Optimize learned reward + KL penalty -> improves while staying safe
The key insight: RL on learned rewards is more sample-efficient than RL on sparse ground-truth rewards (GMV), especially when human feedback is available.
Part C: Additional Investigations¶
Investigation 8.A: Variance Reduction Techniques¶
Compare variance reduction methods:
- Baseline: None, constant, value network
- Return normalization: Raw, normalized, standardized
- GAE (Generalized Advantage Estimation): \(\lambda \in \{0, 0.9, 0.95, 0.99\}\)
Protocol:
- Implement GAE in zoosim/policies/reinforce.py (requires value network)
- Measure gradient variance: \(\text{Var}[\nabla J]\) across episodes
- Plot bias-variance tradeoff: GAE \(\lambda\) vs. final return
Expected: GAE with \(\lambda=0.95\) reduces variance by 5x while introducing <10% bias.
Investigation 8.B: Trust Region Analysis¶
Question: Why does REINFORCE sometimes collapse catastrophically?
Approach: - Track policy change per update: \(D_{\text{KL}}(\pi_{\theta_k} \| \pi_{\theta_{k-1}})\) - Identify episodes where KL divergence spikes (large policy change) - Correlate with performance drops
Finding: Large KL steps (>0.5 nats) often precede performance collapse. This motivates PPO's trust region constraint.
Implement Simple Trust Region:
# In update(), after computing loss:
kl_div = compute_kl_divergence(policy_new, policy_old)
if kl_div > kl_threshold:
# Reject update, halve learning rate
learning_rate *= 0.5
Investigation 8.C: Sample Complexity Lower Bounds¶
Theoretical Question: Is REINFORCE's sample inefficiency fundamental or fixable?
Analysis: - Compare sample complexity: REINFORCE \(O(1/\varepsilon^2)\), Q-learning \(O(1/\varepsilon)\) - Derive information-theoretic lower bound for policy gradient (see [agarwal:theory_of_reinforcement_learning:2021]) - Show that Monte Carlo estimation inherently has higher variance than TD
Conclusion: On-policy methods fundamentally need more samples than off-policy (no replay buffer). PPO + parallelism (A3C-style) mitigates this in practice.
Summary¶
These exercises and labs provide hands-on experience with policy gradient methods, from rigorous mathematical proofs (optimal baseline, entropy regularization) to production implementation challenges (sample efficiency, feature learning, RLHF pipelines).
Key Skills Developed: - Variance reduction techniques (baselines, normalization, GAE) - Action distribution design (Gaussian, Beta, squashed) - Failure mode diagnosis (end-to-end learning, entropy collapse) - Modern RL pipelines (RLHF simulation)
Next Steps: - Chapter 9: Off-Policy Evaluation (learn from logged data without live interaction) - Chapter 12: Actor-Critic methods (reduce variance via value function bootstrapping) - Chapter 13: PPO and trust regions (prevent catastrophic updates)
References¶
Code files referenced:
- zoosim/policies/reinforce.py — REINFORCEAgent implementation
- scripts/ch08/reinforce_demo.py — Main training script
- scripts/ch08/neural_reinforce_demo.py — Deep end-to-end experiment
External references: - [@robbins:stochastic_approximation:1951] — Robbins-Monro theorem - [@fazel:global_convergence:2018] — LQR policy gradient convergence - [@ouyang:training_language:2022] — RLHF for InstructGPT - [@agarwal:theory_of_reinforcement_learning:2021] — Sample complexity lower bounds