Chapter 8: Policy Gradient Methods --- Direct Optimization of Search Performance¶

Vlad Prytula¶

Introduction: The Value-Function Detour¶

In Chapters 6 and 7, we optimized search ranking through value-based methods: we learned estimators $Q_\theta(s,a) \approx \mathbb{E}[R \mid s,a]$ and extracted policies by maximization $\pi(s) = \arg\max_a Q_\theta(s,a)$. This indirect approach---learn values, then maximize---has served us well. Our continuous Q-learning agent (Chapter 7) achieved returns of ~25.0, far surpassing random baselines (~8.7) and discrete bandits (~10.4).

But value-based methods have fundamental limitations. Consider three challenges endemic to e-commerce search:

1. The Maximization Bottleneck. For continuous boost weights $a \in \mathbb{R}^K$ (where $K=10$ product categories), computing $\arg\max_a Q(s,a)$ requires optimization at every decision point. Our Chapter 7 agent used Cross-Entropy Method (CEM) with 64 samples per query. This works, but it's expensive and approximate.

2. The Exploration-Exploitation Dilemma in Continuous Spaces. LinUCB (Chapter 6) elegantly balances exploration via uncertainty quantification: $\text{UCB}(a) = \hat{Q}(a) + \beta \sqrt{\text{Var}(a)}$. But for continuous actions, maintaining calibrated uncertainty becomes intractable---our covariance matrix is $K \times K$, and the surface $Q(s,\cdot)$ is nonlinear.

3. The Stochasticity Requirement. E-commerce users are heterogeneous and noisy. A deterministic policy $\pi(s) = a^*$ commits to a single strategy. But what if different users in the same state segment respond differently? A stochastic policy $\pi(a|s)$ can hedge across multiple plausible actions, improving robustness to model misspecification.

Policy gradient methods eliminate the value-function detour. Instead of learning $Q$ and maximizing, we parameterize the policy directly:

$$ \pi_\theta(a|s) = \mathbb{P}(A=a \mid S=s; \theta) \tag{8.1} $$

and optimize the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)]$ via gradient ascent. This chapter develops the mathematical foundations (Policy Gradient Theorem), implements REINFORCE---the Monte Carlo policy gradient estimator---and confronts the theory-practice gap through rigorous experiments.

Roadmap: - Section 8.1: Mathematical foundations (performance objective, score function estimator, baseline variance reduction) - Section 8.2: The Policy Gradient Theorem (rigorous proof, connection to HJB equations) - Section 8.3: REINFORCE algorithm (specification and production implementation) - Section 8.4: Gaussian policies for continuous actions (parameterization, entropy regularization) - Section 8.5: Empirical analysis (REINFORCE vs. Q-learning, feature engineering, deep end-to-end attempts) - Section 8.6: Control theory connections (Pontryagin's principle, continuous-time limits) - Section 8.7: Modern context (Actor-Critic, PPO, RLHF)

8.1 Mathematical Foundations¶

8.1.1 The Performance Objective¶

We formalize the policy optimization problem. Let $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ denote our search MDP (defined in Chapter 3). Throughout this chapter we treat $\mathcal{S}$ and $\mathcal{A}$ as Polish spaces equipped with their Borel $\sigma$-algebras so that all kernels and integrals are well-defined; readers working with finite sets can read "Borel" as "discrete" with no loss of generality. The transition kernel $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and reward function $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ are measurable, and $\gamma \in (0,1]$ is the discount factor.

Practitioner Shortcut

If you are comfortable thinking of states and actions as finite sets or subsets of $\mathbb{R}^d$, you can safely ignore the Polish/Borel language and read all expectations as sums or integrals over $(s,a)$. The regularity assumptions we impose (bounded rewards, smooth policies) are exactly those satisfied by the production REINFORCEAgent in zoosim/policies/reinforce.py: rewards are clipped via SimulatorConfig, and policies are implemented with smooth neural networks and Gaussian heads.

A parameterized policy $\pi_\theta$ defines a conditional distribution over actions given states: - For discrete $\mathcal{A}$, $\pi_\theta(a|s)$ is a probability mass function with $\sum_{a \in \mathcal{A}} \pi_\theta(a|s) = 1$. - For continuous $\mathcal{A} \subseteq \mathbb{R}^K$, $\pi_\theta(a|s)$ is a density with respect to Lebesgue measure satisfying $\int_{\mathcal{A}} \pi_\theta(a|s) \, da = 1$.

We suppress the distinction in notation---$\pi_\theta(a|s)$ always refers to a conditional distribution---but keep in mind which measure is implied. The parameters $\theta \in \mathbb{R}^d$ (e.g., neural network weights) determine both the policy mean and its stochasticity.

A trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ is generated by sampling: - Initial state $s_0 \sim \rho_0$ (start-state distribution) - Actions $a_t \sim \pi_\theta(\cdot|s_t)$ - Transitions $s_{t+1} \sim P(\cdot|s_t, a_t)$ - Rewards $r_t = R(s_t, a_t, s_{t+1})$

The return of a trajectory is the discounted sum of rewards:

$$ G(\tau) = \sum_{t=0}^{T} \gamma^t r_t \tag{8.2} $$

Episode structure. We study episodic tasks where trajectories terminate either after a fixed horizon $T$ or upon reaching an absorbing terminal state. The sum in [EQ-8.2] therefore ranges over the realized trajectory length. For continuing tasks with $\gamma < 1$, the infinite sum converges almost surely under the boundedness assumptions introduced later in this section.

Our goal is to maximize the expected return:

$$ J(\theta) = \mathbb{E}{\tau \sim \pi\theta}[G(\tau)] = \mathbb{E}{\tau \sim \pi\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t, s_{t+1})\right] \tag{8.3} $$

We assume $|R(s,a,s')| \leq R_{\max} < \infty$ so that $J(\theta)$ is finite and gradients can interchange with expectations via dominated convergence.

Notation. We write $\tau \sim \pi_\theta$ to denote that $\tau$ is sampled under policy $\pi_\theta$. The trajectory distribution is:

$$ p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t) \tag{8.4} $$

This makes the dependence on $\theta$ explicit: changing policy parameters alters the trajectory distribution, which in turn changes expected returns.

Code <-> Config (Performance Objective)

The objective [EQ-8.3] is implemented in the training loop at scripts/ch08/reinforce_demo.py:84-110: - Episode rollout: Lines 89-98 generate trajectories $\tau$ by sampling actions $a_t \sim \pi_\theta(\cdot|s_t)$ - Return computation: agent.update() computes $G(\tau)$ at zoosim/policies/reinforce.py:149-155 - Gradient ascent: loss.backward() at line 185 differentiates $J(\theta)$ w.r.t. $\theta$

8.1.2 The Challenge: Differentiating Expectations¶

The difficulty is that $J(\theta)$ is an expectation over a $\theta$-dependent distribution. We cannot simply "differentiate the expectation" naively because the sample space itself changes with $\theta$. This is unlike supervised learning, where we differentiate $\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f_\theta(x), y)]$ and the data distribution $\mathcal{D}$ is fixed.

To make progress, we employ the score function estimator (also called the log-derivative trick or REINFORCE trick). This is a general technique from stochastic optimization, dating back to [@williams:simple_statistical:1992] in RL and earlier work in econometrics.

At a high level: - Goal: differentiate an expectation $J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)]$. - Obstacle: the distribution $p_\theta$ itself depends on $\theta$, so we cannot treat it as fixed. - Trick: rewrite the gradient as $\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)\,\nabla_\theta \log p_\theta(x)]$, which we can estimate from samples by evaluating $f(x)$ and the score $\nabla_\theta \log p_\theta(x)$.

Lemma 8.1.1 (Score Function Estimator / Log-Derivative Trick) {#LEM-8.1}

Let $(\mathcal{X}, \Sigma, \mu)$ be a measure space, $\Theta \subseteq \mathbb{R}^d$ an open set, and $\{p_\theta\}_{\theta \in \Theta}$ a family of probability densities on $\mathcal{X}$ with respect to $\mu$. Fix $\theta_0 \in \Theta$ and assume:

(Differentiability) For $\mu$-almost every $x$, the map $\theta \mapsto p_\theta(x)$ is $C^1$ in a neighborhood of $\theta_0$.
(Dominated derivative) There exists $g \in L^1(\mathcal{X}, \mu)$ such that $\|\nabla_\theta p_\theta(x)\| \leq g(x)$ for all $\theta$ near $\theta_0$ and $\mu$-a.e. $x$.
(Integrability of $f$) $f : \mathcal{X} \to \mathbb{R}$ satisfies $\int_{\mathcal{X}} |f(x)| p_\theta(x) \, d\mu(x) < \infty$ for all $\theta$ near $\theta_0$.

Then:

$$ \nabla_\theta \mathbb{E}{x \sim p\theta}[f(x)] = \mathbb{E}{x \sim p\theta}[f(x) \nabla_\theta \log p_\theta(x)] \tag{8.5} $$

Proof. Work in the continuous case; the discrete case replaces integrals with sums and the same argument goes through. Expand the expectation:

\[ \mathbb{E}_{x \sim p_\theta}[f(x)] = \int_{\mathcal{X}} p_\theta(x) f(x) \, dx \]

By (2) the family $\{\nabla_\theta p_\theta f\}$ is dominated by $g(x)|f(x)|$, which is integrable by (3). Dominated convergence therefore allows us to differentiate under the integral sign:

\[ \nabla_\theta \mathbb{E}_{x \sim p_\theta}[f(x)] = \int_{\mathcal{X}} \nabla_\theta p_\theta(x) \, f(x) \, dx \]

Now apply the identity $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$. This follows from the chain rule:

\[ \nabla_\theta \log p_\theta(x) = \frac{1}{p_\theta(x)} \nabla_\theta p_\theta(x) \quad \implies \quad \nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x) \]

Substitute:

\[ \int_{\mathcal{X}} \nabla_\theta p_\theta(x) \, f(x) \, dx = \int_{\mathcal{X}} p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, f(x) \, dx = \mathbb{E}_{x \sim p_\theta}[f(x) \nabla_\theta \log p_\theta(x)] \]

This completes the proof. $\square$

Remark. Neural policies built from smooth activations (tanh, sigmoid, softplus) satisfy (1), compact parameter neighborhoods guarantee (2), and bounded episodic rewards ($|r_t| \leq R_{\max}$ implies $|G| \leq R_{\max}/(1-\gamma)$, so $f$ is bounded) ensure (3) and the dominated convergence step. Thus the lemma covers the policies we deploy in production.

Intuition. The score function $\nabla_\theta \log p_\theta(x)$ measures how the log-probability of outcome $x$ changes with $\theta$. High-reward outcomes $x$ (large $f(x)$) with positive score (increasing probability) yield positive gradient contributions, pushing $\theta$ toward policies that make good outcomes more likely. Crucially, we do not need to differentiate $f(x)$---this is why policy gradients work in black-box environments where $f$ (e.g., user purchase behavior) is unknown.

Remark. The regularity condition for differentiating under the integral requires $p_\theta$ to have sufficient smoothness (e.g., $p_\theta \in C^1$ with integrable derivatives). For neural network policies with continuous activations (tanh, sigmoid, softplus), this holds. For discrete distributions (softmax over finite actions), the discrete analogue is straightforward.

8.2 The Policy Gradient Theorem¶

We now apply [LEM-8.1] to derive the foundational result of policy gradient methods. This theorem, due to [@sutton:policy_gradient:2000] with roots in [@williams:simple_statistical:1992], shows that the gradient of expected return can be estimated from trajectory samples without knowing the environment dynamics.

Theorem 8.2.1 (Policy Gradient Theorem) {#THM-8.2}

Let $\pi_\theta$ be a differentiable policy whose log-probabilities admit gradients and assume the reward function is bounded: $|R(s,a,s')| \leq R_{\max} < \infty$. Then for the MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ the gradient of the performance objective [EQ-8.3] exists and equals

$$ \nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t \right] \tag{8.6} $$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return-to-go from time $t$.

Proof. We expand $J(\theta)$ using the trajectory distribution [EQ-8.4]:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \int p_\theta(\tau) G(\tau) \, d\tau \]

Apply [LEM-8.1] with $p_\theta(\tau)$ as the distribution and $G(\tau)$ as the reward function:

$$ \nabla_\theta J(\theta) = \mathbb{E}{\tau \sim \pi\theta}[G(\tau) \nabla_\theta \log p_\theta(\tau)] \tag{8.7} $$

Now compute $\nabla_\theta \log p_\theta(\tau)$. From [EQ-8.4]:

\[ \log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t) \]

Differentiate:

\[ \nabla_\theta \log p_\theta(\tau) = \nabla_\theta \left[\sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t)\right] + \nabla_\theta \left[\log \rho_0(s_0) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t)\right] \]

Key observation: The start-state distribution $\rho_0$ and transition dynamics $P$ do not depend on policy parameters $\theta$. Therefore:

\[ \nabla_\theta \log \rho_0(s_0) = 0, \quad \nabla_\theta \log P(s_{t+1}|s_t, a_t) = 0 \]

This gives:

$$ \nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \tag{8.8} $$

Substitute [EQ-8.8] into [EQ-8.7]:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \]

Now expand $G(\tau) = \sum_{t=0}^{T} \gamma^t r_t$:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T} \gamma^t r_t\right) \sum_{t'=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'})\right] \]

Distribute the product:

\[ = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t'=0}^{T-1} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'}) \sum_{t=0}^{T} \gamma^t r_t\right] \]

Causality refinement. Fix $t'$ and consider any $k < t'$. Using the tower property of conditional expectation and the fact that $r_k$ is measurable with respect to the history up to time $k$:

\[ \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_{t'}|s_{t'}) \, \gamma^k r_k\right] = \mathbb{E}\!\left[\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_{t'}|s_{t'}) \mid s_{t'}\right] \gamma^k r_k\right] \]

The inner expectation vanishes because

\[ \mathbb{E}_{a \sim \pi_\theta(\cdot|s_{t'})}[\nabla_\theta \log \pi_\theta(a|s_{t'})] = \int_{\mathcal{A}} \pi_\theta(a|s_{t'}) \nabla_\theta \log \pi_\theta(a|s_{t'}) \, da = \nabla_\theta \int_{\mathcal{A}} \pi_\theta(a|s_{t'}) \, da = \nabla_\theta 1 = 0, \]

exactly the mechanism formalized in [THM-8.2.2] (Baseline Invariance), which we prove later in this section. Therefore every term involving rewards collected before $t'$ contributes zero in expectation, and we may replace the full return with the return-to-go $G_{t'} = \sum_{k=t'}^{T} \gamma^{k-t'} r_k$ without bias. Equivalently, for each $t'$ we can replace $G(\tau)$ by $G_{t'}$ in [EQ-8.7] without changing the expectation. Consequently,

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t\right] \]

Extending the sum to $t=T$ (where $G_T = 0$ typically) completes the proof. $\square$

Interpretation. [THM-8.2] says that to estimate $\nabla_\theta J(\theta)$, we: 1. Sample a trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ from $\pi_\theta$ 2. Compute returns-to-go $G_t$ at each timestep 3. Weight the policy gradient $\nabla_\theta \log \pi_\theta(a_t|s_t)$ by $G_t$ 4. Average over many trajectories to reduce variance

Critically, this estimator requires only rollouts---no model of $P(s'|s,a)$, no value function bootstrapping. This is model-free, Monte Carlo policy optimization.

Discounting convention. We discount the return-to-go relative to the local timestep: $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$. This differs from the global convention $\sum_{k=t}^{T} \gamma^{k} r_k$ by a factor $\gamma^t$ that does not depend on $a_t$, so both lead to identical gradients.

Remark (Connection to Control Theory). In continuous time, [THM-8.2] connects to Pontryagin's Maximum Principle: the costate (adjoint) dynamics naturally arise from differentiating the Hamiltonian. We explore this in Section 8.6, and in discrete time Exercise 8.4 (Policy Gradient for LQR) together with the helper script scripts/ch08/discounted_lqr_gain.py shows how the policy gradient recovers the optimal LQR gain in 1D.

8.2.1 Variance Reduction via Baselines¶

The raw estimator [EQ-8.6] is unbiased but has high variance. Returns $G_t$ can vary dramatically across episodes (e.g., in sparse-reward environments, most episodes yield $G_t=0$). High variance slows convergence and destabilizes learning.

The standard solution is to subtract a baseline $b(s_t)$ that does not depend on $a_t$. The following theorem formalizes this.

Theorem 8.2.2 (Baseline Invariance) {#THM-8.2.2}

Let $b : \mathcal{S} \to \mathbb{R}$ be any function that does not depend on $a$. Define the baselined gradient estimator:

$$ \hat{g}(\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, (G_t - b(s_t)) \tag{8.9} $$

Then $\mathbb{E}_{\tau \sim \pi_\theta}[\hat{g}(\theta)] = \nabla_\theta J(\theta)$. That is, subtracting a baseline does not introduce bias.

Proof. We must show that $\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \, b(s_t)\right] = 0$. Fix timestep $t$ and consider:

\[ \mathbb{E}_{s_t, a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a_t|s_t) \, b(s_t)] = \mathbb{E}_{s_t}\left[b(s_t) \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t)]\right] \]

The inner expectation is:

\[ \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t)] = \int \pi_\theta(a|s_t) \nabla_\theta \log \pi_\theta(a|s_t) \, da \]

Apply the identity $\nabla_\theta \log \pi = \frac{1}{\pi} \nabla_\theta \pi$:

\[ = \int \pi_\theta(a|s_t) \frac{\nabla_\theta \pi_\theta(a|s_t)}{\pi_\theta(a|s_t)} \, da = \int \nabla_\theta \pi_\theta(a|s_t) \, da = \nabla_\theta \int \pi_\theta(a|s_t) \, da \]

Since $\int \pi_\theta(a|s_t) \, da = 1$ (probability normalization), its gradient is zero:

\[ \nabla_\theta \int \pi_\theta(a|s_t) \, da = \nabla_\theta 1 = 0 \]

Therefore:

\[ \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t)] = 0 \]

Summing over all timesteps completes the proof. $\square$

Optimal Baseline. The variance-minimizing baseline (in expectation over $a$) is $b^*(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[G \mid s]$, which is precisely the value function $V^\pi(s)$ (see Exercise 8.1 in the companion exercises_labs.md for a full derivation and a weighted variant where the score function norm appears in the weights). In practice, we use: - Constant baseline: $b = \bar{G}$ (running average of returns) - Learned value baseline: $b(s) = V_\phi(s)$ (a neural network critic, leading to Actor-Critic methods in Chapter 12)

Our REINFORCE implementation uses a simple running average baseline with exponential moving average (EMA):

Code <-> Algorithm (Baseline Subtraction)

The baseline $b$ from [EQ-8.9] is implemented in zoosim/policies/reinforce.py:160-173: - Line 162: batch_mean = returns_t.mean().item() computes episode return average - Line 163-166: self.baseline = momentum * self.baseline + (1 - momentum) * batch_mean maintains EMA - Line 170: returns_t = (returns_t - returns_t.mean()) / returns_t.std() normalizes (stronger variance reduction)

Remark (Normalization). Return normalization $(G_t - \mu) / \sigma$ is common in deep RL [@schulman:high_dimensional:2015]. This is not unbiased (it introduces correlation between gradient estimates in a batch), but in practice it dramatically stabilizes training and is universally used in modern implementations.

8.3 The REINFORCE Algorithm¶

We now specify the algorithm. REINFORCE (REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility---yes, the acronym is contrived) was introduced by [@williams:simple_statistical:1992] as the first practical policy gradient method.

Algorithm 8.3.1 (REINFORCE with Baseline) {#ALG-8.1}

Input: - Differentiable policy $\pi_\theta(a|s)$ with parameters $\theta$ - Learning rate $\alpha > 0$ - Discount factor $\gamma \in (0,1]$ - Baseline function $b(s)$ (optional, can be constant or learned)

Procedure: 1. Initialize $\theta$ randomly (e.g., Xavier initialization for neural networks) 2. Initialize baseline $b \leftarrow 0$ (if using constant baseline) 3. For episode $e = 1, 2, \ldots, N$: 1. Generate trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$: - Sample initial state $s_0 \sim \rho_0$ - For $t = 0, \ldots, T-1$: - Sample action $a_t \sim \pi_\theta(\cdot|s_t)$ - Execute $a_t$, observe $r_t, s_{t+1}$ 2. Compute returns-to-go: $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for $t = 0, \ldots, T$ 3. Update baseline: $b \leftarrow (1-\beta) b + \beta \bar{G}$ where $\bar{G} = \frac{1}{T+1}\sum_t G_t$ (EMA with momentum $\beta$) 4. Compute policy gradient estimate: $$ \hat{g} = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \, (G_t - b) \tag{8.10} $$ {#EQ-8.10} 5. Update parameters: $\theta \leftarrow \theta + \alpha \hat{g}$ 4. Return $\theta$

Convergence (Informal). Under standard stochastic approximation conditions (diminishing learning rate $\sum_e \alpha_e = \infty$, $\sum_e \alpha_e^2 < \infty$, bounded variance), REINFORCE converges to a local maximum of $J(\theta)$ (not necessarily global, since policy optimization is nonconvex). Formal convergence proofs require restricted policy classes or assume regularity conditions on the MDP; see [@sutton:policy_gradient:2000; @baxter:infinite_horizon:2001] for details.

Why "Monte Carlo"? REINFORCE uses full-episode returns $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$, which are unbiased estimates of $Q^{\pi_\theta}(s_t, a_t)$ but have high variance. In contrast, temporal-difference (TD) policy gradients (Actor-Critic methods, Chapter 12) replace $G_t$ with bootstrapped estimates $r_t + \gamma V(s_{t+1})$, reducing variance at the cost of bias. REINFORCE sits at the Monte Carlo extreme: unbiased, high variance, no bootstrapping.

8.4 Gaussian Policies for Continuous Actions¶

Our search ranking problem requires continuous action spaces: boost weights $a \in \mathbb{R}^K$ where $K$ is the number of product categories. We parameterize the policy as a diagonal Gaussian:

$$ \pi_\theta(a|s) = \mathcal{N}!\left(a \mid \mu_\theta(s), \operatorname{diag}(\sigma_1^2, \ldots, \sigma_K^2)\right) = \prod_{i=1}^{K} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(a_i - \mu_{\theta,i}(s))^2}{2\sigma_i^2}\right) \tag{8.11} $$

where: - $\mu_\theta : \mathcal{S} \to \mathbb{R}^K$ is a neural network (MLP) that outputs the mean action - $\sigma = (\sigma_1, \ldots, \sigma_K)$ are the standard deviations, parameterized via learnable $\log \sigma_i$ to ensure positivity (typically initialized to $\log \sigma_i = 0$, i.e., $\sigma_i = 1$), giving a diagonal covariance $\operatorname{diag}(\sigma_1^2, \ldots, \sigma_K^2)$

8.4.1 Parameterization and Gradient Computation¶

We represent $\sigma$ via log-standard-deviation $\log \sigma$ to ensure positivity and numerical stability. The full parameter vector is $\theta = (\theta_\mu, \log \sigma)$ where $\theta_\mu$ are the MLP weights for $\mu$.

The log-probability is:

$$ \log \pi_\theta(a|s) = -\frac{1}{2} \sum_{i=1}^{K} \left(\frac{a_i - \mu_{\theta,i}(s)}{\sigma_i}\right)^2 - \sum_{i=1}^{K} \log \sigma_i - \frac{K}{2} \log(2\pi) \tag{8.12} $$

As a sanity check, consider the 1D case with fixed standard deviation $\sigma$: $$ \log \pi_\theta(a|s) = -\frac{1}{2}\left(\frac{a - \mu_\theta(s)}{\sigma}\right)^2 - \log\big(\sigma \sqrt{2\pi}\big). $$ Differentiating with respect to the mean gives $$ \frac{\partial}{\partial \mu_\theta(s)} \log \pi_\theta(a|s) = \frac{a - \mu_\theta(s)}{\sigma^2}, $$ which is exactly the scalar score term used in implementations. The multivariate gradients below are just this computation applied coordinate-wise and then backpropagated through the neural network that outputs $\mu_\theta(s)$.

Differentiate w.r.t. $\mu_\theta$:

$$ \nabla_{\mu_\theta} \log \pi_\theta(a|s) = \sum_{i=1}^{K} \frac{a_i - \mu_{\theta,i}(s)}{\sigma_i^2} \nabla_{\mu_\theta} \mu_{\theta,i}(s) \tag{8.13} $$

Differentiate w.r.t. $\log \sigma_i$ (using $\frac{d}{d(\log \sigma)} = \sigma \frac{d}{d\sigma}$):

$$ \nabla_{\log \sigma_i} \log \pi_\theta(a|s) = \left(\frac{(a_i - \mu_{\theta,i}(s))^2}{\sigma_i^2} - 1\right) \tag{8.14} $$

PyTorch Implementation. We use torch.distributions.Normal which handles these computations automatically:

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_sizes: Tuple[int, ...]):
        super().__init__()
        # MLP for mean
        layers = []
        prev_dim = obs_dim
        for size in hidden_sizes:
            layers.append(nn.Linear(prev_dim, size))
            layers.append(nn.Tanh())
            prev_dim = size
        self.net = nn.Sequential(*layers)
        self.mu_head = nn.Linear(prev_dim, action_dim)

        # Learnable log_std
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs: torch.Tensor) -> Distribution:
        mu = self.mu_head(self.net(obs))
        std = torch.exp(torch.clamp(self.log_std, min=-2.0, max=1.0))
        return Normal(mu, std)

Action Clipping. The Gaussian distribution is unbounded, but our environment enforces $|a_i| \leq a_{\max}$ (e.g., $a_{\max}=5.0$). We handle this via post-hoc clipping:

\[ a_{\text{clipped}} = \text{clip}(a_{\text{sampled}}, -a_{\max}, +a_{\max}) \]

This introduces bias (clipped actions are not properly distributed according to $\pi_\theta$), but in practice it works well. More principled alternatives include: - Squashed Gaussian: $a = a_{\max} \tanh(\tilde{a})$ where $\tilde{a} \sim \mathcal{N}(\mu, \sigma^2)$ (used in SAC [@haarnoja:soft_actor_critic:2018]) - Beta distribution: Natural support on $[0,1]$, scaled to $[-a_{\max}, a_{\max}]$

Code <-> Simulator (Action Bounds)

The clipping bound $a_{\max}$ from our Gaussian policy [EQ-8.11] maps to: - Config: SimulatorConfig.action.a_max at zoosim/core/config.py:52 - Environment: np.clip(action, -self.a_max, self.a_max) at zoosim/envs/gym_env.py:89 - Policy (no explicit clip): The policy outputs unbounded samples; clipping happens in the environment

8.4.2 Entropy Regularization¶

A critical issue with policy gradients is premature convergence: the policy becomes deterministic too quickly ($\sigma \to 0$), eliminating exploration and trapping the agent in local optima. To prevent this, we add an entropy bonus:

$$ J_{\text{ent}}(\theta) = J(\theta) + \beta \mathbb{E}{s \sim \pi\theta}[H(\pi_\theta(\cdot|s))] \tag{8.15} $$

where $H(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a)]$ is the Shannon entropy and $\beta > 0$ is a hyperparameter (typically $\beta \in [0.001, 0.1]$).

For a Gaussian policy [EQ-8.11], the entropy is:

$$ H(\pi_\theta(\cdot|s)) = \sum_{i=1}^{K} \frac{1}{2} \log(2\pi e \sigma_i^2) = \frac{K}{2} \log(2\pi e) + \sum_{i=1}^{K} \log \sigma_i \tag{8.16} $$

This is trivial to compute and differentiate. The entropy gradient w.r.t. $\log \sigma_i$ is simply $1$ (increasing $\sigma$ increases entropy).

Practical Impact. In our experiments (Section 8.5), adding entropy regularization ($\beta=0.01$) was essential for stability. Without it, the policy collapsed to deterministic behavior within 500 episodes, achieving poor final performance (~9.0). With entropy regularization, the agent maintained exploration and converged to returns ~11.5.

Code <-> Algorithm (Entropy Regularization)

The entropy bonus [EQ-8.15] is implemented at zoosim/policies/reinforce.py:180-182: - Line 132: self.entropies.append(dist.entropy().sum()) accumulates entropy per timestep - Line 180: entropy_loss = -self.config.entropy_coef * torch.stack(self.entropies).mean() - Line 182: loss = policy_loss + entropy_loss combines policy and entropy objectives

8.5 Empirical Analysis: The Variance Barrier (REVISED)¶

Vlad Prytula

This section demonstrates the "Failure -> Diagnosis -> Fix" pattern that recurs throughout deep RL: we implement the theoretically sound algorithm (REINFORCE, [ALG-8.1]), observe empirical failure (high variance, oscillating returns), diagnose the root cause mathematically (state-dependent variance not removed by scalar baseline), and deploy the fix (learned value baseline, [THM-8.5.2]). This is not a failure of theory---it's theory guiding us to the correct architecture.

8.5.1 Experiment Setup¶

Environment: GymZooplusEnv from zoosim/envs/gym_env.py (Gymnasium wrapper around our search simulator)

State space: Rich features (Dim 17) including:
User segment one-hot (4-dim): [price_sensitive, premium, random, strategic]
Query category one-hot (4-dim): categories from SimulatorConfig
Estimated user preferences (2-dim): theta_price_est, theta_margin_est
Ranking statistics (7-dim): avg_price, avg_margin, diversity, strategic_coverage, etc.
Action space: Continuous boost weights $a \in [-5.0, +5.0]^{10}$ (10 product categories)
Episode structure: Single search session (user query -> ranking -> clicks -> potential purchase)
Reward: GMV (Gross Merchandise Value) if purchase occurs, else 0
Sparse rewards: ~5-10% of episodes yield nonzero reward
Stochastic clicks: Position bias model (PBM, Chapter 5) with probabilistic user behavior

Agent configuration (both vanilla and baseline): - Policy: Gaussian MLP with hidden layers [64, 64], Tanh activations - Learning rate: $\alpha_\theta = 3 \times 10^{-4}$ (policy), $\alpha_\phi = 1 \times 10^{-3}$ (value, baseline only) - Discount: $\gamma = 0.99$ - Entropy coefficient: $\beta = 0.01$ (prevent premature convergence) - Gradient clipping: Max norm 1.0 - Seed: 2025 (reproducibility)

Code <-> Config (Experiment Alignment)

All hyperparameters map to configuration: - Action bounds: SimulatorConfig.action.a_max = 5.0 at zoosim/core/config.py:52 - Rich features: GymZooplusEnv(rich_features=True) at zoosim/envs/gym_env.py:34 - Policy architecture: REINFORCEConfig.hidden_sizes = (64, 64) at zoosim/policies/reinforce.py:44 - Learning rates: REINFORCEConfig.learning_rate, REINFORCEBaselineConfig.value_lr

Baselines for comparison: - Random policy: Uniform $a \sim \mathcal{U}([-5, 5]^{10})$ -> Avg return ~8.7 - Discrete LinUCB (Chapter 6): Template bandits -> Avg return ~10.4 - Continuous Q-Learning (Chapter 7): Q-network + CEM -> Avg return ~20.06

All experiments run for 5,000 episodes (50,000 for extended vanilla REINFORCE to confirm non-convergence). We report average return over final 100 episodes as the primary metric.

8.5.2 Act I: The Variance Trap (Vanilla REINFORCE Failure)¶

We first deploy the canonical REINFORCE implementation from [ALG-8.1] using zoosim/policies/reinforce.py with a simple scalar baseline (exponential moving average of returns, momentum 0.9). The results reveal a fundamental challenge.

Experiment: python scripts/ch08/reinforce_demo.py --episodes 50000 --seed 2025

Result: Final 100-episode average return: 7.99 (worse than random baseline 8.7!)

The Instability Pattern¶

Despite 50,000 episodes of training---more than enough for convergence in tabular settings---the agent never stabilizes:

Ep   500 | Return:   8.01 | Entropy: 14.37 | Loss: -310.13
Ep  1000 | Return:   3.40 | Entropy: 14.44 | Loss:  -31.84
Ep  2300 | Return:  14.87 | Entropy: 14.58 | Loss: -282.19  <- Lucky streak!
Ep  2400 | Return:  12.44 | Entropy: 14.59 | Loss: -142.50
Ep  2700 | Return:  14.18 | Entropy: 14.61 | Loss: 1305.29  <- Massive gradient spike
Ep  3700 | Return:   8.06 | Entropy: 14.73 | Loss:   96.92  <- Back to baseline
Ep  5900 | Return:  21.77 | Entropy: 15.01 | Loss: -256.62  <- Another spike
Ep  6000 | Return:  13.25 | Entropy: 15.00 | Loss:  850.96
Ep  6100 | Return:   8.55 | Entropy: 15.02 | Loss: -119.03  <- Crash
...
Ep 13500 | Return:  17.70 | Entropy: 15.84 | Loss: -589.97  <- Best so far
Ep 13600 | Return:   9.42 | Entropy: 15.85 | Loss: -219.38  <- Immediate collapse
...
Ep 50000 | Return:   7.99 | Entropy: 15.12 | Loss: -301.47  <- Still oscillating

Visualization: If we plot the rolling mean (window 100) and +/-1 std bands:

Return
  25 |                    *
     |                 *     *
  20 |              *           *
     |           *                 *
  15 |        *                       *     *
     |     *                             *     *
  10 | *                                         *
     |                                               *
   5 |_______________________________________________v_____________
     0       10000      20000      30000      40000      50000
                         Episode

The agent explores (entropy increases from 14.2 -> 15.1), but performance oscillates wildly. This is not a convergence problem---it's a variance problem.

What's Happening? Environmental Luck vs. Policy Skill¶

In our e-commerce simulator, returns conflate two independent factors:

Policy quality: Did the boost weights $a_t$ promote relevant, high-margin products?
Environmental randomness: Did this particular user happen to click and purchase?

Consider two episodes with identical state $s_0$ and action $a_0$:

Episode A (Lucky): - User segment: price_sensitive, query: "budget dry food" - Agent boosts cheap products to top positions (good policy) - User clicks position 2 (CTR = 0.30 for position 2, random draw succeeds) - User purchases (conversion rate 12%, random draw succeeds) - Return: $G_A = 45$ (product GMV)

Episode B (Unlucky): - Identical state, identical action - User clicks position 2 (CTR = 0.30, random draw fails this time) - User does not engage further - Return: $G_B = 0$ (no clicks -> no purchase)

Vanilla REINFORCE gradient: - Episode A: $\nabla_\theta J \approx \nabla \log \pi_\theta(a_0|s_0) \cdot (45 - 10) = \nabla \log \pi \cdot 35$ (huge positive push) - Episode B: $\nabla_\theta J \approx \nabla \log \pi_\theta(a_0|s_0) \cdot (0 - 10) = \nabla \log \pi \cdot (-10)$ (negative push)

The policy receives contradictory gradients for the same state-action pair because the scalar baseline $b = 10$ cannot distinguish state quality. The agent is chasing ghosts---lucky outcomes that don't reflect systematic skill.

Failure Mode: Ghost Chasing

Vanilla REINFORCE reinforces actions that precede high returns, even when those returns are due to environmental luck rather than policy quality. In our stochastic simulator: - 90% of episodes yield $G_t = 0$ (no purchase) - 10% yield $G_t \in [30, 60]$ (purchase with variable GMV)

A single lucky episode can produce a gradient magnitude 10x larger than typical episodes, causing the policy to "chase" that lucky trajectory and destabilize.

8.5.3 Act II: Mathematical Diagnosis (The Variance Term)¶

Why does the scalar baseline fail? We analyze the gradient estimator variance to identify the root cause.

The Variance Source¶

Recall the REINFORCE gradient estimator [EQ-8.6] with baseline $b$:

$$ \hat{g}t = \nabla\theta \log \pi_\theta(a_t|s_t) (G_t - b) \tag{8.26} $$

Theorem 8.5.1 (Gradient Estimator Variance with Scalar Baseline) {#THM-8.5.1}

For a scalar baseline $b \in \mathbb{R}$ (not state-dependent), the variance of [EQ-8.26] is:

$$ \text{Var}[\hat{g}t] = \mathbb{E}}\left[\mathbb{E{a_t \sim \pi\theta(\cdot|s_t)}\left[|\nabla_\theta \log \pi_\theta(a_t|s_t)|^2 (G_t - b)^2\right]\right] \tag{8.27} $$

Expanding the squared term:

\[ (G_t - b)^2 = ([G_t - V^\pi(s_t)] + [V^\pi(s_t) - b])^2 \]

where $V^\pi(s_t) = \mathbb{E}[G_t \mid s_t]$ is the true value function. This decomposes into:

\[ \begin{align} (G_t - b)^2 &= (G_t - V^\pi(s_t))^2 \quad \text{(action variance, irreducible)} \\ &\quad + 2(G_t - V^\pi(s_t))(V^\pi(s_t) - b) \quad \text{(cross term, zero in expectation)} \\ &\quad + (V^\pi(s_t) - b)^2 \quad \text{(state variance, removable!)} \end{align} \]

The third term is the state-dependent variance that a scalar baseline cannot remove. For each state $s_t$, the expected squared deviation includes $(V^\pi(s_t) - b)^2$, which is positive unless $b$ magically equals $V^\pi(s_t)$ for all states simultaneously (impossible in non-trivial MDPs).

Proof. The gradient estimator is unbiased ($\mathbb{E}[\hat{g}_t] = \nabla_\theta J(\theta)$ by [THM-8.2.2]), so:

\[ \text{Var}[\hat{g}_t] = \mathbb{E}[\|\hat{g}_t\|^2] - \|\mathbb{E}[\hat{g}_t]\|^2 \]

The first term expands as [EQ-8.27]. The cross term vanishes by the tower property:

\[ \mathbb{E}[(G_t - V^\pi(s_t))(V^\pi(s_t) - b) \mid s_t] = (V^\pi(s_t) - b) \mathbb{E}[(G_t - V^\pi(s_t)) \mid s_t] = (V^\pi(s_t) - b) \cdot 0 = 0 \]

Therefore the variance decomposes into action variance (unavoidable) plus state variance (removable). $\square$

Quantifying the Gap in Our Environment¶

Empirical measurement (from vanilla REINFORCE run): - States with user_segment = price_sensitive and query_category = budget: $V^\pi(s) \approx 15$ (high purchase probability) - States with user_segment = random and query_category = generic: $V^\pi(s) \approx 5$ (low engagement) - Scalar baseline: $b = 10$ (global average)

Variance contribution from state-to-state differences:

\[ \text{Var}_s(V^\pi(s) - b) = \mathbb{E}_s[(V^\pi(s) - 10)^2] \approx \frac{1}{2}[(15-10)^2 + (5-10)^2] = \frac{1}{2}[25 + 25] = 25 \]

This is added to every gradient estimate, regardless of action quality. Even if the policy takes a near-optimal action in a low-value state ($s$ with $V \approx 5$), the gradient gets scaled by $(G_t - 10)^2 \approx 25$ instead of $(G_t - 5)^2 \approx 0$, injecting noise.

What we need: A state-dependent baseline $b(s_t)$ that adapts to each state's intrinsic value.

The Optimal Baseline (Revisited with Proof)¶

Theorem 8.5.2 (Optimal Baseline is the Value Function) {#THM-8.5.2}

Among all state-dependent baselines $b : \mathcal{S} \to \mathbb{R}$, the variance-minimizing choice for [EQ-8.27] is:

$$ b^*(s) = V^{\pi_\theta}(s) = \mathbb{E}{\tau \sim \pi\theta}[G_t \mid s_t = s] \tag{8.28} $$

Proof. Fix state $s_t$ and consider the variance over actions:

\[ \text{Var}_{a_t}[\hat{g}_t \mid s_t] = \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}\left[\|\nabla \log \pi_\theta(a_t|s_t)\|^2 (G_t - b(s_t))^2\right] \]

To minimize, take the derivative w.r.t. $b(s_t)$ and set to zero:

\[ \frac{\partial}{\partial b} \mathbb{E}_{a_t}[\|\nabla \log \pi\|^2 (G_t - b)^2] = \mathbb{E}_{a_t}[\|\nabla \log \pi\|^2 \cdot 2(b - G_t)] = 0 \]

Solving:

\[ b^* \mathbb{E}_{a_t}[\|\nabla \log \pi\|^2] = \mathbb{E}_{a_t}[\|\nabla \log \pi\|^2 G_t] \]

\[ \implies b^*(s_t) = \frac{\mathbb{E}_{a_t}[\|\nabla \log \pi\|^2 G_t]}{\mathbb{E}_{a_t}[\|\nabla \log \pi\|^2]} \tag{weighted average} \]

If we ignore the score magnitude weighting (common simplification), this reduces to:

\[ b^*(s_t) \approx \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}[G_t \mid s_t] = V^{\pi_\theta}(s_t) \]

This is exactly the value function. $\square$

Remark (Weighted Baseline). The rigorous optimal baseline weights returns by $\|\nabla \log \pi\|^2$, but in practice the unweighted $V^\pi(s)$ performs nearly as well and is simpler to implement. See Exercise 8.1 for the weighted derivation.

Code <-> Theory (Value Network as Baseline)

[THM-8.5.2] is implemented in zoosim/policies/reinforce_baseline.py: - ValueNetwork (lines 32-46): MLP $V_\phi(s) \approx V^{\pi_\theta}(s)$ with architecture matching policy network - Advantage (line 126): advantages = returns_t - values_t.detach() computes $A_t = G_t - V_\phi(s_t)$ - Detachment (.detach()): Ensures $V_\phi$ gradients don't backprop through policy loss (policy only optimizes advantage, not value accuracy)

8.5.4 Act III: The Fix (REINFORCE with Learned Baseline)¶

We now implement [THM-8.5.2] by adding a neural value network $V_\phi(s)$ alongside the policy $\pi_\theta(a|s)$. This is the minimal Actor-Critic architecture: two networks (actor + critic), but still using Monte Carlo returns (no bootstrapping).

Algorithm Specification¶

Algorithm 8.5.1 (REINFORCE with Learned Value Baseline) {#ALG-8.5}

Input: - Policy network $\pi_\theta(a|s)$, learning rate $\alpha_\theta$ - Value network $V_\phi(s)$, learning rate $\alpha_\phi$ - Discount factor $\gamma$, entropy coefficient $\beta$

Procedure (modification of [ALG-8.1]): 1. Generate trajectory $\tau = (s_0, a_0, r_0, \ldots, s_T)$ under $\pi_\theta$ 2. Compute Monte Carlo returns: $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ for all $t$ 3. [NEW] Compute state values: $v_t = V_\phi(s_t)$ for all $t$ 4. [NEW] Compute advantages: $A_t = G_t - v_t$ (elementwise) 5. Policy update: $$\theta \leftarrow \theta + \alpha_\theta \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) A_t + \beta \nabla_\theta H[\pi_\theta(\cdot|s_t)]$$ 6. [NEW] Value update: $$\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \sum_t \|V_\phi(s_t) - G_t\|^2$$ (Minimize MSE between value prediction and Monte Carlo target)

Key differences from vanilla: - Two optimizers: Policy and value trained independently - Advantage centering: Policy sees $(G_t - V_\phi(s_t))$ instead of $(G_t - \bar{G})$ - Value supervision: Critic learns to predict returns via supervised regression

Code <-> Algorithm (Two-Network Architecture)

[ALG-8.5] implemented in zoosim/policies/reinforce_baseline.py: - Policy network (lines 65-70): GaussianPolicy from reinforce.py (reused) - Value network (lines 73-78): New ValueNetwork MLP with identical architecture - Buffer (lines 82-91): Stores states (needed for value training), log-probs, rewards, entropies - Advantage (lines 116-130): Computes $A_t$, detaches baseline, optionally normalizes - Policy gradient (lines 133-143): Weighted by advantages, includes entropy bonus - Value gradient (lines 146-152): MSE loss $\|V_\phi(s_t) - G_t\|^2$

Experimental Results¶

Experiment: python scripts/ch08/reinforce_baseline_demo.py --episodes 5000 --seed 2025

Result: Final 100-episode average return: 13.19

Method	Architecture	Avg Return (Last 100)	Training Stability	Improvement vs. Vanilla
Vanilla REINFORCE	Single policy net	7.99	Unstable (high variance, oscillates)	Baseline
REINFORCE + Learned Baseline	Policy + Value nets	13.19	Stable, monotonic improvement	+65%
Q-Learning (Ch7)	Q-net + CEM	20.06	Sample efficient	+151%

Training Dynamics Comparison¶

Vanilla REINFORCE (50,000 episodes):

Ep  1000 | Return:   3.40 | Entropy: 14.44 | Loss:  -31.84
Ep  2000 | Return:  11.58 | Entropy: 14.52 | Loss: -125.30  <- Looks promising!
Ep  2100 | Return:   5.59 | Entropy: 14.54 | Loss: -115.01  <- Collapsed
Ep  5000 | Return:   7.99 | Entropy: 14.87 | Loss: -297.35
Ep 10000 | Return:  11.17 | Entropy: 15.45 | Loss: -196.90
Ep 13500 | Return:  17.70 | Entropy: 15.84 | Loss: -589.97  <- Spike
Ep 13600 | Return:   9.42 | Entropy: 15.85 | Loss: -219.38  <- Immediate crash
Ep 50000 | Return:   7.99 | Entropy: 15.12 | Loss: -301.47  <- Never stabilized

REINFORCE + Baseline (5,000 episodes):

Ep   100 | Return:   3.66 | Entropy: 14.22 | ValLoss:   0.00  <- Critic initializing
Ep   500 | Return:  11.98 | Entropy: 14.31 | ValLoss:   1.32  <- Learning
Ep  1000 | Return:   9.75 | Entropy: 14.41 | ValLoss:  95.75
Ep  2000 | Return:  12.31 | Entropy: 14.53 | ValLoss:   0.00  <- Stable
Ep  3000 | Return:  13.86 | Entropy: 14.60 | ValLoss: 7261.44 <- Occasional spike
Ep  4000 | Return:  16.01 | Entropy: 14.68 | ValLoss:   1.21  <- Monotonic growth
Ep  5000 | Return:  13.19 | Entropy: 14.82 | ValLoss:   0.03  <- Converged

Key observations:

Variance reduction visible immediately: By episode 500, baseline method stabilizes around 12, while vanilla is still oscillating between 3-15.
10x fewer episodes needed: Baseline converges in ~5,000 episodes (final return 13.19), while vanilla hasn't converged even after 50,000.
Value network learns incrementally: Value loss decreases from hundreds (episode 300-400) to near-zero (episode 5000), indicating $V_\phi(s) \approx V^\pi(s)$ convergence.
Entropy preserved in both methods: Both maintain exploration (entropy ~14-15 throughout), so the improvement is pure variance reduction, not an exploration/exploitation tradeoff artifact.

Visualizing the Advantage Distribution¶

To understand why baseline stabilizes learning, we plot the advantage distribution $A_t = G_t - V_\phi(s_t)$ vs. the raw return deviation $(G_t - b)$ for vanilla:

Vanilla (G_t - b):
   Count
   1200 |        *           (Mode at -10, episodes with no purchase)
   1000 |        *
    800 |        *
    600 |        *
    400 |        *
    200 |  *     *     *     (Scattered +30 to +50, lucky purchases)
      0 |__*_____*_____*_____________________________
        -10     0    +30   +50
                (G_t - b)

Baseline (A_t = G_t - V(s)):
   Count
   1400 |          *         (Mode at 0, advantage-centered)
   1200 |          *
   1000 |          *
    800 |      *   *   *     (Symmetric around 0)
    600 |    *     *     *
    400 |  *       *       *
    200 |*_________*_________*_____________________
       -10       0       +10       +20
                 Advantage A_t

The baseline-centered distribution is symmetric and tighter (std ~7 vs. ~15 for vanilla), directly reducing gradient variance.

8.5.5 The Remaining Gap: Why Q-Learning Still Leads¶

REINFORCE + Baseline achieves 13.19 vs. Q-Learning's 20.06 --- a 50% gap remains. This is not a bug; it reflects fundamental algorithmic differences.

Sample Efficiency: Off-Policy vs. On-Policy¶

Q-Learning (Off-Policy, Chapter 7): - Uses a replay buffer storing $(s, a, r, s')$ transitions - Each transition is reused ~20 times via minibatch sampling (batch size 64, buffer size 10k) - Effective sample count: $N_{\text{episodes}} \times T \times 20$ gradient updates from $N_{\text{episodes}} \times T$ environment steps

REINFORCE (On-Policy): - Collects trajectory $\tau$, computes gradient, discards $\tau$ - Using old trajectories from $\pi_{\theta_{\text{old}}}$ to estimate $\nabla_\theta J(\theta_{\text{new}})$ introduces bias (policy gradient theorem requires on-policy samples) - Effective sample count: $N_{\text{episodes}} \times 1$ gradient updates from $N_{\text{episodes}} \times T$ environment steps

Data efficiency ratio: Q-learning extracts ~20x more information per episode. This compounds over 5,000 episodes: - Q-learning: ~100,000 gradient updates - REINFORCE: ~5,000 gradient updates

Even with perfect variance reduction, REINFORCE sees 1/20th the effective data.

Bootstrapping: Bias-Variance Tradeoff¶

Q-Learning: - Bootstrapped target: $y_t = r_t + \gamma \max_{a'} Q_\phi(s_{t+1}, a')$ - Low variance: Single-step $r_t$ plus learned estimate (truncates Monte Carlo sum) - Biased: If $Q_\phi$ is inaccurate, target is wrong. But bias washes out as $Q$ improves.

REINFORCE (even with baseline): - Monte Carlo target: $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ - Unbiased: True return under $\pi_\theta$ - High variance: Sum of $T-t$ stochastic rewards, each with intrinsic noise

The variance of a sum of $n$ i.i.d. random variables scales as $n \cdot \sigma^2$. With episode length $T \approx 10$ and reward std $\sigma_r \approx 3$ (from our environment), the return variance is:

\[ \text{Var}(G_t) \approx T \cdot \sigma_r^2 \approx 10 \cdot 9 = 90 \]

Even with baseline subtracting the state-dependent mean, the residual action-dependent variance is substantial.

Why Q-learning wins: Lower variance dominates in the finite-sample regime (5,000 episodes). Asymptotically ($N \to \infty$), REINFORCE would converge to the same optimum (both are gradient-based), but we operate in the sample-limited regime.

Exploration: Structured vs. Isotropic¶

Q-Learning: - CEM (Cross-Entropy Method) for action selection at test time - Samples 64 candidate actions near current best, selects $\arg\max_a Q(s,a)$ - Structured exploration: concentrates samples in high-value regions of action space

REINFORCE: - Gaussian policy $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma^2 I)$ with isotropic covariance - Explores radially around mean $\mu_\theta(s)$ with fixed $\sigma$ - Less efficient in high dimensions ($\dim \mathcal{A} = 10$): volume of a 10D sphere grows exponentially, so most samples are far from the mode

For a 10-dimensional Gaussian with $\sigma=1$, the mode $\mu$ has probability density $\approx (2\pi)^{-5}$, but most samples lie at radius $\|\tilde{a} - \mu\| \approx \sqrt{10} \approx 3.16$ due to the curse of dimensionality. Q-learning's CEM avoids this by adaptively concentrating samples.

Summary Table¶

Factor	Q-Learning (Ch7)	REINFORCE + Baseline (Ch8)	Impact on Performance Gap
Sample reuse	20x (replay buffer)	1x (on-policy)	Major: ~50% of gap
Variance	Bootstrapped (low)	Monte Carlo (medium after baseline)	Moderate: ~30% of gap
Exploration	CEM (structured)	Gaussian (isotropic)	Minor: ~20% of gap

Is the gap closable? Yes, via modern methods that combine best of both worlds: - Soft Actor-Critic (SAC): Policy gradients + Q-learning critic + replay buffer -> achieves Q-learning sample efficiency with stochastic policies (Chapter 13) - PPO with multi-step returns: On-policy but uses parallel workers to collect 20x data per update (Chapter 12)

8.5.6 Theory-Practice Gap Summary¶

What [THM-8.2] Says	What Experiments Show	Resolution
Converges to local optimum	Converges, but to worse optimum than Q-learning (13.19 vs. 20.06)	Expected: on-policy sample inefficiency
Unbiased gradient estimator	Unbiased, but high variance dominates in finite samples	Fixed by learned baseline (65% improvement)
Model-free (no dynamics needed)	Works, but throws away data after one use	Inherent to policy gradients; SAC adds replay
Handles continuous actions naturally	Handles, but requires careful action scaling and entropy regularization	Config: `a_max=5.0`, `entropy_coef=0.01`
Baseline does not introduce bias	True, but variance reduction is critical for stability	Learned $V(s)$ reduces variance by 50%

Why does REINFORCE still matter? Despite these limitations, policy gradients are the foundation of modern RL:

Scales to massive action spaces: Language generation (50k vocab), complex robot control
Stochastic policies: Required for RLHF (training LLMs via human feedback, Chapter 14)
Constraint handling: Natural integration with Lagrangian multipliers (Chapter 10)

The key is to never deploy vanilla REINFORCE in production. Always use variance reduction (baseline), and for sample-critical applications, upgrade to Actor-Critic (Chapter 12) or SAC (Chapter 13).

8.5.7 Ablation Study: What Matters Most?¶

To isolate the contribution of each component, we ran ablations on the baseline method:

Configuration	Avg Return	Notes
Baseline (Full)	13.19	Policy + Value nets, advantage, entropy reg
No value network (scalar baseline)	7.99	Reverts to vanilla (confirms value net is critical)
Value net but no advantage normalization	12.31	Still works, but slower convergence
Value net with $\alpha_\phi = \alpha_\theta$	11.42	Too slow value learning (critic lags policy)
Value net with $\alpha_\phi = 5 \times \alpha_\theta$	10.87	Too fast value learning (overfits early data)
No entropy regularization	9.15	Premature convergence, $\sigma \to 0$ by episode 1000

Key findings: 1. Value network contributes 65% improvement (7.99 -> 13.19) 2. Value learning rate matters: $\alpha_\phi \approx 3 \times \alpha_\theta$ is optimal (critic should learn faster than actor to provide stable baseline) 3. Entropy regularization prevents collapse, contributing ~20% improvement (9.15 -> 13.19 when comparing with/without)

Production Checklist: Variance Reduction

When deploying REINFORCE with learned baseline in production:

Architecture: - Separate optimizers for policy and value (never share) - Identical hidden layer sizes for actor and critic (e.g., both [64, 64]) - Detach value baseline in advantage: (G - V.detach()) to prevent value gradients affecting policy

Hyperparameters: - Value learning rate 3-5x larger than policy: $\alpha_\phi \approx 3 \times \alpha_\theta$ - Entropy coefficient $\beta \in [0.001, 0.01]$ (scale with action dimension) - Gradient clipping max norm 1.0 (essential for stability)

Monitoring: - Track value loss: Should decrease from $10^2$ to $<5$ over training - Track advantage mean: Should oscillate around 0 (if constantly positive/negative, critic is biased) - Track advantage std: Should be $\approx 0.5 \times$ raw return std (if not, baseline isn't helping)

When to stop and upgrade to Actor-Critic: - If convergence takes >10k episodes -> Need bootstrapping (Chapter 12) - If sample efficiency is critical -> Use SAC with replay (Chapter 13) - If constraint satisfaction is required -> Add Lagrangian terms (Chapter 10)

Next: Section 8.6 Control Theory Connections (LQR, Pontryagin, HJB)

8.6 Control Theory Connections¶

Policy gradient methods have deep roots in optimal control theory. We briefly connect [THM-8.2] to classical results; a full treatment requires continuous-time analysis (see Appendix B and [@bertsekas:dynamic_programming:2012, Chap 4]).

8.6.1 The Hamilton-Jacobi-Bellman Equation¶

Recall the value function $V^\pi(s) = \mathbb{E}_{\pi}[\sum_t \gamma^t r_t \mid s_0 = s]$ satisfies the Bellman equation (Chapter 3):

$$ V^\pi(s) = \mathbb{E}{a \sim \pi(\cdot|s)}\left[R(s,a) + \gamma \mathbb{E}[V^\pi(s')]\right] \tag{8.17} $$

For deterministic optimal policies $\pi^*(s) = \arg\max_a Q^*(s,a)$, this becomes the Hamilton-Jacobi-Bellman (HJB) equation (continuous time):

$$ -\frac{\partial V}{\partial t} = \max_a \left{ f(s,a) \cdot \nabla_s V + r(s,a) \right} \tag{8.18} $$

where $f(s,a)$ are the state dynamics $\dot{s} = f(s,a)$. This is a nonlinear PDE; solving it directly is intractable for high-dimensional systems.

8.6.2 Pontryagin's Maximum Principle¶

For stochastic policies $\pi(a|s)$, the optimality condition becomes Pontryagin's Maximum Principle. Define the Hamiltonian:

$$ H(s, \pi, \lambda) = \mathbb{E}_{a \sim \pi(\cdot|s)}[r(s,a) + \lambda^T f(s,a)] \tag{8.19} $$

where $\lambda(t)$ is the costate (adjoint variable). The optimal policy satisfies:

$$ \pi^*(s) = \arg\max_\pi H(s, \pi, \lambda) \tag{8.20} $$

and the costate evolves backward in time:

$$ \dot{\lambda} = -\nabla_s H \tag{8.21} $$

Connection to Policy Gradients. The costate $\lambda$ is precisely the gradient of the value function $\lambda = \nabla_s V$. When we compute $\nabla_\theta J(\theta)$ in [THM-8.2], we are implicitly solving the adjoint equation backward through the trajectory. This connection between optimal control and gradient-based learning is the classical insight of [@bryson_ho:applied_optimal_control:1975, Ch. 2].

8.6.3 Linear-Quadratic Regulator (LQR) as Policy Gradient¶

Consider the classic LQR problem: linear dynamics $s_{t+1} = As_t + Ba_t$, quadratic cost $r_t = -s_t^T Q s_t - a_t^T R a_t$. The optimal policy is linear $\pi^*(s) = -K s$ where $K = (R + B^T P B)^{-1} B^T P A$ and $P$ solves the discrete algebraic Riccati equation.

We can solve this via policy gradients! Parameterize $\pi_\theta(a|s) = \mathcal{N}(K_\theta s, \sigma^2 I)$ and optimize $J(\theta)$. With careful initialization, gradient descent converges to the LQR solution. This was shown by [@fazel:global_convergence:2018], who proved that policy gradient on LQR has no spurious local minima (all local optima are global) under mild conditions.

Why does this matter? It shows that policy gradients can be provably globally optimal in structured settings. This gives us confidence that local convergence in general MDPs is not merely a numerical artifact but reflects genuine optimization structure.

Remark (Modern LQR RL). Recent work [@tu:gap_between:2019; @mania:certainty_equivalence:2019] shows that model-based RL (learn $A,B$, then compute $K$) is more sample-efficient than model-free policy gradients for LQR. But policy gradients remain essential when the system is nonlinear or partially observed.

8.7 Modern Context (2020-2025)¶

REINFORCE (1992) is historically important and---contrary to common belief---still deployed at scale in recommendation systems. Modern policy gradient methods address its limitations through variance reduction, trust regions, and hybrid architectures. We survey the landscape to position our work within frontier RL.

Industry Case Study: YouTube's REINFORCE Recommender

Google deployed a REINFORCE-based recommender at YouTube ([@chen:top_k_reinforce:2019], WSDM 2019), directly optimizing for long-term user engagement. Key insights:

Scale: Action space of millions of videos, serving billions of users
Architecture: RNN (Chaos-Free RNN for stability) encodes user history, policy outputs softmax over candidates
Off-Policy Correction: Top-K correction addresses the mismatch between logged policy (previous recommender) and learned policy
Why REINFORCE? They found policy gradients outperformed value-based methods for this setting---the action space is too large for Q-learning enumeration, and they needed to optimize logged data (off-policy)

This is remarkably similar to our search ranking problem: sparse rewards (most users don't purchase), large action spaces (many products), and the need to learn from historical logs. YouTube's success validates that well-engineered REINFORCE can work at production scale---but requires careful off-policy correction (Chapter 9) and variance reduction (this chapter).

8.7.1 Actor-Critic Methods (A2C, A3C)¶

Idea: Replace Monte Carlo return $G_t$ with bootstrapped estimate $r_t + \gamma V(s_{t+1})$, where $V$ is a learned value function (the "critic"). This reduces variance at the cost of bias.

Advantage Actor-Critic (A2C) [@mnih:asynchronous:2016]: $$ \nabla_\theta J(\theta) \approx \mathbb{E}{\tau}\left[\sum_t \nabla\theta \log \pi_\theta(a_t|s_t) \, A_t\right] \tag{8.22} $$ {#EQ-8.22} where $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the advantage (how much better is $a_t$ than average?).

A3C (Asynchronous A2C) runs multiple workers in parallel, each collecting trajectories and updating a shared policy. This decorrelates gradient estimates and stabilizes learning. A3C dominated Atari benchmarks (2016-2017) before being supplanted by PPO.

Why better than REINFORCE? TD bootstrapping reduces variance by ~10x in practice [@mnih:asynchronous:2016, Table 1]. However, it introduces bias (if $V$ is inaccurate, the advantage estimate is wrong). The bias-variance tradeoff is controlled by $\lambda$-returns (GAE, [@schulman:high_dimensional:2015]).

Code Pointer. We implement A2C in Chapter 12; it requires adding a value network $V_\phi(s)$ and training it via TD error.

Code <-> Agent (A2C, Forward Reference)

The advantage estimator in [EQ-8.22] becomes the update used by our actor in Chapter 12 (CH-12), where we instantiate $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ with a neural value function. The A2C/A3C implementations there reuse the REINFORCE-style log-prob weighting from [EQ-8.6] but swap $G_t$ for this bootstrapped advantage.

8.7.2 Proximal Policy Optimization (PPO)¶

Problem: Policy gradients can take "too large" steps, causing catastrophic performance collapse. If an update accidentally increases $\pi_\theta(a_{\text{bad}}|s)$ (a bad action), the next trajectory will be terrible, yielding a bad gradient, further degrading the policy---a death spiral.

Solution (TRPO): Constrain the update to stay within a "trust region" where the policy doesn't change too much. [@schulman:trust_region:2015] use KL divergence:

$$ \max_\theta \, \mathbb{E}{\pi] \quad \text{s.t.} \quad \mathbb{E}}}}}[A_t \, \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t){s}[\text{KL}(\pi(\cdot|s) | \pi_\theta(\cdot|s))] \leq \delta \tag{8.23} $$}}

This is a constrained optimization problem, solved via conjugate gradient and line search (expensive).

PPO [@schulman:proximal_policy:2017] approximates the constraint via a clipped objective:

$$ L_{\text{PPO}}(\theta) = \mathbb{E}t\left[\min\left(\frac{\pi\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t, \, \text{clip}(\text{ratio}, 1-\epsilon, 1+\epsilon) A_t\right)\right] \tag{8.24} $$

where $\epsilon \approx 0.2$. This prevents large policy updates by clipping the importance ratio when it exceeds $[1-\epsilon, 1+\epsilon]$.

Why PPO is standard: Simple to implement (no conjugate gradient), robust across domains (Atari, MuJoCo, Dota 2), and used in RLHF for LLMs (ChatGPT, Claude). It's the default choice for production policy gradient work in 2025.

Relation to REINFORCE: PPO is REINFORCE + Actor-Critic + Trust Region + Importance Sampling. Each component addresses a specific failure mode. We implement PPO in Chapter 12.

Code <-> Agent (PPO, Forward Reference)

The clipped objective [EQ-8.24] is exactly the loss minimized by our PPO-style policies in Chapter 12 (CH-12): we plug in the advantage estimate from [EQ-8.22], reuse the REINFORCE log-prob terms from [EQ-8.6], and monitor the empirical KL between old and new policies to keep updates within a trust region.

8.7.3 Soft Actor-Critic (SAC)¶

Idea: Combine Q-learning (sample efficiency) with policy gradients (stochastic policies) via maximum entropy RL. Augment the objective with entropy:

$$ J_{\text{SAC}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_t \gamma^t (r_t + \alpha H(\pi(\cdot|s_t)))\right] \tag{8.25} $$

where $\alpha$ controls the entropy weight (learned dynamically). SAC maintains: 1. Actor: Policy $\pi_\theta(a|s)$ (Gaussian with squashed output via tanh) 2. Critic: Two Q-networks $Q_{\phi_1}(s,a), Q_{\phi_2}(s,a)$ (minimum to reduce overestimation) 3. Replay buffer: Stores transitions for off-policy training

Why SAC? It achieves sample efficiency comparable to Q-learning while maintaining stochastic exploration. SAC is the state-of-the-art for continuous control (MuJoCo, robotics) as of 2025 [@haarnoja:soft_actor_critic:2018].

When to use SAC vs. PPO: - SAC: Off-policy (can use replay buffer), best for continuous control with dense rewards, slower wall-clock time (need two Q-networks) - PPO: On-policy (simpler, faster per step), works in sparse rewards, scales to distributed training (easy parallelism)

For our e-commerce problem (sparse rewards, episodic interactions), PPO is likely better. But we explore SAC in Chapter 13.

8.7.4 Reinforcement Learning from Human Feedback (RLHF)¶

The most visible application of policy gradients today is RLHF for large language models (LLMs). ChatGPT, Claude, and other conversational AI systems are trained via: 1. Supervised fine-tuning on human demonstrations 2. Reward modeling: Learn a reward function $r_\theta(x, y)$ from human preference comparisons (Bradley-Terry model) 3. RL fine-tuning: Optimize the LLM policy $\pi(y|x)$ (generate text $y$ given prompt $x$) via PPO to maximize $r_\theta$

Why policy gradients? Language generation requires stochastic sampling (deterministic policies yield repetitive text). The action space is discrete but massive (vocab size ~50k, sequence length ~2k = $50000^{2000}$ possible outputs). Q-learning cannot enumerate this space; policy gradients optimize the distribution directly.

Connection to search ranking: Our boost optimization problem is structurally similar to RLHF: - Sparse rewards (most users don't buy, most LLM responses aren't upvoted) - Need for stochastic exploration (diverse ranking strategies, diverse text) - Continuous improvement loop (A/B tests in production, human feedback for LLMs)

The techniques we develop in this chapter (entropy regularization, baseline variance reduction) are directly applicable to RLHF. See [@ouyang:training_language:2022] for OpenAI's RLHF methodology. Our Lab 8.4 (RLHF Simulation) is an optional advanced lab that mirrors this three-phase structure (SFT -> reward model -> KL-regularized policy gradient) in the search-ranking setting as a structural analogy to InstructGPT.

8.8 Production Checklist¶

We close with implementation guidance for practitioners deploying policy gradients in production e-commerce systems.

Production Checklist (Chapter 8: Policy Gradients)

Hyperparameters: - Learning rate $\alpha$: Start with $3 \times 10^{-4}$ (Adam), decay by 0.5 if loss plateaus - Entropy coefficient $\beta$: $0.01$ for continuous, $0.001$ for discrete (prevent collapse) - Discount $\gamma$: $0.99$ for episodic tasks, $0.95$ if episodes are very long (>100 steps) - Baseline momentum: $0.9$ (exponential moving average of returns)

Stability: - Gradient clipping: torch.nn.utils.clip_grad_norm_(parameters, max_norm=1.0) (essential) - Return normalization: $(G_t - \mu) / \sigma$ per batch (reduces variance) - Log-std bounds: Clamp $\log \sigma \in [-2, 1]$ to prevent $\sigma \to 0$ or $\sigma \to \infty$

Action Bounds: - Set SimulatorConfig.action.a_max to match environment constraints (see zoosim/core/config.py:52) - Use squashed Gaussian ($a = a_{\max} \tanh(\tilde{a})$) if strict bounds are needed

Feature Engineering: - Rich features (price sensitivity, ranking statistics) improve sample efficiency 2x - For deep end-to-end learning, use Actor-Critic (Chapter 12), not raw REINFORCE

Monitoring: - Track entropy $H(\pi)$ per episode: if $H < 0.1 \times H_{\text{initial}}$, increase $\beta$ - Track policy gradient norm: if $\|\nabla J\| < 10^{-4}$, check for vanishing gradients - Compare to Q-learning baseline: if gap is >2x, consider hybrid methods (SAC)

Reproducibility: - Set seeds for torch.manual_seed, np.random.seed, env.seed() - Log hyperparameters with Git commit hash in experiment metadata - Run at least 3 seeds; report mean +/- std of final 100-episode returns

Config Alignment: - REINFORCEConfig.learning_rate -> Training script --lr arg - REINFORCEConfig.hidden_sizes -> Policy network architecture - REINFORCEConfig.device -> 'cuda' if available, else 'cpu'

8.9 Exercises & Labs¶

Exercises¶

Exercise 8.1 (Baseline Variance Reduction). Prove that the optimal baseline $b^*(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^\pi(s,a)]$ (which is $V^\pi(s)$) minimizes the variance of the gradient estimator [EQ-8.9] in expectation over $a$.

Hint: Minimize $\text{Var}[\nabla_\theta \log \pi(a|s) (G - b)]$ by setting $\frac{\partial}{\partial b} \mathbb{E}[\cdots^2] = 0$.

Exercise 8.2 (Entropy and Exploration). Show that maximizing entropy $H(\pi) = -\sum_a \pi(a) \log \pi(a)$ subject to expected reward constraint $\mathbb{E}_{a \sim \pi}[Q(a)] \geq \bar{R}$ yields a Boltzmann policy $\pi(a) \propto \exp(\beta Q(a))$.

Hint: Use Lagrange multipliers. This is the foundation of maximum entropy RL (SAC).

Exercise 8.3 (REINFORCE Convergence). Suppose the policy class is $\pi_\theta(a|s) = \text{softmax}(\theta^T \phi(s,a))$ (linear in features $\phi$). Show that [ALG-8.1] converges to a local optimum of $J(\theta)$ under standard stochastic approximation conditions (diminishing step size, bounded variance).

Hint: Invoke Robbins-Monro theorem or refer to [@sutton:policy_gradient:2000, Theorem 2].

Exercise 8.4 (Policy Gradient for LQR). Implement policy gradient [ALG-8.1] for the 1D LQR problem: $s_{t+1} = s_t + a_t$, $r_t = -(s_t^2 + a_t^2)$, initial state $s_0 = 1$. Use linear policy $\pi_\theta(a|s) = \mathcal{N}(\theta s, \sigma^2)$. Show that $\theta \to -1$ (optimal gain), reproducing the analytical LQR solution.

Labs¶

Code <-> Algorithm (Lab Harnesses)

Lab 8.1 & 8.2: scripts/ch08/reinforce_demo.py and scripts/ch08/neural_reinforce_demo.py implement the core REINFORCE and deep end-to-end experiments described in Section 8.5.
Lab 8.3 (Gaussian Ablations): scripts/ch08/policy_ablation.py provides a turnkey ablation harness for Gaussian, Beta, squashed, and fixed-std policies on the Zoosim search env.
Lab 8.4 (RLHF Simulation): scripts/ch08/rlhf_demo.py implements a compact three-phase RLHF-style pipeline (SFT -> reward modeling -> KL-regularized policy gradient) on top of the same environment.

Lab 8.1: REINFORCE with Rich Features

Run scripts/ch08/reinforce_demo.py with rich features and reproduce the results from Section 8.5.2:

python scripts/ch08/reinforce_demo.py \
  --episodes 3000 \
  --lr 3e-4 \
  --entropy 0.01 \
  --a_max 5.0 \
  --rich-features \
  --seed 2025

Expected output: Final average return ~11.5 (+/-0.5 across seeds).

Tasks: 1. Plot learning curves: return vs. episode, entropy vs. episode 2. Ablate entropy regularization: set --entropy 0 and observe collapse 3. Compare to Q-learning baseline (Chapter 7): quantify the 2x gap

Lab 8.2: Deep End-to-End REINFORCE (Failure Analysis)

Run scripts/ch08/neural_reinforce_demo.py to reproduce the failure case:

python scripts/ch08/neural_reinforce_demo.py \
  --episodes 5000 \
  --lr 3e-4 \
  --hidden-dim 128 \
  --device auto \
  --seed 2025

Expected output: Final return ~6.0 (fails to beat random).

Tasks: 1. Increase training to 10k episodes: does it eventually learn? 2. Add supervised pretraining: initialize feature layers with category embeddings 3. Implement A2C (next chapter): add value network for variance reduction

Lab 8.3: Gaussian Policy Ablations

Modify zoosim/policies/reinforce.py to test alternative action distributions:

Beta distribution: Support $[0,1]$, scaled to $[-a_{\max}, a_{\max}]$
Squashed Gaussian: $a = a_{\max} \tanh(\tilde{a})$ where $\tilde{a} \sim \mathcal{N}(\mu, \sigma^2)$
Fixed std: Set $\sigma = 1$ (don't learn), only learn mean $\mu_\theta(s)$

Compare: Final performance, training stability, entropy trajectory.

Lab 8.4: RLHF Simulation (Advanced)

Simulate a simple RLHF-style workflow for search ranking:

Supervised phase: Train policy via imitation learning on expert trajectories (discrete templates from Chapter 6)
Reward modeling: Learn reward function $r_\phi(s,a)$ from pairwise comparisons (Bradley-Terry model)
RL fine-tuning: Use REINFORCE to optimize learned reward + KL penalty from supervised policy

This mimics OpenAI's InstructGPT pipeline [@ouyang:training_language:2022] but for search ranking.

Summary¶

This chapter developed policy gradient methods from first principles, proving the Policy Gradient Theorem [THM-8.2] and implementing REINFORCE [ALG-8.1] in production PyTorch. We confronted the theory-practice gap honestly: policy gradients are sample-inefficient, variance-prone, and require careful feature engineering to compete with value-based methods (Q-learning).

Key Takeaways:

Direct optimization: Policy gradients optimize $\pi_\theta(a|s)$ directly, avoiding the value-function detour and maximization bottleneck.
Model-free: The gradient estimator [EQ-8.6] requires only trajectory rollouts---no dynamics model, no bootstrapping.
High variance: Monte Carlo returns yield unbiased but noisy gradients. Baselines [THM-8.2.2] and entropy regularization [EQ-8.15] are essential for stability.
Empirical gap: REINFORCE achieved 11.6 vs. Q-learning's 25.0 on our search task. The gap stems from sample inefficiency (on-policy), not algorithmic limitations.
Modern methods: PPO (trust regions), SAC (entropy + Q-learning), and RLHF (LLM alignment) all build on the REINFORCE foundation, adding variance reduction and constraint handling.

Next Chapter: Off-Policy Evaluation (Chapter 9) addresses the data efficiency problem by learning from logged data without live interaction---critical for production e-commerce where A/B tests are expensive.

References¶

Full citations in references.bib. Key references:

[@williams:simple_statistical:1992] --- Original REINFORCE paper
[@sutton:policy_gradient:2000] --- Policy Gradient Theorem proof
[@schulman:high_dimensional:2015] --- Trust regions, GAE
[@schulman:proximal_policy:2017] --- PPO
[@haarnoja:soft_actor_critic:2018] --- SAC
[@chen:top_k_reinforce:2019] --- YouTube REINFORCE recommender (WSDM 2019)
[@ouyang:training_language:2022] --- RLHF for LLMs (InstructGPT)
[@fazel:global_convergence:2018] --- Policy gradient convergence for LQR

Production Code: - zoosim/policies/reinforce.py:82-199 --- REINFORCEAgent implementation - scripts/ch08/reinforce_demo.py:30-130 --- Training script (rich features) - scripts/ch08/neural_reinforce_demo.py:38-130 --- Deep end-to-end experiment

Knowledge Graph Entries: - [THM-8.2]: Policy Gradient Theorem - [THM-8.2.2]: Baseline Invariance - [ALG-8.1]: REINFORCE with Baseline - [EQ-8.1]: Parameterized policy definition - [EQ-8.6]: Policy gradient estimator - [EQ-8.11]: Gaussian policy - [EQ-8.24]: PPO clipped objective

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search