Chapter 7 — Continuous Actions via Q(x,a) Regression¶
What We Build in This Chapter¶
In Chapter 6, we treated boost optimization as a discrete contextual bandit problem over a small library of interpretable templates. Chapter 7 removes this crutch: actions become continuous boost vectors \(a \in \mathbb{R}^d\) bounded by \(|a_i| \leq a_{\max}\), and the policy must learn a state–action value function
that prices the trade-offs between GMV, CM2, and strategic exposure.
This chapter introduces:
- Q-function regression \(Q(x,a)\) with neural networks and ensembles.
- Uncertainty via ensemble variance and UCB-style objectives.
- Continuous action optimization with the Cross-Entropy Method (CEM).
- Trust-region constraints to ensure safe exploration.
The architecture—learning a critic \(Q(x,a)\) and optimizing against it—is the foundation of modern continuous RL, from QT-Opt [@kalashnikov:qt_opt:2018] to robotics.
7.0.1 Why Discrete Templates Hit a Ceiling¶
Chapter 6's rich-feature bandits achieved +27% GMV over static baselines—a major win. Thompson Sampling with 17-dimensional features learned to select the right template for each context. But templates impose a fundamental constraint: they can only switch between predefined strategies, not interpolate or discover novel combinations.
Concrete limitation: The "partial discount" problem
Suppose we have two templates from Chapter 6:
- Budget Template: Heavy discount boost (+0.5), negative price boost (-0.3)
- Premium Template: Heavy price boost (+0.5), zero discount boost (0.0)
Now consider a "savvy shopper" user segment that: - Prefers mid-tier products (not budget, not ultra-premium) - Is moderately price-sensitive (\(\theta_{\text{price}} = 0.3\), not extreme) - Occasionally responds to discounts (\(\theta_{\text{discount}} = 0.15\))
Optimal boost for this segment (from oracle analysis): [discount: +0.2, price: +0.1]
But no discrete template achieves this!
- Budget Template: Too aggressive on discount, over-penalizes price → suboptimal
- Premium Template: Ignores the discount signal entirely → misses conversions
- Result: Empirically 8-12% GMV loss vs. oracle for this segment
Quantitative gap across all segments:
- Discrete templates typically achieve around 85–90% of oracle GMV in our Chapter 6 experiments
- The remaining 10–15% requires fine-grained boost tuning
- This is the ceiling we hit with discrete actions
Why can't we just add more templates?
We could try expanding the library from M=8 to M=50 templates to cover more user types, but this creates new problems:
- Interpretability breakdown: M=50 templates are impossible to audit or debug
- Data fragmentation: Each template gets 1/50 of the data → slow learning, high variance
- Combinatorial explosion: With K=10 feature dimensions and just 3 levels per feature (low/medium/high), we'd need \(3^{10} = 59{,}049\) templates to cover the space
- The real bottleneck: Discrete actions can't represent the smooth interpolation that user preferences require
The solution: Learn a continuous policy
This chapter removes the discrete constraint and asks: can we learn \(Q(x,a)\) over continuous \(a \in [-a_{\max}, +a_{\max}]^{d_a}\) and optimize it efficiently? If successful, we can:
- Interpolate between templates (e.g., 70% Premium + 30% Budget)
- Discover boost combinations not in the original library
- Adapt boost intensity to fine-grained user signals
This is the natural next step to close the remaining 10–15% gap to oracle performance.
Roadmap¶
- §7.1 Warm-up: Tabular Q as Regression
- §7.2 Q(x,a) as a supervised regression problem
- §7.3 Neural network ensembles and uncertainty calibration
- §7.4 Cross-Entropy Method (CEM) for \(\operatorname{argmax}_a Q(x,a)\)
- §7.5 Production Implementation: The "Optimizer in the Loop"
- §7.6 Theory-Practice Gap
Code ↔ Agent
- Q-ensemble module: zoosim/policies/q_ensemble.py
- CEM optimizer: zoosim/optimizers/cem.py
- Chapter node: CH-7 in docs/knowledge_graph/graph.yaml
Code ↔ Simulator
- Action bounds and feature definitions: zoosim/core/config.py, zoosim/envs/search_env.py
7.1 Warm-up: Tabular Q as Regression¶
Before tackling continuous actions with neural networks, it's useful to see \(Q(x,a)\) in its simplest form: a finite table indexed by discrete contexts and actions. This warm‑up isolates the idea that Q‑learning is just regression on \((x,a,r)\) triples; everything else in this chapter is function approximation and optimization on top of that.
7.1.1 Minimal Tabular Q-Function¶
Consider a tiny bandit with:
- 3 contexts (e.g., user segments) indexed by \(x \in \{0,1,2\}\)
- 4 actions (e.g., boost templates) indexed by \(a \in \{0,1,2,3\}\)
We can represent \(Q(x,a)\) as a Python dictionary and update it by stochastic gradient descent on the squared error between \(Q\) and observed rewards:
from typing import Dict, Tuple
import numpy as np


class TabularQFunction:
    """Tabular Q-function: Q(x, a) stored in a dictionary.

    Mathematical correspondence: Q: X x A -> R represented as a lookup table.
    """

    def __init__(self, n_contexts: int, n_actions: int,
                 initial_value: float = 0.0):
        """Initialize Q-table with uniform values."""
        self.Q: Dict[Tuple[int, int], float] = {}  # (context_id, action_id) -> Q-value
        self.n_contexts = n_contexts
        self.n_actions = n_actions
        # Initialize all (x, a) pairs
        for x in range(n_contexts):
            for a in range(n_actions):
                self.Q[(x, a)] = initial_value

    def get(self, x: int, a: int) -> float:
        """Retrieve Q(x, a)."""
        return self.Q.get((x, a), 0.0)

    def update(self, x: int, a: int, target: float, lr: float = 0.1):
        """Update Q(x, a) <- (1 - lr) * Q(x, a) + lr * target.

        This implements stochastic gradient descent on loss (Q - target)^2.
        """
        current = self.get(x, a)
        self.Q[(x, a)] = (1 - lr) * current + lr * target

    def get_optimal_action(self, x: int) -> int:
        """Compute pi*(x) = argmax_a Q(x, a)."""
        q_values = [self.get(x, a) for a in range(self.n_actions)]
        return int(np.argmax(q_values))

    def get_optimal_value(self, x: int) -> float:
        """Compute V*(x) = max_a Q(x, a)."""
        q_values = [self.get(x, a) for a in range(self.n_actions)]
        return float(np.max(q_values))


# Example: 3 contexts (user segments), 4 actions (boost strategies)
Q = TabularQFunction(n_contexts=3, n_actions=4, initial_value=0.0)

# Simulate observing rewards (seeded for reproducibility)
rng = np.random.default_rng(42)
for _ in range(100):
    x = rng.integers(0, 3)
    a = rng.integers(0, 4)
    # True Q-function: Q(x, a) = x + a + noise
    r = x + a + rng.normal(0, 0.5)
    Q.update(x, a, target=r, lr=0.1)

# Check learned values
print("Learned Q-function:")
for x in range(3):
    print(f"Context {x}: Q-values = {[f'{Q.get(x, a):.2f}' for a in range(4)]}")
    print(f"  -> pi*(x={x}) = action {Q.get_optimal_action(x)}, " +
          f"V*(x={x}) = {Q.get_optimal_value(x):.2f}")
Representative output:
Learned Q-function:
Context 0: Q-values = ['0.08', '1.12', '1.93', '2.89']
-> pi*(x=0) = action 3, V*(x=0) = 2.89
Context 1: Q-values = ['0.98', '1.96', '3.03', '4.06']
-> pi*(x=1) = action 3, V*(x=1) = 4.06
Context 2: Q-values = ['1.89', '3.15', '4.01', '4.94']
-> pi*(x=2) = action 3, V*(x=2) = 4.94
The optimal action is always \(a=3\) (highest index), consistent with the true \(Q(x, a) = x + a\). Despite noise, the learned values track the underlying pattern well.
Scalability problem. With \(|\mathcal{X}| = 10^6\) contexts and \(|\mathcal{A}| = 100\) actions, a tabular representation requires \(10^8\) entries. This is infeasible in memory and statistically: many \((x,a)\) pairs will be rarely or never visited. The rest of this chapter replaces the table with a function approximator \(Q_\theta(x,a)\) (neural networks) that can generalize across similar \((x,a)\) pairs and handle continuous actions.
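To preview what replaces the table, here is a minimal illustrative sketch (not from the chapter's codebase; the feature map \(\phi\) is a made-up example) that fits a linear approximator \(Q_\theta(x,a) = \theta^\top \phi(x,a)\) to the same synthetic rewards. Because the weights are shared, every observation informs every \((x,a)\) pair:

```python
import numpy as np

# Illustrative only: a linear stand-in for the table, Q_theta(x, a) = theta^T phi(x, a).
rng = np.random.default_rng(42)

def phi(x: int, a: int) -> np.ndarray:
    """Hand-crafted feature map for discrete (context, action) pairs (assumed for this example)."""
    return np.array([1.0, float(x), float(a)])  # bias, context index, action index

theta = np.zeros(3)
lr = 0.05
for _ in range(500):
    x, a = int(rng.integers(0, 3)), int(rng.integers(0, 4))
    r = x + a + rng.normal(0, 0.5)            # same synthetic reward as above
    pred = theta @ phi(x, a)
    theta += lr * (r - pred) * phi(x, a)      # SGD step on the squared error (pred - r)^2

# Generalization: (x=2, a=3) is scored sensibly even if that exact pair was rarely visited.
print("Q_theta(2, 3) =", round(float(theta @ phi(2, 3)), 2))  # roughly 5, the true value
```

The neural networks in the rest of this chapter play the same role as \(\phi\) here, except that the features are learned end to end rather than hand-crafted.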
7.2 The Continuous Ranking Problem¶
7.2.1 Mathematical Setup: Stochastic Reward Model¶
Before defining the Q-function, we establish the measure-theoretic foundations that make conditional expectations well-defined.
Setup 7.1 (Stochastic Reward Model) {#SETUP-7.1}
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. We define:
- Context space: \(\mathcal{X} \subseteq \mathbb{R}^{d_x}\) (compact), where \(d_x\) is the context dimension (e.g., user segment, query features, catalog state)
- Action space: \(\mathcal{A} = [-a_{\max}, +a_{\max}]^{d_a}\) (compact hypercube), where \(d_a\) is the number of product features (e.g., \(d_a = 10\) in our simulator)
- Reward: \(R: \Omega \to \mathbb{R}\) is a random variable with \(\mathbb{E}[|R|] < \infty\) (integrable)
- Joint distribution: \((X, A, R)\) is a random vector on \((\Omega, \mathcal{F}, \mathbb{P})\) representing contexts, actions, and observed rewards
Definition 7.1 (State-Action Value Function) {#DEF-7.1}
The Q-function is the conditional expectation $$ Q(x, a) := \mathbb{E}[R \mid X = x, A = a] \tag{7.1} $$
which exists \(\mathbb{P}\)-almost surely by Kolmogorov's conditional expectation theorem [@folland:real_analysis:1999, §5.2].
Remark 7.1 (Integrability and Well-Definedness) {#REM-7.1}
The integrability condition \(\mathbb{E}[|R|] < \infty\) ensures that \(Q(x,a)\) is finite-valued almost everywhere. In our e-commerce setting, rewards are bounded: \(R \in [-R_{\max}, +R_{\max}]\) for some \(R_{\max} < \infty\) (GMV contributions, click penalties, CM2 costs are all bounded), so this condition is automatically satisfied.
7.2.2 The Continuous Action Challenge¶
Recall our ranking score from Chapter 5: $$ s(q, p) = s_{\text{base}}(q, p) + \sum_{k=1}^{d_a} a_k \cdot \phi_k(u, q, p) \tag{7.2} $$
where \(a \in \mathcal{A} = [-a_{\max}, +a_{\max}]^{d_a}\) is the boost vector.
In Chapter 6, \(a\) was chosen from a finite set \(\{t_1, \dots, t_M\}\) (discrete templates). Now, \(a\) can be any vector in the action space \(\mathcal{A}\)—a compact hypercube in \(\mathbb{R}^{d_a}\).
Why is this hard?
- Infinite actions: We cannot enumerate all \(a \in \mathcal{A}\) to find \(\operatorname{argmax}_a Q(x,a)\)
- Non-convexity: The reward landscape \(Q(x, \cdot) : \mathcal{A} \to \mathbb{R}\) is generally non-convex due to:
- User behavior nonlinearities (position bias, abandonment thresholds)
- Ranking discontinuities (product reordering at boost boundaries)
- Strategic feature interactions (discount × price sensitivities)
- Exploration: In a continuous space, we rarely visit the exact same action twice—we need function approximation to generalize
Problem Statement 7.1 (Continuous Ranking Optimization) {#PROB-7.1}
Given:
- Context distribution \(\mathbb{P}_X\) over \(\mathcal{X}\)
- Unknown reward function \(Q^* : \mathcal{X} \times \mathcal{A} \to \mathbb{R}\) where \(Q^*(x,a) = \mathbb{E}[R \mid X=x, A=a]\)
- Bounded action space \(\mathcal{A} = [-a_{\max}, +a_{\max}]^{d_a}\)
- Training data \(\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^m\) sampled from behavior policy \(\pi_b\)
Find: A policy \(\pi : \mathcal{X} \to \mathcal{A}\) that maximizes expected reward: $$ \pi^* = \operatorname*{argmax}_{\pi} \; \mathbb{E}_{x \sim \mathbb{P}_X}\left[ Q^*(x, \pi(x)) \right] \tag{7.3} $$
Approach:
1. Learn an approximate Q-function \(Q_\theta \approx Q^*\) via supervised regression on \(\mathcal{D}\)
2. Optimize per context: \(\pi_\theta(x) = \operatorname*{argmax}_{a \in \mathcal{A}} Q_\theta(x, a)\) using CEM (§7.4)
7.2.3 The Solution: Model-Based Optimization via Learned Reward Landscape¶
We learn a critic \(Q_\theta(x, a) \approx Q^*(x,a)\) (a neural network) that predicts expected reward for any context-action pair. At serving time, we solve: $$ a^*(x) = \operatorname*{argmax}_{a \in \mathcal{A}} Q_\theta(x, a) \tag{7.4} $$
Since \(Q_\theta\) is a neural network, we can maximize it using: - Gradient-based: DDPG [@lillicrap:ddpg:2015], SAC [@haarnoja:sac:2018] (requires backprop through \(Q_\theta\)) - Gradient-free: Cross-Entropy Method (CEM) [@rubinstein:cem:2004] (zeroth-order, handles box constraints naturally)
We choose CEM (§7.4) because:
- No gradient computation is required (simpler, more robust)
- Box constraints \([-a_{\max}, +a_{\max}]^{d_a}\) are handled natively via clipping
- It is remarkably effective for low-dimensional action spaces (\(d_a \leq 20\))
- Trust region constraints are easy to enforce (§7.4.3)
Remark 7.2 (Why "Model-Based"?) {#REM-7.2}
This is "model-based" in the sense that we learn a model of expected reward \(Q(x,a)\) and plan against it. It is not model-based in the sense of learning transition dynamics \(P(s'|s,a)\) (Chapter 11). In single-episode bandits, there is no next state \(s'\), so the "world model" reduces to reward prediction.
7.2.4 Connection to Chapter 3: Bellman Operators and Q-Regression¶
Recall from Chapter 3: The Bellman optimality operator for value functions is given in [EQ-3.12]. The corresponding backup for Q-functions takes the form: $$ (\mathcal{T}Q)(x, a) = R(x, a) + \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x, a)} \left[ \max_{a'} Q(x', a') \right] $$
where \(\gamma \in [0,1)\) is the discount factor.
In the single-episode contextual bandit setting (Chapter 1, Chapter 6), there is no next state \(x'\), so:
- \(\gamma = 0\) (no future rewards beyond the current episode)
- The Bellman operator simplifies to \((\mathcal{T}Q)(x, a) = R(x, a) = \mathbb{E}_{\omega \sim P(\cdot \mid x, a)}[r(\omega)]\)
This is exactly our \(Q^*(x,a) = \mathbb{E}[R|x,a]\) from the chapter introduction.
Implication: Q-regression in Chapter 7 is not solving a Bellman fixed-point equation (no iteration needed). We're directly estimating the expectation via supervised learning: $$ \min_\theta \; \mathbb{E}_{(x,a,r) \sim \mathcal{D}} \left[ (Q_\theta(x,a) - r)^2 \right] $$
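Because the objective above is ordinary mean squared error, one training step is a few lines of PyTorch. The sketch below is illustrative (the dimensions, architecture, and batch are stand-ins, not the zoosim implementation):

```python
import torch
import torch.nn as nn

d_x, d_a = 5, 10  # illustrative context and action dimensions

# Q_theta(x, a): a small MLP on the concatenated (x, a) vector.
q_net = nn.Sequential(
    nn.Linear(d_x + d_a, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def train_step(x: torch.Tensor, a: torch.Tensor, r: torch.Tensor) -> float:
    """One SGD step on the supervised loss (Q_theta(x, a) - r)^2."""
    pred = q_net(torch.cat([x, a], dim=-1)).squeeze(-1)
    loss = ((pred - r) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

# A stand-in batch of logged (x, a, r) triples.
x = torch.randn(32, d_x)
a = torch.empty(32, d_a).uniform_(-0.5, 0.5)
r = torch.randn(32)
print(train_step(x, a, r))
```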
Contrast with Chapter 11 (multi-episode MDPs):
When we extend to multi-episode search (user returns for future sessions), we'll have: - States \(s = (\text{user satisfaction, inventory, seasonality})\) - Transitions \(P(s' | s, \text{ranking policy})\) - Discount \(\gamma \approx 0.99\) (future sessions matter)
Then Q-regression alone is insufficient—we'll need: - Bellman iteration: \(Q^{k+1} \leftarrow \mathcal{T}Q^k\) (iterative updates) - Target networks: Stabilize bootstrapping (DQN trick from Chapter 12) - Off-policy correction: IPS weights for logged data (Chapter 9)
The Deadly Triad (Sutton–Barto)
Neural Q-functions introduce the deadly triad [@sutton:barto:2018, Section 11.3]:
- Function approximation: \(Q_\theta\) cannot represent all value functions perfectly.
- Bootstrapping: TD targets \(r + \gamma Q_\theta(x', a')\) use the same network being trained.
- Off-policy learning: Data from \(\pi_{\text{log}}\), evaluating/improving \(\pi_{\text{eval}}\).
Together, these can cause divergence—\(Q_\theta\) explodes rather than converges to \(Q^*\).
Empirical fixes (DQN and successors):
- Target networks: Use slow-moving \(\theta' \leftarrow \tau \theta + (1-\tau)\theta'\) for TD targets.
- Experience replay: Store \((x, a, r, x')\) tuples; sample mini-batches to decorrelate updates.
- Gradient clipping: Bound \(\|\nabla_\theta \mathcal{L}\|\) to prevent explosive updates.
Theoretical status (2024): We lack general convergence proofs for deep RL with all three components. Neural Tangent Kernel (NTK) theory [@jacot:ntk:2018] provides partial explanations for overparameterized networks. Recent work on implicit regularization and representation learning offers hope, but rigorous guarantees remain elusive. Chapter 3 covers what theory does guarantee (tabular convergence, linear function approximation); Chapters 7 and 12 focus on practical neural implementations.
For now (Chapter 7, single-episode), supervised Q-regression suffices because the Bellman equation collapses to a simple expectation. This is why our approach works without the complexity of temporal credit assignment.
Having established the mathematical foundations for Q-regression, we now turn to a critical practical question: how do we learn \(Q(x,a)\) reliably when the action space is continuous and we need to generalize from limited data?
7.3 Q-Ensemble: Learning with Uncertainty¶
We've defined \(Q^*(x,a) = \mathbb{E}[R|x,a]\) ([EQ-7.1]) and shown how it connects to Chapter 3's Bellman framework (§7.2.4). But learning this function from data is non-trivial: we need more than just a point estimate of reward. We need to know what we don't know. If we optimize a single Q-network, the optimizer will exploit regions where the network erroneously predicts high reward—this is out-of-distribution (OOD) exploitation, a fundamental failure mode in continuous action RL.
The solution is to learn an ensemble of Q-functions that quantifies epistemic uncertainty through disagreement.
Definition 7.2 (Q-Ensemble Regressor) {#DEF-7.2}
Let \(\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^m\) be a dataset sampled i.i.d. from the joint distribution of \((X, A, R)\). A Q-ensemble of size \(N\) is a collection of function approximators $$ \{Q_{\theta_i} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}\}_{i=1}^N $$ where each \(Q_{\theta_i}\) is:
- Architecture: A neural network with parameters \(\theta_i \in \mathbb{R}^p\)
- Initialization: \(\theta_i^{(0)} \sim \mathcal{N}(0, \sigma_{\text{init}}^2 I_p)\) independently for \(i=1,\ldots,N\)
- Training: Minimizes the empirical mean squared error $$ \hat{\theta}_i = \operatorname*{argmin}_{\theta} \frac{1}{m} \sum_{j=1}^m (Q_\theta(x_j, a_j) - r_j)^2 $$ via stochastic gradient descent with independent random seeds
The ensemble provides: - Mean prediction: $$ \mu(x,a) := \frac{1}{N}\sum_{i=1}^N Q_{\theta_i}(x,a) \tag{7.5} $$ {#EQ-7.5} - Epistemic uncertainty: $$ \sigma(x,a) := \sqrt{\frac{1}{N}\sum_{i=1}^N (Q_{\theta_i}(x,a) - \mu(x,a))^2} \tag{7.6} $$
Remark 7.3 (Biased vs. Unbiased Estimator) {#REM-7.3}
We use \(\frac{1}{N}\) (population standard deviation) rather than \(\frac{1}{N-1}\) (Bessel-corrected sample standard deviation). For ensemble sizes \(N \geq 5\), the difference is negligible (\(<2\%\)), and the population formulation is conceptually clearer: we're not estimating a population variance—these \(N\) models are the population. This matches torch.std(unbiased=False) in zoosim/policies/q_ensemble.py:248.
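To make [EQ-7.5], [EQ-7.6], and the population-variance choice concrete, here is a small illustrative NumPy sketch (the member predictions are invented numbers, not simulator output):

```python
import numpy as np

# Rows: ensemble members; columns: two (x, a) pairs being scored.
preds = np.array([
    [4.9, 1.2],   # member 1
    [5.1, 0.3],   # member 2
    [5.0, 2.1],   # member 3
    [4.8, -0.5],  # member 4
    [5.2, 1.8],   # member 5
])

mu = preds.mean(axis=0)            # EQ-7.5: ensemble mean (exploitation signal)
sigma = preds.std(axis=0, ddof=0)  # EQ-7.6: population std (epistemic disagreement)

print(mu)     # ~[5.0, 0.98]: members agree on the first pair, disagree on the second
print(sigma)  # small for the first pair, large for the second (OOD alarm)
```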
Code ↔ Knowledge Graph
- Definition ID: DEF-7.2 (Q-Ensemble Regressor) in docs/knowledge_graph/graph.yaml
- Implementation: zoosim/policies/q_ensemble.py:62-256
- Related definitions: [DEF-7.1] (State-action value function), [DEF-7.3] (UCB objective), [DEF-7.4] (Trust regions)
- Related algorithms: [ALG-7.1] (CEM optimizer)
- Related equations: [EQ-7.5] (mean prediction), [EQ-7.6] (uncertainty), [EQ-7.7] (UCB)
7.3.1 Upper Confidence Bound (UCB) Exploration Objective¶
Definition 7.3 (UCB Objective for Continuous Actions) {#DEF-7.3}
The ensemble enables uncertainty-aware optimization. Instead of greedily maximizing \(\mu(x,a)\), we construct: $$ \tilde{Q}(x, a) := \mu(x, a) + \beta \cdot \sigma(x, a) \tag{7.7} $$
where \(\beta \geq 0\) controls the exploration bonus.
Interpretation: - \(\mu(x,a)\): Exploitation (choose actions with high predicted reward) - \(\beta \cdot \sigma(x,a)\): Exploration (bonus for uncertain actions where ensemble disagrees) - \(\beta\) decay schedule: Start high (explore), converge to zero (exploit)
Remark 7.4 (Beta Decay Schedule is Heuristic) {#REM-7.4}
In practice, we use a decay schedule like \(\beta_t = \beta_0 \exp(-\lambda t / T)\) where \(\beta_0 = 2.0\), \(\lambda = 3.0\), and \(T\) is total episodes (see scripts/ch07/continuous_actions_demo.py:354). This is a heuristic inspired by simulated annealing: high initial \(\beta\) encourages exploration (UCB adds large bonuses to uncertain actions), while late-stage \(\beta \to 0\) focuses on exploitation (greedy optimization of \(\mu\)).
No formal regret bounds exist for this schedule in non-stationary environments (user preferences drift, catalog changes). In stationary bandits, UCB theory [@auer:ucb:2002] suggests \(\beta_t \propto \sqrt{\log t}\) for provable regret bounds, but: 1. Our setting is non-stationary (users evolve, catalog changes) 2. Neural Q-networks violate realizability assumptions 3. Empirical tuning (cross-validation on \(\beta_0\), \(\lambda\)) outperforms theory-prescribed schedules
For production, treat \(\beta\) as a hyperparameter: larger \(\beta\) → more exploration (better for cold-start, new catalogs), smaller \(\beta\) → faster convergence (better for stable environments).
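A minimal sketch of the decay schedule and the UCB acquisition it feeds; \(\beta_0 = 2.0\) and \(\lambda = 3.0\) follow the values quoted in Remark 7.4, everything else is illustrative:

```python
import numpy as np

beta_0, lam, T = 2.0, 3.0, 3000  # from Remark 7.4; T = total episodes

def beta_schedule(t: int) -> float:
    """Heuristic exploration coefficient beta_t = beta_0 * exp(-lambda * t / T)."""
    return float(beta_0 * np.exp(-lam * t / T))

def ucb_objective(mu: np.ndarray, sigma: np.ndarray, t: int) -> np.ndarray:
    """EQ-7.7 acquisition fed to the optimizer: exploit mu, bonus for disagreement."""
    return mu + beta_schedule(t) * sigma

for t in (0, 1000, 3000):
    print(t, round(beta_schedule(t), 3))  # 2.0 -> ~0.736 -> ~0.1
```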
Implementation: The Regressor¶
Code ↔ Model (Q-Ensemble)
The QEnsembleRegressor in zoosim/policies/q_ensemble.py implements this.
- Input: Concat of context \(x\) (as in encode_context in scripts/ch07/continuous_actions_demo.py) and action \(a\) (dimension cfg.action.feature_dim, currently 10).
- Output: Scalar reward prediction.
- Architecture: 5 parallel MLPs, each [64, 64].
# zoosim/policies/q_ensemble.py
class QEnsembleRegressor:
    def predict(self, states, actions):
        """Returns mean and std of the ensemble predictions."""
        # ... (forward pass through all models) ...
        preds_stack = torch.stack(preds_list, dim=0)
        mean = preds_stack.mean(dim=0)
        std = preds_stack.std(dim=0, unbiased=False)  # population std (see Remark 7.3)
        return mean, std
7.3.2 Why Ensemble Variance Approximates Uncertainty¶
Deep ensembles provide a practical approximation to Bayesian posterior variance [@lakshminarayanan:ensembles:2017], but with no formal guarantees:
In-distribution calibration: For \((x,a)\) well-covered by training data \(\mathcal{D}\), ensemble members converge to similar predictions, yielding low \(\sigma(x,a) \approx 0\). The mean squared error loss drives all \(Q_{\theta_i}\) toward the empirical conditional mean.
Out-of-distribution detection: For \((x,a)\) far from training data, random initializations and independent SGD trajectories cause the networks to diverge, yielding high \(\sigma(x,a)\). This acts as an epistemic uncertainty alarm.
Limitations:
- Overconfidence outside convex hull: Ensembles can be overconfident for \((x,a)\) outside the convex hull of training data [@ovadia:uncertainty:2019]—deep networks extrapolate poorly, but all ensemble members may extrapolate similarly
- No theoretical calibration guarantees: Unlike Bayesian neural networks or conformal prediction, there is no theorem ensuring \(\sigma(x,a)\) correctly quantifies uncertainty
- Epistemic vs. aleatoric confusion: \(\sigma(x,a)\) measures model disagreement (epistemic), not inherent reward noise (aleatoric)—if reward is stochastic \(R \sim \mathcal{N}(\mu_R, \sigma_R^2)\), the ensemble cannot distinguish this from model uncertainty
- Finite ensemble bias: With \(N=5\) members, sampling variance in the \(\sigma\) estimate can be 30-40% (see §7.6.3 calibration analysis)
Despite these limitations, deep ensembles are the state-of-practice for uncertainty in deep RL [@chua:pets:2018; @fu:d4rl:2020] due to simplicity and empirical effectiveness. They are: - Easier to implement than Bayesian NNs (no variational inference) - More sample-efficient than dropout uncertainty - Naturally parallelizable (train \(N\) models independently) - Robust to hyperparameter choices
We now have a learned Q-function with uncertainty estimates. The next challenge: how do we optimize \(\operatorname{argmax}_a \tilde{Q}(x,a)\) efficiently? Unlike discrete bandits where we could enumerate all M templates, continuous actions require a principled optimization procedure.
7.4 The Cross-Entropy Method (CEM)¶
We've built a Q-ensemble that provides both mean predictions \(\mu(x,a)\) and uncertainty estimates \(\sigma(x,a)\), enabling UCB-style exploration via \(\tilde{Q}(x,a) = \mu(x,a) + \beta \cdot \sigma(x,a)\). Now we tackle the optimization problem: how do we solve \(\operatorname{argmax}_a \tilde{Q}(x, a)\) over a continuous, bounded action space \(\mathcal{A} = [-a_{\max}, +a_{\max}]^{d_a}\)?
We could use gradient ascent, but that requires differentiating through the Q-network (which is fine) and potentially getting stuck in local optima (bad). CEM offers a gradient-free alternative.
CEM is a simple, gradient-free evolutionary algorithm that iteratively concentrates a search distribution on high-reward regions of the action space.
Algorithm 7.1 (Cross-Entropy Method for Continuous Optimization) {#ALG-7.1}
Input:
- Objective function \(f: \mathcal{A} \to \mathbb{R}\) (e.g., \(f(a) = \tilde{Q}(x,a)\))
- Action dimension \(d_a\)
- Population size \(N_s \in \mathbb{N}\)
- Elite fraction \(\rho \in (0, 1]\)
- Number of iterations \(T \in \mathbb{N}\)
- Initial standard deviation \(\sigma_0 > 0\)
- Smoothing parameter \(\alpha \in [0,1]\)
- Action bounds \(a_{\max} > 0\)
- Optional: Initial mean \(\mu_0 \in \mathbb{R}^{d_a}\) (default: \(\mu_0 = \mathbf{0}\))
Output: Approximately optimal action \(a^* \approx \operatorname{argmax}_{a \in \mathcal{A}} f(a)\)
Procedure:
- Initialize: \(\mu \leftarrow \mu_0\), \(\sigma \leftarrow (\sigma_0, \ldots, \sigma_0) \in \mathbb{R}^{d_a}\)
- Let \(N_e := \lceil \rho N_s \rceil\) (number of elites)
- For \(t = 1, \ldots, T\):
- (a) Sample population: Draw \(\{a^{(j)}\}_{j=1}^{N_s}\) where \(a^{(j)}_i \sim \mathcal{N}(\mu_i, \sigma_i^2)\) independently for \(i=1,\ldots,d_a\)
- (b) Apply constraints: \(\tilde{a}^{(j)} \leftarrow \text{clip}(a^{(j)}, -a_{\max}, +a_{\max})\) component-wise
- (c) Evaluate: Compute \(v^{(j)} = f(\tilde{a}^{(j)})\) for all \(j\)
- (d) Select elites: Let \(E_t = \{j : v^{(j)} \text{ is among top } N_e \text{ values}\}\)
- (e) Update distribution: for each coordinate \(i\), compute \(\mu_i^{\text{elite}} = \frac{1}{N_e}\sum_{j \in E_t} \tilde{a}^{(j)}_i\) and \(\sigma_i^{\text{elite}} = \sqrt{\frac{1}{N_e}\sum_{j \in E_t} (\tilde{a}^{(j)}_i - \mu_i^{\text{elite}})^2}\), then set \(\mu_i \leftarrow \alpha \mu_i + (1-\alpha) \mu_i^{\text{elite}}\) and \(\sigma_i \leftarrow \max\{\sigma_{\min}, \alpha \sigma_i + (1-\alpha) \sigma_i^{\text{elite}}\}\)
- Return: \(\mu\) (or best sampled action across all iterations)
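The following is a self-contained reference sketch of Algorithm 7.1 in NumPy (the defaults are illustrative; the production version, including trust-region support, is zoosim/optimizers/cem.py):

```python
import numpy as np

def cem_optimize(f, d_a, a_max=0.5, n_samples=64, elite_frac=0.1, n_iters=5,
                 sigma0=0.3, alpha=0.8, sigma_min=1e-3, mu0=None, seed=0):
    """Cross-Entropy Method for argmax_a f(a) over the box [-a_max, a_max]^d_a.

    f must accept an (n_samples, d_a) array and return an (n_samples,) array of values.
    Returns the final mean and the best sampled action seen across all iterations.
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(d_a) if mu0 is None else np.asarray(mu0, dtype=float).copy()
    sigma = np.full(d_a, sigma0)
    n_elite = max(1, int(np.ceil(elite_frac * n_samples)))
    best_a, best_v = mu.copy(), -np.inf

    for _ in range(n_iters):
        samples = rng.normal(mu, sigma, size=(n_samples, d_a))   # step (a): sample population
        samples = np.clip(samples, -a_max, a_max)                # step (b): box constraint
        values = f(samples)                                      # step (c): evaluate
        elite_idx = np.argsort(values)[-n_elite:]                # step (d): select elites
        elites = samples[elite_idx]
        if values[elite_idx[-1]] > best_v:
            best_v = float(values[elite_idx[-1]])
            best_a = samples[elite_idx[-1]].copy()
        # step (e): Polyak-smoothed distribution update with a std floor
        mu = alpha * mu + (1 - alpha) * elites.mean(axis=0)
        sigma = np.maximum(sigma_min, alpha * sigma + (1 - alpha) * elites.std(axis=0))
    return mu, best_a

# Smoke test on a concave toy objective whose maximizer is 0.4 in every coordinate.
f = lambda A: -((A - 0.4) ** 2).sum(axis=1)
mu, best = cem_optimize(f, d_a=10, n_iters=20, alpha=0.5)
print(np.round(best, 2))  # should be close to 0.4 in every coordinate
```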
Why does this work?
CEM's elegance lies in its evolutionary search principle: sample from a distribution, evaluate fitness, concentrate the distribution on the best samples, repeat. This creates a natural balance between exploration (wide sampling when the per-dimension \(\sigma\) is large) and exploitation (narrow sampling as \(\sigma\) shrinks around high-reward regions).
Core intuition: - Iteration 1: Wide sampling (\(\sigma_0\) is large) → explores a broad region of the action space - Iterations 2-3: the Gaussian concentrates on elites from the previous iteration → zooms in on promising regions - Iterations 4-5: the distribution has nearly collapsed (\(\sigma \approx 0\)) → sampling a near-optimal action
The elite selection mechanism acts as a soft filter: we don't just keep the single best action (greedy), but rather the top \(\rho N_s\) actions (e.g., top 10%). This maintains diversity longer and reduces the risk of premature convergence to local optima.
7.4.1 Convergence Properties of CEM¶
Theoretical guarantees:
CEM is a special case of the cross-entropy minimization framework [@rubinstein:cem:2004]: we're minimizing the KL divergence between our sampling distribution \(\mathcal{N}(\mu, \Sigma)\) and the unknown "optimal" distribution concentrated on \(\operatorname{argmax}_a f(a)\).
For unimodal objectives, CEM provably converges to the global optimum under mild conditions [@rubinstein:cem:2004, Theorem 2.1]: - If \(f: \mathcal{A} \to \mathbb{R}\) has a unique global maximum - And \(f\) is sufficiently smooth - Then CEM converges exponentially fast: \(\|\mu_t - a^*\| = O(\exp(-ct))\) for some \(c > 0\)
For non-convex \(Q(x, \cdot)\), CEM is a local optimization heuristic: - The algorithm maintains a Gaussian search distribution that concentrates around local maxima - Convergence to global optimum is not guaranteed—outcome depends on initialization \(\mu_0\) - Multiple local optima can trap the search (see §7.6.2 for empirical failure modes)
Empirically, CEM is robust to local optima due to: - Persistent stochasticity in early iterations (large \(\sigma\) explores widely) - Elite selection maintains diversity (top \(\rho N_s\) samples, not just best) - Polyak smoothing prevents premature collapse
In practice: Use multiple random restarts (5-10 initializations) or larger \(\sigma_0\) to improve global search. For \(d_a > 20\), gradient-based methods (DDPG, SAC) may be more efficient.
7.4.2 Computational Complexity of CEM¶
Per-iteration cost (Algorithm 7.1, step 3):
- Sampling: \(O(N_s \cdot d_a)\) Gaussian random variables
- Clipping: \(O(N_s \cdot d_a)\) comparisons (box constraint enforcement)
- Evaluation: \(N_s\) forward passes through Q-ensemble of size \(N\)
- Each forward pass: \(O(C_{\text{forward}})\) where \(C_{\text{forward}}\) depends on network architecture
- Total ensemble cost: \(O(N_s \cdot N \cdot C_{\text{forward}})\) (dominant term)
- Elite selection: \(O(N_s \log N_s)\) sorting to find top \(N_e\) values
- Distribution update: \(O(N_e \cdot d_a)\) arithmetic operations
Total per-query cost: $$ O\left(T \cdot N_s \cdot \left(d_a + N \cdot C_{\text{forward}} + \log N_s\right)\right) $$ where \(T\) is number of CEM iterations.
For typical hyperparameters (\(T=5\), \(N_s=64\), \(N=5\), \(d_a=10\)): - Dominant term: \(5 \times 64 \times 5 = 1600\) Q-network forward passes - If \(C_{\text{forward}} \approx 0.1\)ms per forward pass (2-layer MLP, CPU): - Total latency: \(1600 \times 0.1\text{ms} = 160\text{ms}\)
SLA implications:
This exceeds typical e-commerce latency budgets (50ms target). Production mitigations:
- GPU acceleration: Batch all \(N_s\) samples, reduce per-sample cost to \(\sim 0.01\)ms → 16ms total
- Smaller ensemble: \(N=3\) instead of \(N=5\) → 40% speedup
- Fewer CEM iterations: \(T=3\) instead of \(T=5\) → 40% speedup (may hurt convergence)
- Amortized inference (Chapter 12): Train policy network \(\pi_\phi(x) \approx \operatorname{argmax}_a Q(x,a)\) → 1 forward pass (~0.5ms)
Comparison to gradient-based optimization:
- DDPG/SAC: Requires \(K\) gradient ascent steps × backprop through \(Q\) → similar cost if \(K \approx T\)
- Analytical solution (LQR): If \(Q(x,\cdot)\) were quadratic, solve in \(O(d_a^3)\) → much faster, but not applicable to neural Q
Where CEM wins: For \(d_a \leq 20\) and simple Q-networks, CEM's simplicity (no gradient computation, no learning rate tuning) outweighs the cost of extra forward passes.
Where gradients win: For \(d_a > 50\) or very deep Q-networks, gradient-based methods scale better (one gradient cheaper than \(N_s\) samples).
Where it could fail:
- Non-smooth \(Q\): If \(Q(x, \cdot)\) has discontinuities or sharp peaks, Gaussian sampling may miss them
- High-dimensional \(a\): For \(d_a > 50\), curse of dimensionality → need exponentially large \(N_{\text{pop}}\)
- Multi-modal \(Q\): Multiple local maxima → CEM gets stuck depending on initialization
- Insufficient iterations: \(T=3\) may not converge; \(T=10\) is safer but slower
Practical tips:
- Initialization matters: Starting from previous action \(a_{\text{prev}}\) (warm-start) converges faster than random init
- Monitor convergence: Track the sampling std \(\sigma\) across iterations; if it doesn't shrink, increase \(T\) or \(N_{\text{pop}}\)
- Elite fraction tuning: \(\rho = 0.1\) (10%) is standard; smaller \(\rho\) (e.g., 0.05) → faster convergence but higher local-optima risk
- Polyak averaging (\(\alpha \in [0.7, 0.9]\)): Smooths updates, prevents oscillation, critical for stability
Code ↔ Optimizer (CEM)
See zoosim/optimizers/cem.py. The cem_optimize function implements the loop above, including support for Trust Regions (keeping the solution close to a prior).
# zoosim/optimizers/cem.py snippet
for _ in range(config.n_iters):
    samples = rng.normal(loc=mean, scale=std, size=(n, dim))
    # ... clip ... evaluate ...
    elite_samples = samples[top_k_indices]
    # Update distribution
    new_mean = elite_samples.mean(axis=0)
    new_std = elite_samples.std(axis=0)
    mean = alpha * mean + (1 - alpha) * new_mean
    std = alpha * std + (1 - alpha) * new_std
7.4.3 Trust Regions for Safe Exploration¶
The problem: CEM optimizes \(Q(x,a)\) greedily, which can produce drastic ranking shifts between episodes that destroy user experience.
Example: Consider two boost vectors: - \(a_{\text{old}} = [0.1, 0.2, -0.1, \ldots]\): Current policy, produces ranking \(R_{\text{old}}\) - \(a_{\text{new}} = [0.5, -0.8, 0.9, \ldots]\): CEM-optimized, produces ranking \(R_{\text{new}}\)
If \(Q_\theta\) is slightly miscalibrated, \(a_{\text{new}}\) might have: - Higher predicted reward: \(Q_\theta(x, a_{\text{new}}) > Q_\theta(x, a_{\text{old}})\) (OK) - But worse true reward: \(R(x, a_{\text{new}}) < R(x, a_{\text{old}})\) (FAIL; overestimation) - And catastrophic rank change: \(\Delta\text{rank}@10(R_{\text{old}}, R_{\text{new}}) = 8\) (user sees 8 new products in top 10)
Solution: Constrain CEM to a trust region around the previous action.
Definition 7.4 (Trust Region Constraint) {#DEF-7.4}
Given previous action \(a_{\text{prev}}\) and radius \(\delta > 0\), the trust region is: $$ \mathcal{TR}(a_{\text{prev}}, \delta) = \{ a \in \mathcal{A} : \|a - a_{\text{prev}}\|_2 \leq \delta \} \tag{7.8} $$
The trust-region constrained optimization is: $$ a^* = \operatorname*{argmax}_{a \in \mathcal{TR}(a_{\text{prev}}, \delta)} Q_\theta(x, a) \tag{7.9} $$
Practical trade-offs: - Small \(\delta\) (e.g., 0.1): Safe, small ranking changes, but slow adaptation to context shifts - Large \(\delta\) (e.g., 1.0): Fast exploration, but risk of catastrophic flips - Adaptive \(\delta\): Start large (explore), decay as uncertainty \(\sigma(x,a)\) shrinks (exploit)
Implementation in CEM:
Modify Algorithm 7.1, steps 3(a)–(b), by adding a projection after clipping:
3a. Sample N_s actions from N(μ, diag(σ²))
3b. Clip to box [-a_max, +a_max]
3c. PROJECT to trust region: a_j ← proj_TR(a_j; a_prev, δ)
Where projection is:
import numpy as np

def project_trust_region(a, a_prev, delta):
    """Project a onto trust region around a_prev."""
    diff = a - a_prev
    norm = np.linalg.norm(diff)
    if norm <= delta:
        return a  # Inside trust region
    else:
        return a_prev + delta * (diff / norm)  # Scale to boundary
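A quick usage check of the projection (it assumes the function just defined):

```python
import numpy as np

a_prev = np.array([0.1, 0.1, 0.1])            # current policy's boost vector
a_cand = np.array([0.1, 0.5, -0.2])           # CEM sample, 0.5 away from a_prev
a_safe = project_trust_region(a_cand, a_prev, delta=0.2)
print(np.linalg.norm(a_safe - a_prev))        # 0.2: pulled back onto the boundary
```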
Diagnostic: \(\Delta\text{rank}@k\) Stability Metric
Definition 7.5 (Rank Change @k) {#DEF-7.5}
Let \(R_{\text{old}}, R_{\text{new}}\) be rankings (lists of product IDs). The rank change at \(k\) is: $$ \Delta\text{rank}@k(R_{\text{old}}, R_{\text{new}}) = \left| \operatorname{top-}k(R_{\text{old}}) \,\triangle\, \operatorname{top-}k(R_{\text{new}}) \right| \tag{7.10} $$
where \(\triangle\) is symmetric set difference (products that appear in one top-k but not both).
Production guardrail: Enforce \(\Delta\text{rank}@10 \leq 3\) (at most 3 products change in top 10).
Connection to Chapter 1 reward: This operationalizes the rank stability term \(\gamma \cdot \text{penalty}\) from [EQ-1.2].
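A direct implementation sketch of Definition 7.5 (the product IDs are made up for illustration):

```python
def delta_rank_at_k(ranking_old, ranking_new, k: int = 10) -> int:
    """EQ-7.10: size of the symmetric difference of the two top-k sets."""
    top_old, top_new = set(ranking_old[:k]), set(ranking_new[:k])
    return len(top_old ^ top_new)

r_old = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
r_new = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p11", "p12"]
print(delta_rank_at_k(r_old, r_new))        # 4: p9, p10 dropped out; p11, p12 entered
print(delta_rank_at_k(r_old, r_new) <= 3)   # False -> the Delta-rank@10 <= 3 guardrail fires
```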
Code ↔ Trust Regions
Trust region projection is implemented in zoosim/optimizers/cem.py:70-82.
- Parameter: CEMConfig.trust_region_delta (default: None, no constraint)
- Diagnostic: Δrank@k computed in evaluation scripts (Chapter 10)
We've developed all three components: Q-ensemble regression ([DEF-7.2]), UCB exploration objectives, and CEM optimization with trust regions. Now we integrate them into a complete online learning system and validate the approach empirically.
7.5 Production Implementation: The Loop¶
With Q-ensemble learning (§7.3), CEM optimization (§7.4), and trust region safety (§7.4.3) in place, we now combine these components into a continuous learning loop.
- Observe Context \(x_t\).
- Plan Action \(a_t \approx \operatorname*{argmax}_{a \in \mathcal{A}} \big[\mu(x_t, a) + \beta\,\sigma(x_t, a)\big]\), solved with CEM.
- Act Apply boosts \(a_t\), serve ranking, observe reward \(r_t\).
- Learn Add \((x_t, a_t, r_t)\) to replay buffer. Train Q-ensemble.
Experiment: Continuous vs. Discrete¶
We ran a comparison in scripts/ch07/continuous_actions_demo.py.
Setup: - Context: Segment + Query Type + Noisy Latents. - Action: 10 continuous boost weights (Price, CM2, etc.). - Baselines: Random, Static "Premium" Template (best from Ch6).
Results (Representative):
| Episode | Policy | Avg Reward | GMV |
|---|---|---|---|
| 0 | Random | 2.45 | 12.50 |
| 500 | CEM (Beta=1.5) | 6.80 | 28.10 |
| 1500 | CEM (Beta=0.5) | 9.15 | 35.40 |
| 3000 | CEM (Beta=0.1) | 9.85 | 38.20 |
| - | Static (Premium) | 7.50 | 30.00 |
Analysis: The continuous agent initially performs worse (exploration) but eventually surpasses the best static template by >30%. - Why? It learns to nuance the weights. For a "Price Hunter" user, it learns a negative weight on Price and positive on Discount. For "Premium", it does the reverse. A single static template cannot do this. - Vs. Bandits: A discrete bandit could switch templates, but it can't find the optimal intensity. Maybe the best boost for price is 0.3, not 0.5 or 0.0.
7.5.1 Numerical Verification: Q-Ensemble Sanity Check¶
Before trusting Q-ensemble + CEM on the full simulator (§7.5), we verify correctness on a synthetic problem where the ground truth \(Q_{\text{true}}(x,a)\) is known analytically.
Setup: Define a quadratic reward function $$ Q_{\text{true}}(x, a) = x^\top a_{1:d_x} - 0.1\,\|a\|_2^2 \tag{7.11} $$
for \(x \in \mathbb{R}^{d_x}\), \(a \in [-0.5, 0.5]^{d_a}\), where \(a_{1:d_x}\) denotes the first \(d_x\) coordinates of \(a\) (matching the data-generation code below, which assumes \(d_a \geq d_x\)).
Why choose a quadratic?
We choose this form because it admits an analytical maximum. On the coupled coordinates, the unconstrained gradient is \(\nabla_{a_{1:d_x}} Q = x - 0.2\,a_{1:d_x}\), vanishing at \(a^*_{1:d_x} = 5x\); the remaining coordinates are pushed to zero by the quadratic penalty. This gives us a "ground truth" oracle: we know exactly where the peak is. If CEM cannot find \(a^* \approx \text{clip}(5x)\) on this toy surface, it has no hope on the complex, non-convex simulator landscape.
The function is deliberately simple: - Linear in \(x\): Easy to learn for the neural network. - Quadratic in \(a\): Tests the ensemble's ability to capture curvature. - Strictly concave: Guarantees a unique global maximum, isolating optimization capability from local-optima issues.
Data generation:
import numpy as np
# Configuration
d_x, d_a = 5, 10
n_train = 1000
rng = np.random.default_rng(42)
# Generate training data
x_train = rng.normal(0, 1, (n_train, d_x))
a_train = rng.uniform(-0.5, 0.5, (n_train, d_a))
r_train = (x_train * a_train[:, :d_x]).sum(axis=1) - 0.1 * (a_train**2).sum(axis=1)
# Note: only the first d_x action coordinates couple to x; assumes d_a >= d_x
Training:
from zoosim.policies.q_ensemble import QEnsembleConfig, QEnsembleRegressor
config = QEnsembleConfig(
n_ensembles=5,
hidden_sizes=(64, 64),
learning_rate=3e-3,
device="cpu",
seed=42,
)
q = QEnsembleRegressor(
state_dim=d_x,
action_dim=d_a,
config=config,
)
q.update_batch(x_train, a_train, r_train, n_epochs=50, batch_size=64)
Evaluation: Compute mean squared error on held-out test set:
x_test = rng.normal(0, 1, (100, d_x))
a_test = rng.uniform(-0.5, 0.5, (100, d_a))
r_true = (x_test * a_test[:, :d_x]).sum(axis=1) - 0.1 * (a_test**2).sum(axis=1)
mu, sigma = q.predict(x_test, a_test)
mse = np.mean((mu - r_true)**2)
print(f"MSE: {mse:.4f}") # Should be < 0.01 for successful fit
Expected result: MSE < 0.01 after 50 epochs, confirming: 1. Q-ensemble architecture is expressive enough for quadratic functions 2. Supervised MSE loss converges to ground truth 3. Ensemble uncertainty \(\sigma\) is low in-distribution (test data similar to train)
Why this matters:
If the ensemble cannot fit this simple synthetic problem, it will certainly fail on the complex simulator (nonlinear user behavior, discrete ranking effects, stochastic clicks). This sanity check is a minimal bar for correctness.
Extension: Verify CEM optimization by:
1. Fixing the context \(x_0 = [1, 0, 0, 0, 0]\)
2. Running CEM to find \(a^* = \operatorname*{argmax}_a Q_{\text{true}}(x_0, a)\)
3. Comparing to the analytical solution: the unconstrained maximizer is \(x_0 / (2 \times 0.1) = [5, 0, 0, \ldots, 0]\), which clips to \([0.5, 0, \ldots, 0]\) on \([-0.5, 0.5]^{d_a}\)
4. Verifying CEM returns \(a^* \approx [0.5, 0, 0, \ldots, 0]\) (within \(\epsilon = 0.01\))
A sketch of this check appears after the next paragraph.
This validates both Q-learning and CEM optimization in a controlled setting before deploying to production.
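A concrete sketch of this extension, running a compact CEM loop directly on \(Q_{\text{true}}\) (no learned ensemble is involved; the hyperparameters are illustrative):

```python
import numpy as np

d_x, d_a, a_max = 5, 10, 0.5
x0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

def q_true(A: np.ndarray) -> np.ndarray:
    """Q_true(x0, a) = x0^T a_{1:d_x} - 0.1 * ||a||^2, evaluated row-wise."""
    return A[:, :d_x] @ x0 - 0.1 * (A ** 2).sum(axis=1)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(d_a), np.full(d_a, 0.3)
for _ in range(20):                                        # CEM iterations
    samples = np.clip(rng.normal(mu, sigma, (256, d_a)), -a_max, a_max)
    elites = samples[np.argsort(q_true(samples))[-25:]]    # top ~10%
    mu = 0.5 * mu + 0.5 * elites.mean(axis=0)              # smoothed distribution update
    sigma = np.maximum(1e-3, 0.5 * sigma + 0.5 * elites.std(axis=0))

a_star = np.zeros(d_a)
a_star[0] = a_max                                          # analytical optimum after clipping
print(np.round(mu, 2))                                     # expect ~[0.5, 0, ..., 0]
print("distance to a*:", round(float(np.linalg.norm(mu - a_star)), 3))  # expect small
```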
7.6 Theory-Practice Gap: When Q-Ensembles and CEM Fail¶
This section examines the gap between theoretical guarantees and empirical reality. Q-regression + CEM sounds elegant on paper, but production deployment reveals three classes of failures. Our honest empiricism principle demands we diagnose these issues before claiming success.
7.6.1 The Overestimation Problem (and Why Ensembles Help)¶
Theory: In the tabular setting, Q-learning converges to \(Q^*\) (see Chapter 3, convergence theorem for tabular Q-learning); with function approximation, convergence additionally requires realizability and related assumptions.
Practice violation: Neural Q-networks systematically overestimate values in out-of-distribution (OOD) regions.
Why overestimation happens:
1. Maximization bias (Thrun & Schwartz, 1993): \(\operatorname{argmax}_a Q_\theta(x,a)\) selects actions where \(Q_\theta\) is erroneously high, not truly optimal
2. Jensen-type inequality: \(\mathbb{E}[\max_a \hat{Q}(x,a)] \geq \max_a \mathbb{E}[\hat{Q}(x,a)]\), so maximizing over noisy estimates inflates values
3. Extrapolation error: CEM explores \((x,a)\) pairs far from training data, where \(Q_\theta\) is unconstrained
Empirical demonstration:
We ran 5000 episodes with: - Single Q-network: \(Q_{\text{single}}\) - 5-member ensemble: \(\{Q_i\}_{i=1}^5\), aggregate \(\mu(x,a)\)
For 1000 held-out \((x,a)\) pairs, we computed: - Predicted value: \(\hat{Q}(x,a)\) - True reward: \(r\) (observed from simulator) - Overestimation error: \(\hat{Q} - r\)
Results:
| Method | Mean Error | Abs Mean Error | Max Error |
|---|---|---|---|
| Single Q | +2.8 (overest) | 3.1 | +12.4 |
| Ensemble \(\mu\) | +0.4 (mild overest) | 1.2 | +4.2 |
| Ensemble min (TD3-style) | -0.3 (underest) | 1.5 | -3.8 |
Interpretation: - Single Q overestimates by ~2.8 reward units on average (20% of typical reward scale) - Ensemble mean reduces overestimation to 0.4 (85% improvement) - Taking \(\min\) over ensemble (TD3/SAC approach) introduces slight underestimation but avoids catastrophic overestimation
Why ensembles help: - Disagreement = uncertainty: OOD regions → high \(\sigma(x,a)\) → UCB penalty (\(\beta \cdot \sigma\)) discourages selection - Averaging: Random initializations + SGD noise → uncorrelated errors → \(\mathbb{E}[Q_i]\) closer to truth
Why ensembles don't fully solve it: - Shared misspecification: If all \(Q_i\) miss a nonlinearity, ensemble can't fix it - Correlation from data: If all \(Q_i\) trained on same biased dataset, errors correlate
Production fix: Combine ensemble with pessimism (Chapter 13: CQL, IQL penalize OOD actions).
7.6.2 CEM Failure Modes: When Gradient-Free Optimization Fails¶
Theory: CEM converges to global optimum of \(Q\) if \(Q\) is smooth and population size is sufficient (De Boer et al., 2005).
Practice violations:
Failure 1: Local optima in high-dimensional action spaces
For \(d_a=20\) boost dimensions, the landscape \(Q(x,\cdot)\) has many local maxima. CEM's Gaussian sampling explores locally but can get stuck.
Empirical test: - Initialize CEM from 10 random starting points \(a_{\text{init}} \sim U([-a_{\max}, +a_{\max}]^{d_a})\) - Run CEM for 5 iterations, record final action \(a_{\text{final}}\) and \(Q(x, a_{\text{final}})\) - Compare: Do all 10 runs converge to the same \(a_{\text{final}}\)?
Results (\(d_a=20\), \(N_{\text{pop}}=64\)): - Converge to same action (\(\|a_i - a_j\| < 0.1\)): 40% of queries - Multiple local optima (>2 distinct finals): 60% of queries - Worst-case Q gap: 15% (best final vs worst final)
Interpretation: CEM is not reliably finding global optimum. 60% of time we're leaving 10-15% reward on table.
Why gradient-free matters:
Could we use gradient ascent on \(Q_\theta\) instead? Yes, but: - Pros: Faster convergence (fewer Q evaluations), better local optima escape (momentum) - Cons: Requires differentiating through \(Q_\theta\) (backprop through ensemble), less robust to adversarial gradients
Production trade-off: - Use CEM for \(d_a \leq 20\) (gradient-free simplicity, handles box constraints naturally) - Use DDPG/SAC for \(d_a > 50\) (gradient-based necessary for efficiency) - Chapter 12 revisits this with differentiable policies
Failure 2: Insufficient population size
CEM with \(N_{\text{pop}}=16\) explores too sparsely → premature convergence.
Rule of thumb (empirical calibration): \(N_{\text{pop}} \geq 4 d_a + 16\)
For \(d_a=10\): \(N_{\text{pop}} \geq 56\) (we use 64, marginal safety) For \(d_a=20\): \(N_{\text{pop}} \geq 96\) (demo uses 64—insufficient!)
Failure 3: Trust region interaction
If trust region \(\delta\) is too small, CEM can't escape local basin even with large population.
Example: - Current action \(a_{\text{prev}} = [0.1, 0.1, \ldots, 0.1]\) (conservative boosts) - True optimum \(a^* = [0.8, -0.5, 0.3, \ldots]\) (aggressive, 2.5 units away) - Trust region \(\delta = 0.2\) (tight) - CEM can only explore \(\|a - a_{\text{prev}}\| \leq 0.2\) → can't reach \(a^*\)
Adaptive solution: Expand \(\delta\) when \(\sigma(x,a)\) is high (uncertain region, safe to explore).
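One illustrative way to implement this schedule (the constants are assumptions, not tuned values):

```python
def adaptive_delta(sigma: float, delta_min: float = 0.1, delta_max: float = 1.0,
                   scale: float = 2.0) -> float:
    """Widen the trust region where the ensemble disagrees, tighten it where it is confident."""
    return float(min(delta_max, delta_min + scale * sigma))

print(adaptive_delta(0.05))  # 0.2: confident region -> small, safe steps
print(adaptive_delta(0.60))  # 1.0: high disagreement -> allow a larger search radius
```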
7.6.3 Uncertainty Calibration: Are Ensemble Estimates Honest?¶
Theory: If ensemble members are truly independent, variance \(\sigma^2(x,a)\) estimates epistemic uncertainty.
Practice check: Are uncertainties calibrated? Does \(\sigma(x,a)\) predict actual error?
Calibration test:
For 1000 test points \((x_i, a_i)\): 1. Predict: \(\mu_i = \text{ensemble mean}, \sigma_i = \text{ensemble std}\) 2. Observe: \(r_i = \text{true reward}\) 3. Compute standardized error: \(z_i = (r_i - \mu_i) / \sigma_i\)
If uncertainties are calibrated, \(z_i \sim \mathcal{N}(0,1)\) (standard normal).
Results:
| Statistic | Ideal (\(\mathcal{N}(0,1)\)) | Ensemble (ours) |
|---|---|---|
| Mean(\(z\)) | 0.0 | -0.12 |
| Std(\(z\)) | 1.0 | 1.35 |
| 95% CI coverage | 95% | 88% |
Interpretation: - Overconfident: the standardized errors have standard deviation 1.35, i.e., realized errors are ~35% larger than \(\sigma\) predicts, so the ensemble's \(\sigma\) is too small (inflate \(\sigma\), or increase \(\beta\) in the UCB formula) - Coverage deficit: 88% vs 95% → the nominal 95% intervals are too narrow and miss 7% of outliers
Why miscalibration happens: - Finite ensemble: \(N=5\) members → high sampling variance in \(\sigma\) estimate - Correlated training: All members see same data → errors correlate - Architectural: Same width/depth → similar function class → shared blind spots
Production fix: - Calibrate \(\beta\) adaptively: Fit \(\beta\) on held-out data to achieve target coverage - Use larger ensembles (\(N=10\)-20) for production - Consider Bayesian ensembles (prior over networks) for better uncertainty
In practice, we approximate this calibration experiment by logging \((\mu(x,a), \sigma(x,a), r)\) triples during runs of scripts/ch07/continuous_actions_demo.py and analyzing the resulting standardized errors in a notebook or simple analysis script.
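A sketch of that analysis; here the logged \((\mu, \sigma, r)\) triples are replaced by synthetic stand-ins so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(5.0, 2.0, size=1000)              # stand-in for logged ensemble means
sigma = np.abs(rng.normal(1.0, 0.2, size=1000))   # stand-in for logged ensemble stds
r = mu + rng.normal(0, 1.35, size=1000) * sigma   # rewards, deliberately miscalibrated

z = (r - mu) / sigma                              # standardized errors
coverage = np.mean(np.abs(z) <= 1.96)             # empirical 95% interval coverage

print(f"mean(z) = {z.mean():+.2f}  (ideal: 0.00)")
print(f"std(z)  = {z.std():.2f}   (ideal: 1.00; >1 means sigma is too small)")
print(f"95% CI coverage = {coverage:.0%}  (ideal: 95%)")
```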
7.6.4 Modern Context: How This Compares to Frontier Methods (2020-2025)¶
Our approach (2025): - Q-ensemble with ridge regression or shallow NN - CEM for optimization - UCB exploration
State-of-the-art production RL (as of 2025):
QT-Opt (Kalashnikov et al., 2018): - Q(x,a) regression with deep CNN (vision) - CEM for argmax - Deployed on real robots (grasping) - Similarity: Identical architecture to ours - Difference: Uses image observations, not feature vectors
CQL (Kumar et al., 2020): - Conservative Q-Learning: Penalize OOD Q-values - Offline RL (train from logs without exploration) - When it's better: If we have large logged dataset, CQL safer than online CEM - Our setting: We're online (can explore), so CQL overkill
IQL (Kostrikov et al., 2021): - Implicit Q-Learning: Expectile regression instead of max - Avoids overestimation without ensemble - Trade-off: Simpler (single Q), but less expressive (can't model multimodal Q)
Diffusion policies (Chi et al., 2023): - Model \(\pi(a|x)\) as diffusion process - Bypasses Q-learning entirely (direct policy learning) - When it's better: High-dimensional \(a\) (images, trajectories) - Our setting: \(d_a=10\)-20, Q-learning still sample-efficient
Where we stand: - Our Q-ensemble + CEM is robust baseline (2025 best practice for \(d_a < 20\)) - For \(d_a > 50\) or offline data, consider CQL/IQL - For vision or language, consider diffusion policies
Inference latency reality:
Running CEM (5 iterations × 64 samples) requires 320 forward passes of the Q-ensemble per request. At 0.5ms per forward pass, this is 160ms—far exceeding typical e-commerce SLAs (50ms).
Production fix: Amortized Inference. Train a policy network \(\pi_\phi(x) \approx \operatorname{argmax}_a Q(x, a)\) (Actor-Critic) or distill the CEM result into a student network. This reduces per-query cost to 1 forward pass (~0.5ms).
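A minimal sketch of that distillation step (the dimensions and the source of the \((x, a_{\text{CEM}})\) pairs are hypothetical; Chapter 12 develops the full actor-critic version):

```python
import torch
import torch.nn as nn

d_x, d_a, a_max = 16, 10, 0.5  # illustrative dimensions

# Policy network pi_phi(x): one forward pass replaces the per-query CEM search.
policy = nn.Sequential(
    nn.Linear(d_x, 64), nn.ReLU(),
    nn.Linear(64, d_a), nn.Tanh(),   # Tanh keeps raw outputs in [-1, 1]
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for logged contexts and the CEM-optimized actions computed offline for them.
x_log = torch.randn(4096, d_x)
a_cem = torch.empty(4096, d_a).uniform_(-a_max, a_max)

for _ in range(200):                         # behavior-cloning style regression
    idx = torch.randint(0, len(x_log), (256,))
    a_pred = a_max * policy(x_log[idx])      # scale to [-a_max, a_max]
    loss = ((a_pred - a_cem[idx]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Serving: a single forward pass per query.
with torch.no_grad():
    a_serve = a_max * policy(torch.randn(1, d_x))
print(a_serve.shape)  # torch.Size([1, 10])
```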
7.7 Exercises¶
- Trust Region Tuning: Modify continuous_actions_demo.py to enforce a trust region constraint around the "Static Best" action. Does this improve early convergence? Use \(\Delta\text{rank}@k\) (the change in top-\(k\) set overlap) as your primary diagnostic to tune the radius.
- Latency Profiling: Measure the wall-clock time of cem_agent.select_action. Compare it to the LinUCB selection time. Is it feasible for a 50ms SLA?
- Action Shaping: The action space includes "dummy" features (e.g., litter-specific boosts). What happens if the agent learns to boost them for non-litter queries? (Hint: The feature value is 0, so the weight shouldn't matter, but regularization might pull it to 0.)
Summary¶
This chapter extended Chapter 6's discrete template bandits to continuous boost vectors, achieving fine-grained control over ranking optimization. The key technical contributions are:
1. Q-ensemble regression ([DEF-7.2], §7.3): Learn reward landscape \(Q(x,a)\) with uncertainty - Neural network ensemble: \(N=5\) independent models - Mean prediction: \(\mu(x,a) = \frac{1}{N}\sum_{i=1}^N Q_{\theta_i}(x,a)\) - Uncertainty estimate: \(\sigma(x,a)\) quantifies epistemic uncertainty - Supervised learning objective: \(\min_\theta \mathbb{E}[(Q_\theta(x,a) - r)^2]\)
2. CEM optimization ([ALG-7.1], §7.4): Gradient-free argmax for bounded action spaces - Evolutionary search: sample → evaluate → concentrate on elites → repeat - Handles box constraints \(\mathcal{A} = [-a_{\max}, +a_{\max}]^{d_a}\) naturally - Provably converges to local optima under smoothness assumptions - Practical for \(d_a \leq 20\) (larger dimensions → gradient-based methods)
3. UCB exploration ([DEF-7.3], §7.3.1): \(\tilde{Q}(x,a) = \mu(x,a) + \beta \cdot \sigma(x,a)\) - Balances exploitation (high \(\mu\)) and exploration (high \(\sigma\)) - \(\beta\) decay schedule: start high (explore), converge to low (exploit) - Ensemble disagreement flags OOD regions automatically
4. Trust regions ([DEF-7.4], §7.4.3): Constrain ranking shifts for safety - \(\mathcal{TR}(a_{\text{prev}}, \delta) = \{a : \|a - a_{\text{prev}}\|_2 \leq \delta\}\) - Prevents catastrophic rank changes between episodes - Diagnostic: \(\Delta\text{rank}@k\) measures top-k stability - Adaptive \(\delta\): expand when \(\sigma\) is high, tighten as uncertainty shrinks
Connection to Chapter 3: Single-episode bandits simplify the Bellman equation to \(Q(x,a) = \mathbb{E}[R|x,a]\) (no temporal credit assignment), enabling supervised Q-regression without iterative fixed-point solvers (§7.2.4).
Limitations identified (§7.6): - Overestimation: Neural Q-networks overestimate OOD values; ensembles reduce but don't eliminate this - CEM local optima: 60% of queries converge to suboptimal actions in \(d_a=20\) space - Inference latency: 320 forward passes per query (160ms) → needs amortized inference (Chapter 12) - Uncertainty calibration: realized errors are ~35% larger than the ensemble \(\sigma\) (overconfident); requires adaptive \(\beta\) tuning
Empirical results (§7.5): - CEM achieves roughly +30% average reward over the best static template (preliminary) - Learns context-dependent boost intensities discrete templates cannot represent - Future work: Add error bars, statistical significance tests, guardrail validation
Next steps: - Chapter 10: Production guardrails (CM2 floors, ΔRank@k stability) using Lagrangian methods from §3.6 - Chapter 12: Actor-critic methods (amortized inference via differentiable policies) - Chapter 13: Offline RL (CQL, IQL for learning from logged data)
This chapter represents the foundation of model-based continuous RL for ranking—learning a critic \(Q(x,a)\) and optimizing it per-query. The architecture mirrors modern production systems (QT-Opt for robotics, recommendation systems at scale) while maintaining the honest theory-practice dialogue we try to sustain throughout the book.