Chapter 9 — Off-Policy Evaluation: Learning from Logged Data¶

Vlad Prytula

Introduction: The Cost of Online Experimentation¶

Chapters 7 and 8 developed online reinforcement learning methods—Q-learning and REINFORCE—that learn by interacting directly with the environment. Our continuous Q-ensemble agent (Chapter 7) achieved returns of ~25.0 by exploring the boost space $\mathcal{A} = [-a_{\max}, +a_{\max}]^{10}$ through thousands of search queries. REINFORCE (Chapter 8) learned Gaussian policies through trajectory sampling, though it struggled with variance.

Both methods share a fundamental requirement: online access to the environment. Every policy update requires executing new actions, observing rewards, and iterating. For e-commerce search, this means:

Real user impact: Poor exploration strategies degrade user experience during learning
Revenue risk: Suboptimal boosts reduce GMV while the agent learns
Slow iteration: A/B tests run for weeks to achieve statistical power
Opportunity cost: Only one policy variant can be tested at a time in production

The offline alternative. What if we could evaluate candidate policies using only logged historical data—sessions collected from past policies, stored in a database? This is off-policy evaluation (OPE): estimating $J(\pi_e) = \mathbb{E}_{\pi_e}[G]$ (the expected return of an evaluation policy $\pi_e$) from trajectories generated by a different behavior policy $\pi_b$.

Why OPE matters:

Safety: Evaluate risky policies (e.g., aggressive discount boosts) without deploying them to real users
Efficiency: Test hundreds of candidate policies on the same logged dataset, no additional data collection required
Counterfactual reasoning: Answer "what would have happened if we'd run policy $\pi$ last month?" retroactively
Bridge to offline RL: OPE provides the value estimates needed for safe offline policy optimization (Chapter 13 preview)

The challenge: distribution shift. Logged data comes from policy $\pi_b$, but we want to estimate performance under $\pi_e$. States, actions, and rewards are distributed differently under the two policies. Naively averaging logged rewards $\frac{1}{n}\sum_{i=1}^n r_i$ answers "how did $\pi_b$ perform?", not "how would $\pi_e$ perform?"

This chapter's roadmap:

§9.1: Formalize the off-policy evaluation problem and distribution shift
§9.2: Importance sampling (IS) estimators—IPS, SNIPS, per-decision IS
§9.3: Model-based methods—Direct Method (DM) and Fitted Q Evaluation (FQE)
§9.4: Hybrid estimators—Doubly Robust (DR), SWITCH, MAGIC
§9.5: Logging protocols and propensity tracking
§9.6: Empirical evaluation, calibration, and ground truth comparison
§9.7: Theory-practice gap—when OPE fails and what frontier research says
§9.8: Production implementation of zoosim/evaluation/ope.py

Prerequisites. We assume familiarity with: - Markov Decision Processes and Bellman equations (Chapter 3) - Q-function estimation (Chapter 7) - Policy gradient trajectories (Chapter 8) - Conditional expectation and propensity scores (Chapter 2)

Let's begin.

9.1 The Off-Policy Evaluation Problem¶

9.1.1 Formal Setup¶

Definition 9.1.1 (Behavior and Evaluation Policies) {#DEF-9.1.1}

Let $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ be our search MDP (as defined in [DEF-3.4.1]). We distinguish:

Behavior policy $\pi_b: \mathcal{S} \to \Delta(\mathcal{A})$ — the policy that generated logged data
Evaluation policy $\pi_e: \mathcal{S} \to \Delta(\mathcal{A})$ — the candidate policy we wish to evaluate

Both policies are stochastic mappings from states to action distributions. The behavior policy $\pi_b$ may be: - A previous production policy - An exploration policy (e.g., ε-greedy, mixture of policies) - A human-designed baseline (e.g., static boost template from Chapter 6)

The evaluation policy $\pi_e$ may be: - A newly trained RL agent (e.g., Q-ensemble from Chapter 7, REINFORCE from Chapter 8) - A refined version of $\pi_b$ with adjusted hyperparameters - A counterfactual policy for "what-if" analysis

Definition 9.1.2 (Logged Dataset) {#DEF-9.1.2}

A logged dataset $\mathcal{D}$ is a collection of $n$ trajectories generated by behavior policy $\pi_b$:

$$ \mathcal{D} = {\tau^{(i)}}{i=1}^n, \quad \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, r_0^{(i)}, s_1^{(i)}, \ldots, s) \tag{9.1} $$}^{(i)

where each trajectory is sampled as: - $s_0 \sim \rho_0$ (initial state distribution) - $a_t \sim \pi_b(\cdot | s_t)$ (behavior policy actions) - $s_{t+1} \sim P(\cdot | s_t, a_t)$ (environment transitions) - $r_t = R(s_t, a_t, s_{t+1})$ (observed rewards)

Crucially, we observe $(s_t, a_t, r_t, s_{t+1})$ but not the counterfactual "what reward would we have received had we chosen $a' \neq a_t$?"

Definition 9.1.3 (Off-Policy Value Estimation) {#DEF-9.1.3}

The off-policy evaluation problem is to estimate the expected return of $\pi_e$ from data generated by $\pi_b$:

$$ J(\pi_e) := \mathbb{E}{\tau \sim \pi_e}\left[\sum \gamma^t r_t\right] \tag{9.2} $$}^{T

given only logged data $\mathcal{D}$ from $\pi_b$.

Notation. We write $\mathbb{E}_{\pi}[\cdot]$ to denote expectation when trajectories are sampled under policy $\pi$, including both the action distribution and the resulting state distribution induced by $\pi$ interacting with dynamics $P$.

9.1.2 Why Direct Averaging Fails: The Distribution Shift Problem¶

Naive approach. Can we simply compute: $$ \hat{J}{\text{naive}}(\pi_e) = \frac{1}{n} \sum \tag{9.3} $$}^n G^{(i)}, \quad G^{(i)} = \sum_{t=0}^{T_i} \gamma^t r_t^{(i)

where $G^{(i)}$ is the return of trajectory $i$ in the logged data?

No! This estimator is biased: $$ \mathbb{E}{\mathcal{D} \sim \pi_b}[\hat{J}] = J(\pi_b) \neq J(\pi_e) $$}

The logged data was generated by $\pi_b$, so averaging returns estimates the performance of $\pi_b$, not $\pi_e$.

Example 9.1.1 (Distribution shift in search boosts).

Consider two policies for boost selection:

$\pi_b$ (behavior): Always chooses "balanced" template: $a_b = [0.2, 0.1, 0.2, \ldots]$
$\pi_e$ (evaluation): Chooses "CM2-heavy" template: $a_e = [0.5, 0.0, 0.3, \ldots]$

Suppose the logged data $\mathcal{D}$ contains 1000 trajectories, all with $a_t = a_b$ at every timestep. The average return is: $$ \frac{1}{1000} \sum_{i=1}^{1000} G^{(i)} = 18.7 $$

This tells us that $\pi_b$ achieved return 18.7. But what about $\pi_e$? We never observed what happens when $a_e$ is selected—the counterfactual data is missing.

The fundamental problem. Logged data provides: - Factual outcomes: $(s, a, r)$ for $(s, a)$ pairs that actually occurred - Missing counterfactuals: We don't observe $r'$ for $(s, a')$ pairs that could have occurred but didn't

OPE methods bridge this gap using two strategies:

Importance sampling (§9.2): Reweight observed data to correct for distribution mismatch
Model-based methods (§9.3): Build a model $\hat{R}(s, a)$ or $\hat{Q}(s, a)$ from logged data, then simulate $\pi_e$ using the model

9.1.3 Key Assumptions for OPE¶

For OPE estimators to be valid, we require several structural assumptions. These are standard in the OPE literature [@precup:eligibility_traces:2000; @thomas:high_confidence:2015; @jiang:doubly_robust:2016].

Assumption 9.1.1 (Common Support / Overlap) {#ASSUMP-9.1.1}

For all $(s, a)$ such that $\pi_e(a|s) > 0$, we require $\pi_b(a|s) > 0$. Equivalently: $$ \text{supp}(\pi_e) \subseteq \text{supp}(\pi_b) \tag{9.4} $$

Intuition. The behavior policy must "explore" all actions that the evaluation policy would take. If $\pi_e$ selects action $a^*$ in state $s$, but $\pi_b$ never chose $a^*$ in state $s$ (i.e., $\pi_b(a^*|s) = 0$), we have no data about what happens when $(s, a^*)$ occurs. Importance sampling cannot extrapolate to zero-probability regions.

Example 9.1.2 (Common Support Violation). {#EX-9.1.2}

Consider a 2-action bandit with $\mathcal{A} = \{a_1, a_2\}$. Let: - $\pi_b$: Always choose $a_1$ (deterministic, $\pi_b(a_1|s) = 1$, $\pi_b(a_2|s) = 0$) - $\pi_e$: Always choose $a_2$ (deterministic, $\pi_e(a_2|s) = 1$)

Logged data $\mathcal{D}$ contains only $(s, a_1, r_1)$ tuples—no observations of $a_2$. The importance weight for any trajectory under $\pi_e$ is: $$ w(\tau) = \frac{\pi_e(a_2|s)}{\pi_b(a_2|s)} = \frac{1}{0} = \infty $$

The IPS estimator is undefined. Even if we "clip" the weight, we have no data about rewards for $a_2$—the estimator cannot learn anything about $\pi_e$'s performance. This is the fundamental limitation: importance sampling can reweight data, but it cannot create data from regions where $\pi_b$ never explores.

Remark 9.1.1 (Practical overlap). In production, perfect overlap is rare. Instead, we aim for soft overlap: $\pi_b(a|s) \geq \epsilon > 0$ for some small $\epsilon$ (e.g., $\epsilon = 0.01$). This is achieved via: - ε-greedy logging: With probability $\epsilon$, choose uniformly random action (ensures $\pi_b(a|s) \geq \epsilon/|\mathcal{A}|$) - Mixture policies: Mix the production policy with an exploration policy - Entropy regularization: Add entropy bonus to encourage stochastic behavior

Assumption 9.1.2 (Unconfounded Actions) {#ASSUMP-9.1.2}

Logged actions $a_t$ are sampled from the known behavior policy $\pi_b(\cdot|s_t)$, not influenced by unobserved confounders. Formally, conditional on $s_t$, the action $a_t$ is independent of potential outcomes $\{R(s_t, a, s')\}_{a \in \mathcal{A}}$.

Intuition. This is the "no unmeasured confounding" assumption from causal inference [@pearl:causality:2009]. It fails if, for example: - A human operator manually overrides the policy for "difficult" users (unobserved difficulty confounds action selection and reward) - Logged actions depend on hidden state variables not recorded in $s_t$

For our simulator (Chapter 4-5), this assumption holds by construction: $\pi_b$ is a known stochastic policy with no hidden state.

Assumption 9.1.3 (Known Propensities) {#ASSUMP-9.1.3}

The behavior policy probabilities $\pi_b(a_t | s_t)$ are either: - Known exactly (deterministic logging policy) - Logged alongside $(s, a, r)$ tuples in the dataset

Remark 9.1.2 (Propensity estimation). If propensities are not logged, we must estimate $\hat{\pi}_b(a|s)$ from data. This introduces propensity estimation error, which compounds with other sources of bias. We discuss propensity learning in §9.5.

Assumption 9.1.4 (Bounded Rewards and Finite Variance) {#ASSUMP-9.1.4}

(a) Bounded rewards: There exists $R_{\max} < \infty$ such that $|R(s, a, s')| \leq R_{\max}$ for all $(s, a, s')$.

(b) Bounded importance ratios: For finite horizon $T < \infty$, there exists $\rho_{\max} < \infty$ such that: $$ \frac{\pi_e(a|s)}{\pi_b(a|s)} \leq \rho_{\max} \quad \text{for all } (s, a) \text{ with } \pi_e(a|s) > 0 \tag{9.5} $$

Consequence. Under (a) and (b): - Returns are bounded: $|G(\tau)| \leq \frac{R_{\max}}{1-\gamma}$ - Trajectory weights are bounded: $w(\tau) = \prod_{t=0}^{T-1} \rho_t \leq \rho_{\max}^T$ - Importance-weighted returns have finite variance: $\mathbb{E}_{\pi_b}[w(\tau)^2 G(\tau)^2] \leq \rho_{\max}^{2T} \left(\frac{R_{\max}}{1-\gamma}\right)^2 < \infty$

Practical bounds. In production: - Rewards: GMV per query is capped by cart size limits ($R_{\max} \approx 500$ €), CM2 costs are finite, clicks are binary - Importance ratios: ε-greedy logging with $\epsilon = 0.05$ ensures $\pi_b(a|s) \geq \epsilon/|\mathcal{A}|$, so $\rho_{\max} \leq |\mathcal{A}|/\epsilon \approx 160$ for $|\mathcal{A}| = 8$ templates

Remark 9.1.3 (When (b) fails). Without bounded importance ratios, trajectory weights can explode. For example, if $\pi_b(a|s) = 0.001$ for some action that $\pi_e$ frequently selects, a 20-step trajectory has $w(\tau) \approx (1000)^{20} = 10^{60}$—practically infinite. §9.7.1 discusses this "curse of horizon" in detail.

9.2 Importance Sampling Estimators¶

Importance sampling (IS) is the foundational technique for OPE, dating back to [@precup:eligibility_traces:2000] in RL and earlier work in Monte Carlo statistics [@owen:monte_carlo:2013]. The core idea: reweight samples from distribution $p$ to estimate expectations under distribution $q$.

9.2.1 Importance Sampling: Foundation¶

Lemma 9.2.1 (Importance Sampling Identity) {#LEM-9.2.1}

Let $p, q$ be probability distributions on measurable space $(\mathcal{X}, \Sigma)$ with $q \ll p$ (absolutely continuous). For any measurable function $f: \mathcal{X} \to \mathbb{R}$ with $\mathbb{E}_{x \sim q}[|f(x)|] < \infty$:

$$ \mathbb{E}{x \sim q}[f(x)] = \mathbb{E} f(x)\right] \tag{9.6} $$}\left[\frac{q(x)}{p(x)

provided $p(x) > 0$ whenever $q(x) > 0$.

Proof.

For continuous distributions (discrete case is analogous): $$ \mathbb{E}{x \sim q}[f(x)] = \int f(x) q(x) \, dx = \int f(x) \frac{q(x)}{p(x)} p(x) \, dx = \mathbb{E} f(x)\right] $$}\left[\frac{q(x)}{p(x)

The ratio $w(x) := q(x)/p(x)$ is the importance weight or likelihood ratio. It corrects for the distribution mismatch between $p$ and $q$. $\square$

Intuition. Samples $x$ that are more likely under $q$ than under $p$ (i.e., $q(x) > p(x)$) get upweighted ($w(x) > 1$). Samples less likely under $q$ get downweighted ($w(x) < 1$). The weighted average over $p$-samples equals the expectation under $q$.

Remark 9.2.0 (Radon-Nikodym Connection). {#REM-9.2.0}

[LEM-9.2.1] is a direct consequence of the Radon-Nikodym theorem from measure theory. We state it explicitly for completeness:

Theorem 9.2.0 (Radon-Nikodym, Informal Statement) {#THM-9.2.0-RN}

Let $\mu, \nu$ be $\sigma$-finite measures on measurable space $(\mathcal{X}, \Sigma)$ with $\nu \ll \mu$ (i.e., $\nu$ is absolutely continuous with respect to $\mu$: for all $A \in \Sigma$, $\mu(A) = 0 \Rightarrow \nu(A) = 0$). Then there exists a non-negative measurable function $f: \mathcal{X} \to [0, \infty)$, unique $\mu$-a.e., such that: $$ \nu(A) = \int_A f \, d\mu \quad \text{for all } A \in \Sigma $$ The function $f$ is the Radon-Nikodym derivative $d\nu/d\mu$.

For a complete proof, see Chapter 2 [THM-2.3.4-RN] or [@folland:real_analysis:1999, §3.2].

If $q \ll p$ (i.e., $q$ is absolutely continuous with respect to $p$), the Radon-Nikodym derivative $dq/dp$ exists, and: $$ \mathbb{E}_q[f] = \int f \, dq = \int f \frac{dq}{dp} \, dp = \mathbb{E}_p\left[f \cdot \frac{dq}{dp}\right] $$

The importance weight $w(x) = q(x)/p(x)$ is precisely the Radon-Nikodym derivative evaluated at $x$: $$ w(x) = \frac{dq}{dp}(x) $$

This measure-theoretic foundation ensures the IS identity holds in full generality—for discrete, continuous, and mixed distributions—and connects OPE to the broader theory of change-of-measure techniques in probability (Girsanov theorem, Esscher transform, exponential families via Esscher transform).

Connection to Chapter 2. For readers who completed Chapter 2's treatment of measure theory, this is the Radon-Nikodym theorem [THM-2.3.4-RN] applied to trajectory distributions. The common support assumption [ASSUMP-9.1.1] is equivalent to absolute continuity $p_{\pi_e} \ll p_{\pi_b}$: the evaluation policy's trajectory distribution must be absolutely continuous with respect to the behavior policy's. When this fails (e.g., $\pi_e$ takes actions with zero probability under $\pi_b$), the Radon-Nikodym derivative is undefined, and importance sampling breaks down—exactly as shown in [EX-9.1.2]. You first saw a single-step, bandit-level version of IPS, clipped IPS, and SNIPS in Chapter 2; this chapter generalizes those ideas to full trajectories.

9.2.2 Importance Sampling for Trajectories¶

Applying IS to OPE. Let $\tau = (s_0, a_0, r_0, \ldots, s_T)$ be a trajectory. The trajectory distribution under policy $\pi$ is:

$$ p_\pi(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t | s_t) P(s_{t+1} | s_t, a_t) \tag{9.7} $$

This is identical to [EQ-8.4] from Chapter 8 (policy gradient theorem).

The importance weight for a trajectory $\tau$ is:

$$ w(\tau) = \frac{p_{\pi_e}(\tau)}{p_{\pi_b}(\tau)} = \frac{\prod_{t=0}^{T-1} \pi_e(a_t | s_t) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)}{\prod_{t=0}^{T-1} \pi_b(a_t | s_t) \prod_{t=0}^{T-1} P(s_{t+1}|s_t, a_t)} \tag{9.8} $$

Key observation: The dynamics $P$ cancel!

$$ w(\tau) = \prod_{t=0}^{T-1} \frac{\pi_e(a_t | s_t)}{\pi_b(a_t | s_t)} \tag{9.9} $$

This is remarkable: we can compute importance weights without knowing the environment dynamics $P$. Only the policy probabilities $\pi_e(a|s)$ and $\pi_b(a|s)$ are needed.

Definition 9.2.1 (Importance Sampling Estimator, IPS) {#DEF-9.2.1}

Given logged dataset $\mathcal{D} = \{\tau^{(i)}\}_{i=1}^n$ from behavior policy $\pi_b$, the importance sampling (IS) estimator for $J(\pi_e)$ is:

$$ \hat{J}{\text{IS}}(\pi_e) = \frac{1}{n} \sum) \tag{9.10} $$}^n w(\tau^{(i)}) G(\tau^{(i)

where $G(\tau^{(i)}) = \sum_{t=0}^{T_i} \gamma^t r_t^{(i)}$ is the observed return and $w(\tau^{(i)}) = \prod_{t=0}^{T_i-1} \frac{\pi_e(a_t^{(i)} | s_t^{(i)})}{\pi_b(a_t^{(i)} | s_t^{(i)})}$ is the trajectory importance weight.

Theorem 9.2.1 (Unbiasedness of IS Estimator) {#THM-9.2.1}

Under [ASSUMP-9.1.1] (common support) and [ASSUMP-9.1.4] (bounded rewards), the IS estimator is unbiased:

$$ \mathbb{E}{\mathcal{D} \sim \pi_b}[\hat{J}(\pi_e)] = J(\pi_e) \tag{9.11} $$}

Proof.

Integrability check. Under [ASSUMP-9.1.4(a)] (bounded rewards), returns are bounded: $$ |G(\tau)| \leq R_{\max} \sum_{t=0}^T \gamma^t \leq \frac{R_{\max}}{1-\gamma} =: G_{\max} $$ Under [ASSUMP-9.1.4(b)] (bounded importance ratios), trajectory weights are bounded: $w(\tau) \leq \rho_{\max}^T$. Therefore: $$ \mathbb{E}{\pi_b}[|w(\tau) G(\tau)|] \leq \rho < \infty $$ ensuring the expectation }^T \cdot G_{\max$\mathbb{E}_{\pi_b}[w(\tau) G(\tau)]$ is well-defined.

Main argument. Fix a single trajectory $\tau \sim \pi_b$. By [LEM-9.2.1]: $$ \mathbb{E}{\tau \sim \pi_b}[w(\tau) G(\tau)] = \mathbb{E}[G(\tau)] = J(\pi_e) $$

Averaging over $n$ i.i.d. trajectories: $$ \mathbb{E}{\mathcal{D}}[\hat{J})] = J(\pi_e) $$ }}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n w(\tau^{(i)}) G(\tau^{(i)})\right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[w(\tau^{(i)}) G(\tau^{(i)$\square$

Proposition 9.2.1 (Variance of IPS Estimator) {#PROP-9.2.1}

Under [ASSUMP-9.1.1]-[ASSUMP-9.1.4], the variance of the IPS estimator is:

$$ \text{Var}{\pi_b}[\hat{J}[w(\tau)^2 G(\tau)^2] - J(\pi_e)^2 \right) \tag{9.12} $$}}] = \frac{1}{n} \left( \mathbb{E}_{\pi_b

Proof. For i.i.d. samples, $\text{Var}[\frac{1}{n}\sum_i X_i] = \frac{1}{n}\text{Var}[X_1]$. With $X = w(\tau) G(\tau)$: $$ \text{Var}{\pi_b}[w(\tau) G(\tau)] = \mathbb{E}}[w(\tau)^2 G(\tau)^2] - (\mathbb{E{\pi_b}[w(\tau) G(\tau)])^2 = \mathbb{E}[w(\tau)^2 G(\tau)^2] - J(\pi_e)^2 $$ $\square$

Remark 9.2.0.1 (Second moment explosion). The critical term is $\mathbb{E}_{\pi_b}[w(\tau)^2]$. Under [ASSUMP-9.1.4(b)], this is bounded by $\rho_{\max}^{2T}$. For $\rho_{\max} = 10$ and $T = 20$: $$ \mathbb{E}_{\pi_b}[w(\tau)^2] \leq 10^{40} $$

Even if returns $G$ are bounded, variance grows exponentially in horizon $T$. This is the fundamental limitation of trajectory-level importance sampling.

The variance problem. While unbiased, the IS estimator can have extremely high variance. Consider:

If $\pi_e$ and $\pi_b$ are very different, some trajectory weights $w(\tau)$ become enormous
Trajectory weights are products: $w(\tau) = \prod_t \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}$
For episode length $T$, even moderate per-step ratios compound exponentially: $(1.5)^{50} \approx 6.4 \times 10^8$

Example 9.2.1 (Weight explosion).

Suppose $\pi_b$ is ε-greedy with $\epsilon = 0.1$ over $|\mathcal{A}| = 8$ templates. On each timestep, the greedy action has probability $\pi_b(a_{\text{greedy}}|s) = 0.9 + 0.1/8 \approx 0.9125$, and non-greedy actions have $\pi_b(a|s) = 0.1/8 = 0.0125$.

Now consider $\pi_e$ deterministic (always selects the greedy action). For a trajectory where $\pi_e$ agrees with $\pi_b$'s greedy choice at all $T = 20$ timesteps:

\[ w(\tau) = \left(\frac{1.0}{0.9125}\right)^{20} \approx 9.3 \]

But for a trajectory where $\pi_e$ disagrees at just 5 timesteps (selects an action that $\pi_b$ chose with probability 0.0125):

\[ w(\tau) = \left(\frac{1.0}{0.0125}\right)^5 \left(\frac{1.0}{0.9125}\right)^{15} \approx 3.3 \times 10^9 \]

A single unlucky trajectory can dominate the average, yielding huge variance.

9.2.3 Self-Normalized Importance Sampling (SNIPS)¶

The self-normalized IS estimator trades a small bias for dramatic variance reduction [@precup:eligibility_traces:2000; @thomas:high_confidence:2015].

Definition 9.2.2 (Self-Normalized IS, SNIPS) {#DEF-9.2.2}

$$ \hat{J}{\text{SNIPS}}(\pi_e) = \frac{\sum \tag{9.13} $$}^n w(\tau^{(i)}) G(\tau^{(i)})}{\sum_{i=1}^n w(\tau^{(i)})

Instead of $\frac{1}{n}$, we normalize by the sum of weights.

Why this helps. If one trajectory has weight $w = 10^9$ but return $G = 0.5$, its contribution to the numerator is $10^9 \cdot 0.5$, but the denominator grows by $10^9$ as well. The ratio is $\frac{10^9 \cdot 0.5}{10^9 + \sum_{j \neq i} w_j} \approx 0.5$ when $\sum_{j \neq i} w_j \ll 10^9$. Extreme weights have less leverage.

Theorem 9.2.2 (Consistency of SNIPS) {#THM-9.2.2}

Under [ASSUMP-9.1.1]-[ASSUMP-9.1.4], as $n \to \infty$:

$$ \hat{J}_{\text{SNIPS}}(\pi_e) \xrightarrow{p} J(\pi_e) \tag{9.14} $$

(convergence in probability). The estimator is consistent but biased for finite $n$.

Proof.

We apply the Law of Large Numbers (LLN) to both numerator and denominator, but first we must verify integrability.

Step 1 (Integrability of importance weights). Under [ASSUMP-9.1.1] (common support), there exists $\epsilon > 0$ such that $\pi_b(a|s) \geq \epsilon$ for all $(s, a)$ with $\pi_e(a|s) > 0$. Therefore, the per-step importance ratio is bounded: $$ \rho_t = \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)} \leq \frac{1}{\epsilon} =: \rho_{\max} $$

For finite horizon $T < \infty$, the trajectory importance weight satisfies: $$ w(\tau) = \prod_{t=0}^{T-1} \rho_t \leq \rho_{\max}^T < \infty $$

Hence $\mathbb{E}_{\pi_b}[w(\tau)^2] \leq \rho_{\max}^{2T} < \infty$, ensuring finite variance.

Step 2 (Integrability of weighted returns). Under [ASSUMP-9.1.4] (bounded rewards), $|r_t| \leq R_{\max}$, so: $$ |G(\tau)| = \left|\sum_{t=0}^T \gamma^t r_t\right| \leq R_{\max} \sum_{t=0}^T \gamma^t \leq \frac{R_{\max}}{1-\gamma} =: G_{\max} $$

Therefore: $$ \mathbb{E}{\pi_b}[|w(\tau) G(\tau)|] \leq \rho < \infty $$}^T \cdot G_{\max

This ensures the LLN applies to both $\{w(\tau^{(i)}) G(\tau^{(i)})\}$ and $\{w(\tau^{(i)})\}$.

Step 3 (Apply LLN). By the Strong Law of Large Numbers for i.i.d. sequences with finite first moment: $$ \frac{1}{n}\sum_{i=1}^n w(\tau^{(i)}) G(\tau^{(i)}) \xrightarrow{a.s.} \mathbb{E}{\pi_b}[w(\tau) G(\tau)] = J(\pi_e) $$ $$ \frac{1}{n}\sum[w(\tau)] = 1 $$}^n w(\tau^{(i)}) \xrightarrow{a.s.} \mathbb{E}_{\pi_b

The second equality follows from normalization: $\mathbb{E}_{\pi_b}[w(\tau)] = \mathbb{E}_{\pi_b}\left[\frac{p_{\pi_e}(\tau)}{p_{\pi_b}(\tau)}\right] = \int p_{\pi_e}(\tau) \, d\tau = 1$.

Step 4 (Continuous mapping). Since $\frac{1}{n}\sum w_i \to 1 > 0$ almost surely, by the continuous mapping theorem: $$ \hat{J}_{\text{SNIPS}} = \frac{\frac{1}{n}\sum w_i G_i}{\frac{1}{n}\sum w_i} \xrightarrow{a.s.} \frac{J(\pi_e)}{1} = J(\pi_e) $$

Almost sure convergence implies convergence in probability. $\square$

Remark 9.2.1.1 (Necessity of bounded importance weights). The integrability argument in Step 1 is essential. If $\pi_b(a|s)$ can be arbitrarily small (e.g., $\pi_b(a|s) \to 0$ in some states), then $w(\tau)$ can be unbounded, violating the LLN's moment condition. Practical recommendation: use $\epsilon$-greedy logging (§9.5.1) with $\epsilon \geq 0.01$ to ensure bounded weights.

Remark 9.2.1 (Bias-variance trade-off). SNIPS is biased for finite $n$ (the ratio of two random variables is not generally unbiased), but the bias vanishes as $n \to \infty$. Empirically, SNIPS often has lower mean squared error than IPS for $n < 10{,}000$ [@thomas:high_confidence:2015].

Remark 9.2.2 (SNIPS Bias Characterization). {#REM-9.2.2}

The bias of SNIPS is $O(1/n)$. More precisely, under regularity conditions (finite second moments), a Taylor expansion of the ratio estimator yields: $$ \text{Bias}[\hat{J}{\text{SNIPS}}] = \frac{-\text{Cov}}(w(\tau), w(\tau) G(\tau))}{n \cdot (\mathbb{E{\pi_b}[w(\tau)])^2} + O(n^{-2}) $$ Since $\mathbb{E}_{\pi_b}[w(\tau)] = 1$ (normalization), the leading bias term depends on the covariance between weights and weighted returns: $$ \text{Bias}[\hat{J} $$ When weights are positively correlated with weighted returns (e.g., high-weight trajectories tend to have high returns), SNIPS underestimates the true value. Conversely, negative correlation leads to overestimation. This characterization is useful for diagnostic purposes: if weight-return correlation is strong, SNIPS bias may be non-negligible even for moderate }}] \approx -\frac{\text{Cov}_{\pi_b}(w(\tau), w(\tau) G(\tau))}{n$n$. See [@owen:monte_carlo:2013, §9.4] for the full derivation.

Definition 9.2.4 (Effective Sample Size, Kish 1965) {#DEF-9.2.4}

For importance-weighted estimators with weights $w_1, \ldots, w_n$, the effective sample size (ESS) is:

$$ n_{\text{eff}} := \frac{\left(\sum_{i=1}^n w_i\right)^2}{\sum_{i=1}^n w_i^2} \tag{9.15} $$

Properties:

Equal weights: When $w_i = c$ for all $i$ (no distribution shift), $n_{\text{eff}} = \frac{(nc)^2}{nc^2} = n$ (full sample used)
Single dominant weight: When $w_j \gg \sum_{i \neq j} w_i$ (one trajectory dominates), $n_{\text{eff}} \approx 1$ (effectively one data point)
Diagnostic threshold: In practice, $n_{\text{eff}} / n < 0.1$ indicates severe distribution shift—OPE estimates are unreliable

Derivation. ESS arises from matching the variance of a weighted estimator to that of an unweighted estimator with fewer samples. For self-normalized IS, $\text{Var}[\hat{J}_{\text{SNIPS}}] \approx \frac{\sigma^2}{n_{\text{eff}}}$ where $\sigma^2$ is the variance of returns. The formula follows from Taylor expansion of the ratio of random variables; see [@kish:survey_sampling:1965; @owen:monte_carlo:2013, §9.3].

Implementation note. Our IPS diagnostics (§9.8.2) compute ESS as:

effective_n = (np.sum(weights) ** 2) / np.sum(weights ** 2)

This appears in the ips_estimate() return diagnostics as "effective_sample_size".

9.2.4 Per-Decision Importance Sampling (PDIS)¶

Both IPS and SNIPS use trajectory-level weights $w(\tau) = \prod_t \rho_t$, which suffer from exponential growth in trajectory length. Per-decision IS (PDIS) [@precup:eligibility_traces:2000] exploits the sequential structure of MDPs to reduce variance.

Key idea. Instead of weighting entire trajectories, we weight each time step's contribution separately using the partial importance weight up to that step.

Definition 9.2.3 (Per-Decision Importance Sampling) {#DEF-9.2.3}

$$ \hat{J}{\text{PDIS}}(\pi_e) = \frac{1}{n} \sum \tag{9.16} $$}^n \sum_{t=0}^{T_i} \gamma^t \left(\prod_{k=0}^{t} \rho_k^{(i)}\right) r_t^{(i)

where $\rho_t = \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}$ is the per-step importance weight.

Intuition. Reward $r_t$ at timestep $t$ depends directly on the action $a_t$ taken at that timestep (and the resulting transition $s_{t+1}$). To correct the distribution of the state-action pair $(s_t, a_t)$ under $\pi_b$ to that under $\pi_e$, we need importance weights for all actions up to and including $a_t$: the product $\prod_{k=0}^{t} \rho_k$. Crucially, $r_t$ does not depend on future actions $a_{t+1}, \ldots, a_T$, so we don't need weights beyond time $t$. This means weights don't compound over the full trajectory length—only up to each reward's timestep.

Theorem 9.2.3 (Unbiasedness of PDIS) {#THM-9.2.3}

Under [ASSUMP-9.1.1] and [ASSUMP-9.1.4], PDIS is unbiased:

$$ \mathbb{E}{\mathcal{D} \sim \pi_b}[\hat{J}(\pi_e)] = J(\pi_e) \tag{9.17} $$}

Proof.

We prove that for each timestep $t$: $$ \mathbb{E}{\tau \sim \pi_b}\left[\left(\prod[r_t] \tag{*} $$}^t \rho_k\right) r_t\right] = \mathbb{E}_{\tau \sim \pi_e

The proof proceeds by backward induction on the action indices, using the tower property.

Step 1 (Setup). Fix timestep $t$ and condition on the history $\mathcal{H}_t := (s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t)$. The reward $r_t = R(s_t, a_t, s_{t+1})$ depends on $a_t$ and $s_{t+1}$, which is sampled given $(s_t, a_t)$.

Step 2 (Inner expectation over $a_t$). Conditioned on $\mathcal{H}_t = (s_0, a_0, \ldots, s_{t-1}, a_{t-1}, s_t)$, the ratios $\rho_0, \ldots, \rho_{t-1}$ are $\sigma(\mathcal{H}_t)$-measurable—they are determined by the history and hence constant with respect to the remaining randomness. Only $\rho_t = \pi_e(a_t|s_t)/\pi_b(a_t|s_t)$ remains random (depends on action $a_t \sim \pi_b(\cdot|s_t)$ yet to be sampled). By direct computation: $$ \mathbb{E}{a_t \sim \pi_b(\cdot|s_t)}[\rho_t] = \sum \pi_e(a_t|s_t) = 1 $$} \pi_b(a_t|s_t) \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)} = \sum_{a_t

Since $\prod_{k=0}^{t-1} \rho_k$ is $\sigma(\mathcal{H}_t)$-measurable (a constant given the conditioning), it factors out: $$ \mathbb{E}\left[\prod_{k=0}^{t} \rho_k \,\Big|\, \mathcal{H}t\right] = \left(\prod \rho_k $$}^{t-1} \rho_k\right) \cdot \mathbb{E}[\rho_t | s_t] = \prod_{k=0}^{t-1

Step 3 (Backward induction). Apply Step 2 iteratively using the tower property: - Conditioned on $\mathcal{H}_{t-1} = (s_0, a_0, \ldots, s_{t-1})$, integrating out $a_{t-1}$: $\mathbb{E}[\prod_{k=0}^{t-1} \rho_k | \mathcal{H}_{t-1}] = \prod_{k=0}^{t-2} \rho_k$ - Continue backward until conditioned on $(s_0)$, integrating out $a_0$: $\mathbb{E}[\rho_0 | s_0] = 1$

After $t+1$ applications of the tower property: $$ \mathbb{E}{\pi_b}\left[\prod^t \rho_k \,\Big|\, s_0, \ldots, s_t\right] = 1 $$

Step 4 (Change of measure). The key insight: the expectation $\mathbb{E}_{\pi_b}[\cdot]$ over actions, reweighted by $\prod_{k=0}^t \rho_k$, becomes the expectation $\mathbb{E}_{\pi_e}[\cdot]$ over actions: $$ \mathbb{E}{\tau \sim \pi_b}\left[\left(\prod[f(s_0, a_0, \ldots, s_t, a_t)] $$}^t \rho_k\right) f(s_0, a_0, \ldots, s_t, a_t)\right] = \mathbb{E}_{\tau \sim \pi_e

for any measurable function $f$. Applying this to $f = r_t = R(s_t, a_t, s_{t+1})$: $$ \mathbb{E}{\tau \sim \pi_b}\left[\left(\prod[r_t] $$}^t \rho_k\right) r_t\right] = \mathbb{E}_{\tau \sim \pi_e

Step 5 (Sum over timesteps). Since (*) holds for each $t$: $$ \mathbb{E}{\pi_b}[\hat{J}[r_t] = J(\pi_e) $$ }}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{t=0}^{T_i} \gamma^t \left(\prod_{k=0}^{t} \rho_k\right) r_t\right] = \sum_{t=0}^{T} \gamma^t \mathbb{E}_{\pi_e$\square$

Variance comparison. PDIS has lower variance than IPS when $T$ is large [@thomas:high_confidence:2015]. The weights $\prod_{k=0}^t \rho_k$ grow with $t$, but not as aggressively as full-trajectory weights $\prod_{k=0}^T \rho_k$.

Practical recommendation. Use PDIS or weighted PDIS (self-normalized version) as the default IS estimator for long-horizon tasks (§9.6).

9.3 Model-Based Methods¶

Importance sampling requires no model of the environment, but suffers from high variance when $\pi_e$ and $\pi_b$ differ significantly. Model-based OPE takes the opposite approach: fit a model of the reward function or value function from logged data, then use the model to estimate $J(\pi_e)$.

9.3.1 Direct Method (DM)¶

The Direct Method (DM) estimates the value function $V^\pi(s)$ directly from logged data, then averages over initial states.

Definition 9.3.1 (Direct Method Estimator) {#DEF-9.3.1}

Let $\hat{V}^{\pi_e}(s)$ be an estimate of the state-value function under $\pi_e$ (e.g., from regression on logged data). The DM estimator is:

$$ \hat{J}{\text{DM}}(\pi_e) = \mathbb{E}(s_0)] \tag{9.18} $$}[\hat{V}^{\pi_e

In practice, we approximate this as: $$ \hat{J}{\text{DM}}(\pi_e) = \frac{1}{n} \sum) \tag{9.19} $$}^n \hat{V}^{\pi_e}(s_0^{(i)

where $\{s_0^{(i)}\}_{i=1}^n$ are initial states from logged trajectories.

How do we get $\hat{V}^{\pi_e}(s)$? Two main approaches:

Fitted Q Evaluation (§9.3.2): Fit $\hat{Q}(s,a)$ from logged transitions, then compute $\hat{V}^\pi(s) = \sum_a \pi(a|s) \hat{Q}(s,a)$
Monte Carlo rollouts: Simulate trajectories under $\pi_e$ using a learned model $\hat{P}(s'|s,a)$ and $\hat{R}(s,a,s')$

Bias-variance trade-off. DM has: - Low variance: Averaged over many initial states, no importance weights - High bias: If $\hat{V}^\pi$ is misspecified (wrong model class, insufficient data, extrapolation error), the estimate is systematically wrong

When DM works. If the model class is rich and training data covers all $(s, a)$ pairs well, DM can be highly accurate. But in practice, model misspecification is the norm—logged data comes from $\pi_b$, so states visited by $\pi_e$ may be poorly covered (distribution shift).

9.3.2 Fitted Q Evaluation (FQE)¶

Fitted Q Evaluation [@le:batch_policy_learning:2019; @paine:hyperparameter_selection:2020] is the workhorse model-based OPE method. It fits a Q-function $\hat{Q}^\pi(s,a)$ to logged data by regressing on Bellman targets.

The FQE Algorithm

Input: Logged dataset $\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}_{i=1}^m$ (transitions), evaluation policy $\pi_e$, discount $\gamma$

Goal: Estimate $Q^{\pi_e}(s, a)$

By analogy with Chapter 3, the Q-function satisfies the Bellman expectation equation: $$ Q^{\pi}(s, a) = \mathbb{E}{s' \sim P(\cdot|s,a)}\left[r + \gamma \mathbb{E}[Q^\pi(s', a')]\right] \tag{9.20} $$

This is the Q-function analogue of Chapter 3's fixed-point identity [EQ-3.8] from [THM-3.5.1-Bellman] (Bellman Expectation Equation, ch03_stochastic_processes_bellman_foundations.md).

FQE procedure:

Initialize $\hat{Q}_0(s, a)$ arbitrarily (e.g., zeros or supervised regression on logged rewards)
Iterate for $k = 1, 2, \ldots, K$:

a. Construct targets: For each $(s_i, a_i, r_i, s_i') \in \mathcal{D}$: $$ y_i = r_i + \gamma \sum_{a'} \pi_e(a' | s_i') \hat{Q}_{k-1}(s_i', a') \tag{9.21} $$

(For continuous actions, replace sum with integral or Monte Carlo sample.)

b. Regression: Fit $\hat{Q}_k$ to minimize: $$ \mathcal{L}k = \sum_k(s_i, a_i) - y_i\right)^2 \tag{9.22} $$}^m \left(\hat{Q

Return $\hat{Q}_K$ as the final estimate

After convergence (or $K$ iterations), compute value estimate: $$ \hat{J}{\text{FQE}}(\pi_e) = \frac{1}{n} \sum, a) \tag{9.23} $$}^n \sum_a \pi_e(a | s_0^{(i)}) \hat{Q}_K(s_0^{(i)

where $\{s_0^{(i)}\}$ are initial states from logged trajectories.

Connection to Chapter 3. FQE is policy evaluation via Bellman iteration (analogous to the fixed-point characterization [EQ-3.8] from Chapter 3) applied to a fixed policy $\pi_e$ using logged data instead of exact transitions. The policy evaluation Bellman operator from [THM-3.5.1-Bellman] becomes:

$$ (\mathcal{T}^{\pi_e} Q)(s, a) = \mathbb{E}{s' \sim P(\cdot|s,a)}\left[R(s,a,s') + \gamma \mathbb{E}[Q(s', a')]\right] \tag{9.24} $$

FQE approximates this operator using empirical samples from $\mathcal{D}$.

Code ↔ Theory (Bellman Operator)

[EQ-9.24] is the policy evaluation Bellman operator from [THM-3.5.1-Bellman] (Bellman Expectation Equation, ch03_stochastic_processes_bellman_foundations.md)
Chapter 3's $\mathcal{T}^\pi V = R + \gamma P^\pi V$ (linear operator, [EQ-3.7]) is the value function version
FQE implements the action-value version $\mathcal{T}^\pi Q$ using fitted regression
FQE implementation: zoosim/evaluation/ope.py:530-561 (target computation + regression)
Bellman residual tracking: diagnostics["final_loss"] at line 557
Connection: EQ-9.24 depends_on THM-3.5.1-Bellman (Q-function analogue of the Bellman expectation equation)

Note: Chapter 3's canonical file ch03_stochastic_processes_bellman_foundations.md contains the Bellman equations [THM-3.5.1-Bellman] and the Banach fixed-point theorem [THM-3.6.2-Banach] that justify contraction-based convergence intuition behind FQE. Regret guarantees for bandit exploration live in Chapter 6 ([THM-6.1], [THM-6.2]) and information-theoretic lower bounds live in Appendix D ([THM-D.3.1]). Constrained optimization / CMDP duality and primal-dual methods live in Appendix C ([THM-C.2.1], [COR-C.3.1], [ALG-C.5.1]).

Why FQE works. Under correct model specification and sufficient data coverage, FQE converges to the true $Q^{\pi_e}$. The relevant Bellman operator is a $\gamma$-contraction in the supremum norm (Chapter 3, [THM-3.7.1]), and Banach's fixed-point theorem then yields convergence under exact iteration (Chapter 3, [THM-3.6.2-Banach]): $\|\hat{Q}_k - Q^{\pi_e}\|_\infty \leq \gamma^k \|\hat{Q}_0 - Q^{\pi_e}\|_\infty \to 0$ as $k \to \infty$.

Caveat (Function Approximation). The contraction property [THM-3.7.1] holds in the $\|\cdot\|_\infty$ (supremum) norm over the full state-action space. With function approximation (neural networks, linear models), FQE minimizes error in a weighted empirical norm determined by the data distribution $d_{\pi_b}$, not the $\|\cdot\|_\infty$ norm. Under distribution mismatch between $\pi_b$ and $\pi_e$, FQE can diverge—this is the "deadly triad" of off-policy learning + function approximation + bootstrapping [@sutton:barto:2018, §11.3]. Convergence guarantees require additional conditions: compatible function approximation [@tsitsiklis:analysis_td:1997], gradient correction methods [@sutton:fast_gradient:2009], or density ratio estimation [@liu:marginal_importance_sampling:2018].

When FQE fails.

Distribution shift: Logged data comes from $\pi_b$, so transitions $(s, a, r, s')$ concentrate on states visited by $\pi_b$. If $\pi_e$ visits different states, $\hat{Q}$ must extrapolate, which is unreliable.
Function approximation error: If the model class (e.g., neural network architecture) cannot represent $Q^{\pi_e}$ exactly, FQE converges to an approximate solution with persistent bias.
Compounding errors: Each iteration uses the previous estimate $\hat{Q}_{k-1}$ to construct targets. Errors accumulate over iterations.

Theory-Practice Gap (§9.7 will expand this). Despite these issues, FQE often performs well in practice when: - Model class is expressive (deep networks) - Data is abundant ($m \gg |\mathcal{S}| \times |\mathcal{A}|$) - Policies $\pi_e$ and $\pi_b$ are not too different (soft distribution shift)

9.4 Hybrid Estimators: Doubly Robust and Beyond¶

Importance sampling has low bias but high variance. Model-based methods have low variance but high bias (from misspecification). Hybrid estimators combine both to achieve robustness: correct if either the model is accurate or propensities are correct.

9.4.1 Doubly Robust (DR) Estimation¶

The Doubly Robust (DR) estimator is one of the most important tools in OPE [@jiang:doubly_robust:2016; @thomas:high_confidence:2015].

Definition 9.4.1 (Doubly Robust Estimator) {#DEF-9.4.1}

$$ \hat{J}{\text{DR}}(\pi_e) = \frac{1}{n} \sum)\right)\right] \tag{9.25} $$}^n \left[\hat{V}^{\pi_e}(s_0^{(i)}) + \sum_{t=0}^{T_i} \gamma^t \left(\prod_{k=0}^t \rho_k^{(i)}\right) \left(r_t^{(i)} + \gamma \hat{V}^{\pi_e}(s_{t+1}^{(i)}) - \hat{Q}^{\pi_e}(s_t^{(i)}, a_t^{(i)

where: - $\hat{V}^{\pi_e}(s) = \sum_a \pi_e(a|s) \hat{Q}^{\pi_e}(s, a)$ is the model-based value estimate - $\rho_t = \frac{\pi_e(a_t|s_t)}{\pi_b(a_t|s_t)}$ is the per-step importance weight - The term $r_t + \gamma \hat{V}^{\pi_e}(s_{t+1}) - \hat{Q}^{\pi_e}(s_t, a_t)$ is the TD error (temporal difference residual)

Intuition. The DR estimator has two components:

Model-based baseline: $\hat{V}^{\pi_e}(s_0)$
Importance-weighted correction: $\sum_t \left(\prod_{k=0}^t \rho_k\right) \delta_t$

where $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)$ is the TD error.

If the model $\hat{Q}$ is perfect ($\hat{Q} = Q^{\pi_e}$), the TD errors are zero in expectation, and the correction term vanishes—we get the model-based estimate. If the model is wrong but importance weights are correct, the weighted TD errors correct the bias.

Theorem 9.4.1 (Doubly Robust Unbiasedness) {#THM-9.4.1}

The DR estimator is unbiased if either:

The model is correct: $\hat{Q}^{\pi_e} = Q^{\pi_e}$ and $\hat{V}^{\pi_e} = V^{\pi_e}$, OR
The importance weights are correct and [ASSUMP-9.1.1] holds

In either case: $$ \mathbb{E}{\mathcal{D} \sim \pi_b}[\hat{J}(\pi_e)] = J(\pi_e) \tag{9.26} $$}

Proof.

Notation for this proof

$\hat{V}(s) = \sum_a \pi_e(a|s) \hat{Q}(s,a)$ — model-based value estimate (potentially incorrect)
$V^{\pi_e}(s)$, $Q^{\pi_e}(s,a)$ — true value functions under $\pi_e$ (unknown)
$\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)$ — TD error using model estimates
$\rho_{0:t} = \prod_{k=0}^t \rho_k$ — cumulative importance weight through timestep $t$

Case 1 (model correct): If $\hat{Q} = Q^{\pi_e}$ and $\hat{V} = V^{\pi_e}$, the TD error satisfies: $$ r_t + \gamma V^{\pi_e}(s_{t+1}) - Q^{\pi_e}(s_t, a_t) = r_t + \gamma V^{\pi_e}(s_{t+1}) - \mathbb{E}[r_t + \gamma V^{\pi_e}(s_{t+1}) | s_t, a_t] = \text{noise} $$

(This follows from the Bellman equation [EQ-9.17].) The importance-weighted sum of zero-mean noise has expectation zero, so: $$ \mathbb{E}{\mathcal{D}}[\hat{J}(s_0) = J(\pi_e) $$}}] = \mathbb{E}[\hat{V}^{\pi_e}(s_0)] = V^{\pi_e

Case 2 (importance weights correct): We show that when importance weights are exact, the DR estimator equals the IPS estimator in expectation, which we already proved is unbiased ([THM-9.2.1]).

Step 2a (Change of measure). Consider a single trajectory $\tau \sim \pi_b$. The DR estimator contribution is: $$ \hat{V}(s_0) + \sum_{t=0}^{T} \gamma^t \left(\prod_{k=0}^t \rho_k\right) \delta_t $$ where $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)$.

Measurability check: The TD error $\delta_t$ is $\mathcal{F}_{t+1}$-measurable (depends on $s_t, a_t, r_t, s_{t+1}$), and the cumulative weight $\prod_{k=0}^t \rho_k$ is $\mathcal{F}_t$-measurable, so the product $(\prod_{k=0}^t \rho_k) \delta_t$ is $\mathcal{F}_{t+1}$-measurable. This ensures the change-of-measure identity [LEM-9.2.1] applies.

Taking expectation over $\pi_b$ and using the importance sampling identity [LEM-9.2.1]: $$ \mathbb{E}{\tau \sim \pi_b}\left[\left(\prod[\delta_t] $$}^t \rho_k\right) \delta_t\right] = \mathbb{E}_{\tau \sim \pi_e

Step 2b (Telescoping sum). We now compute $\mathbb{E}_{\pi_e}[\sum_{t=0}^T \gamma^t \delta_t]$. First observe that: $$ \mathbb{E}{\pi_e}\left[\hat{Q}(s_t, a_t) \,\big|\, s_t\right] = \sum(s_t) $$ by definition of } \pi_e(a|s_t) \hat{Q}(s_t, a) = \hat{V$\hat{V}(s) = \sum_a \pi_e(a|s) \hat{Q}(s, a)$.

Therefore, using the tower property and iterated expectations: $$ \mathbb{E}{\pi_e}\left[\sum}^T \gamma^t \delta_t\right] = \mathbb{E{\pi_e}\left[\sum(s_t, a_t)\right)\right] $$ $$ = \mathbb{E}}^T \gamma^t \left(r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q{\pi_e}\left[\sum}^T \gamma^t r_t\right] + \mathbb{E{\pi_e}\left[\sum}^T \gamma^{t+1} \hat{V}(s_{t+1})\right] - \mathbb{E{\pi_e}\left[\sum(s_t, a_t)\right] $$}^T \gamma^t \hat{Q

For the last two terms, use $\mathbb{E}_{\pi_e}[\hat{Q}(s_t, a_t) | s_t] = \hat{V}(s_t)$: $$ \mathbb{E}{\pi_e}\left[\sum}^T \gamma^t \hat{Q}(s_t, a_t)\right] = \mathbb{E{\pi_e}\left[\sum(s_t)\right] $$}^T \gamma^t \hat{V

Substituting back and reindexing (let $u = t+1$ in the middle sum): $$ \mathbb{E}{\pi_e}\left[\sum}^T \gamma^t \delta_t\right] = J(\pi_e) + \mathbb{E{\pi_e}\left[\sum(s_t)\right] $$}^{T+1} \gamma^u \hat{V}(s_u) - \sum_{t=0}^T \gamma^t \hat{V

Step 2c (Telescoping cancellation). The sums telescope: $$ \sum_{u=1}^{T+1} \gamma^u \hat{V}(s_u) - \sum_{t=0}^T \gamma^t \hat{V}(s_t) = \gamma^{T+1} \hat{V}(s_{T+1}) - \hat{V}(s_0) $$

For finite-horizon episodic tasks where $s_{T+1}$ is terminal, $\hat{V}(s_{T+1}) = 0$. Therefore: $$ \mathbb{E}{\pi_e}\left[\sum(s_0)] $$}^T \gamma^t \delta_t\right] = J(\pi_e) - \mathbb{E}_{\pi_e}[\hat{V

Step 2d (Final assembly). The expected DR estimator is: $$ \mathbb{E}{\pi_b}[\hat{J}}}] = \mathbb{E{s_0}[\hat{V}(s_0)] + \mathbb{E}(s_0)] = J(\pi_e) $$}\left[\sum_{t=0}^T \gamma^t \delta_t\right] = \mathbb{E}[\hat{V}(s_0)] + J(\pi_e) - \mathbb{E}[\hat{V

The model terms cancel exactly, leaving the true expected return. $\square$

Remark 9.4.1 (The name "doubly robust"). The estimator is robust to misspecification of either the model or the propensities, but not both simultaneously. If both are wrong, DR can be biased.

Variance of DR. When the model $\hat{Q}$ is approximately correct (even if not exact), the TD errors $\delta_t$ are small, so the variance of the correction term is much lower than pure importance sampling. This is the key practical advantage: DR combines the bias robustness of IS with the variance reduction of model-based methods.

9.4.2 Weighted Doubly Robust (WDR)¶

Just as SNIPS improves IPS via self-normalization, Weighted Doubly Robust (WDR) improves DR.

Definition 9.4.2 (Weighted Doubly Robust) {#DEF-9.4.2}

$$ \hat{J}{\text{WDR}}(\pi_e) = \frac{\sum \tag{9.27} $$}^n w_i \left[\hat{V}(s_0^{(i)}) + \sum_t \gamma^t \rho_{0:t}^{(i)} \delta_t^{(i)}\right]}{\sum_{i=1}^n w_i

where $w_i = \prod_{t=0}^{T_i} \rho_t^{(i)}$ are trajectory-level weights, $\rho_{0:t} = \prod_{k=0}^t \rho_k$, and $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{Q}(s_t, a_t)$.

WDR reduces variance at the cost of small bias (asymptotically consistent, like SNIPS).

9.4.3 SWITCH Estimator¶

DR and WDR always use both IS and model components. SWITCH [@thomas:high_confidence:2015] adaptively selects between IS and DM based on estimated variance.

Definition 9.4.3 (SWITCH Estimator) {#DEF-9.4.3}

For each trajectory $i$, define a threshold $\lambda > 0$ (hyperparameter). Compute:

$$ \hat{J}{\text{SWITCH}}(\pi_e) = \frac{1}{n} \sum w_i G_i & \text{if } w_i \leq \lambda \ \hat{V}^{\pi_e}(s_0^{(i)}) & \text{otherwise} \end{cases} \tag{9.28} $$}^n \begin{cases

Intuition. When the importance weight $w_i$ is small ($\leq \lambda$), use IS (low variance). When $w_i$ is large (high variance risk), fall back to the model estimate.

Hyperparameter selection. $\lambda$ trades off bias and variance: - Small $\lambda$: Mostly use DM (low variance, high bias from model error) - Large $\lambda$: Mostly use IS (high variance, low bias)

In practice, $\lambda$ is tuned via cross-validation or by minimizing estimated MSE [@thomas:high_confidence:2015].

9.4.4 MAGIC Estimator¶

MAGIC (More Adaptive, Generalizable Importance-weighted Continuous estimator) [@thomas:data_efficient_policy_evaluation:2016] is a generalization of SWITCH that smoothly interpolates between IS and DM.

Definition 9.4.4 (MAGIC Estimator, Simplified) {#DEF-9.4.4}

$$ \hat{J}{\text{MAGIC}}(\pi_e) = \frac{1}{n} \sum)\right] \tag{9.29} $$}^n \left[\alpha_i w_i G_i + (1 - \alpha_i) \hat{V}^{\pi_e}(s_0^{(i)

where $\alpha_i \in [0, 1]$ is an adaptive weight that depends on estimated variance and model confidence.

Details. MAGIC computes $\alpha_i$ by minimizing a variance estimator subject to constraints on bias. The full formulation is technical (see [@thomas:data_efficient_policy_evaluation:2016]); the key idea is: - High-confidence trajectories (low $w_i$, model agrees with IS) get $\alpha_i \approx 1$ (trust IS) - Low-confidence trajectories (high $w_i$, model disagrees) get $\alpha_i \approx 0$ (trust model)

Practical use. MAGIC often achieves the best MSE among OPE estimators on benchmark tasks [@voloshin:empirical_study:2021], but requires careful implementation and hyperparameter tuning.

9.5 Logging Protocols and Propensity Tracking¶

For importance sampling estimators to work, we need known propensities $\pi_b(a|s)$ ([ASSUMP-9.1.3]). This section discusses how to collect data with tracked propensities.

9.5.1 ε-Greedy Logging¶

The simplest logging strategy is ε-greedy:

With probability $1 - \epsilon$, execute the production policy $\pi_{\text{prod}}(a|s)$
With probability $\epsilon$, choose uniformly random action $a \sim \text{Uniform}(\mathcal{A})$

The logging policy is: $$ \pi_b(a|s) = (1 - \epsilon) \pi_{\text{prod}}(a|s) + \epsilon \frac{1}{|\mathcal{A}|} \tag{9.30} $$

Advantages: - Known propensities: $\pi_b(a|s)$ is computable from $\pi_{\text{prod}}$ (which is known) - Exploration guarantee: $\pi_b(a|s) \geq \epsilon / |\mathcal{A}| > 0$ for all $(s, a)$, ensuring [ASSUMP-9.1.1] (common support)

Disadvantages: - Uniform exploration is inefficient: Random actions may violate constraints (e.g., CM2 floors) or be obviously bad - Cost: The fraction $\epsilon$ of traffic experiences suboptimal actions

Typical values: $\epsilon \in [0.01, 0.10]$ in production systems.

Remark 9.5.1 (ε-Greedy for Continuous Actions). {#REM-9.5.1}

[EQ-9.30] assumes $|\mathcal{A}|$ is finite (discrete action space), where $\pi_b(a|s)$ is a probability mass function (PMF) that sums to 1 over all actions.

Continuous Actions. When $\mathcal{A} \subset \mathbb{R}^K$ is continuous (e.g., our boost space $\mathcal{A} = [-a_{\max}, +a_{\max}]^K$), we must reinterpret [EQ-9.30] in terms of probability density functions (PDFs). The "uniform distribution" becomes a density on $\mathcal{A}$, and $\pi_b(a|s)$ becomes a probability density function (not a probability mass function): $$ p_{\text{uniform}}(a) = \frac{1}{\text{vol}(\mathcal{A})} = \frac{1}{(2 a_{\max})^K} \tag{9.31} $$

The ε-greedy logging density (mixing formula for continuous actions) becomes: $$ \pi_b(a|s) = (1 - \epsilon) \cdot p_{\pi_{\text{prod}}}(a|s) + \epsilon \cdot p_{\text{uniform}}(a) $$ where $p_{\pi_{\text{prod}}}(a|s)$ and $p_{\text{uniform}}(a)$ are both PDFs that integrate to 1 over $\mathcal{A}$. For a Gaussian policy $\pi_{\text{prod}}(a|s) = \mathcal{N}(a; \mu(s), \Sigma(s))$, the logging density is a mixture of a Gaussian and a uniform density.

For $K = 10$ dimensions and $a_{\max} = 1$ (our default config), $\text{vol}(\mathcal{A}) = 2^{10} = 1024$ and $p_{\text{uniform}}(a) = 1/1024 \approx 0.001$.

Implementation note: In zoosim/evaluation/ope.py:133-134, we compute uniform_density = (0.5 ** action_dim) when $a_{\max} = 1$ (since the uniform density over $[-1, 1]^K$ is $(1/2)^K$). Ensure consistency between logged propensities and this formula.

Code ↔ Config (ε-Greedy Logging)

Parameter: SimulatorConfig.logging_epsilon in zoosim/core/config.py:240
Implementation: zoosim/envs/search_env.py (planned)
Lab 9.1 uses cfg.action.logging_epsilon to mix policies

9.5.2 Mixture Policies¶

Instead of uniform exploration, we can mix multiple structured policies:

$$ \pi_b(a|s) = \sum_{j=1}^M \lambda_j \pi_j(a|s) \tag{9.32} $$

where $\lambda_j \geq 0$, $\sum_j \lambda_j = 1$, and $\{\pi_j\}$ are base policies (e.g., different boost templates from Chapter 6, or Q-ensemble variants from Chapter 7).

Advantages: - Structured exploration: Each $\pi_j$ is a reasonable policy, so users experience fewer "random" bad actions - Better coverage: If base policies are diverse, the mixture has good coverage of the state-action space

Logging propensities. For each logged tuple $(s, a, r)$, record: $$ \pi_b(a|s) = \sum_{j=1}^M \lambda_j \pi_j(a|s) $$

Example 9.5.1 (Mixture of templates).

Suppose we have 3 boost templates from Chapter 6: - $\pi_1$: "Balanced" template (GMV-focused) - $\pi_2$: "CM2-heavy" template (margin-focused) - $\pi_3$: "Exploration" template (high entropy)

We log with mixture $\pi_b = 0.6 \pi_1 + 0.3 \pi_2 + 0.1 \pi_3$. For a logged action $a = [0.5, 0.0, 0.3, \ldots]$ (which happens to be template 2) in state $s$: $$ \pi_b(a|s) = 0.6 \cdot 0 + 0.3 \cdot 1 + 0.1 \cdot 0 = 0.3 $$

(assuming each template is deterministic for simplicity).

9.5.3 Contextual Logging and Propensity Estimation¶

In some production systems, propensities are not logged explicitly. Instead, we must estimate $\hat{\pi}_b(a|s)$ from data.

Setup. Suppose we have logged dataset $\mathcal{D} = \{(s_i, a_i, r_i)\}$ but no propensities. We want to estimate $\pi_b(a|s)$.

Method 1: Supervised classification (discrete actions).

If $\mathcal{A}$ is finite, fit a classifier $\hat{\pi}_b(a|s; \phi)$ (e.g., neural network with softmax output) to predict $a$ from $s$: $$ \phi^* = \arg\min_\phi \sum_{i=1}^m -\log \hat{\pi}_b(a_i | s_i; \phi) \tag{9.33} $$

This is imitation learning of the behavior policy.

Method 2: Conditional density estimation (continuous actions).

For continuous $\mathcal{A}$, fit a conditional density model (e.g., Gaussian, mixture density network): $$ \hat{\pi}b(a|s) = \mathcal{N}(a; \mu\phi(s), \Sigma_\phi(s)) \tag{9.34} $$

Errors compound. If $\hat{\pi}_b \neq \pi_b$, importance weights $\frac{\pi_e(a|s)}{\hat{\pi}_b(a|s)}$ are biased. This bias propagates to all IS-based estimators (IPS, SNIPS, DR, etc.).

Remark 9.5.1 (Connection to Chapter 2). Estimating propensities $\pi_b(a|s)$ is analogous to estimating position bias propensities in click models (Chapter 2, §2.3). Both require modeling conditional distributions from observational data, and both introduce bias when misspecified.

9.6 Empirical Evaluation and Calibration¶

How do we know if an OPE estimator is working? Ground truth is the on-policy evaluation:

$$ J_{\text{true}}(\pi_e) = \mathbb{E}_{\tau \sim \pi_e}[G(\tau)] \tag{9.35} $$

obtained by deploying $\pi_e$ online and averaging returns. This is expensive (requires online interaction), but essential for validating OPE estimators.

9.6.1 Evaluation Protocol¶

Setup:

Collect logged data from behavior policy $\pi_b$ (e.g., 10,000 episodes)
Train candidate policies $\{\pi_e^{(1)}, \ldots, \pi_e^{(K)}\}$ (e.g., Q-ensemble, REINFORCE, different hyperparameters)
OPE estimates: For each $\pi_e^{(j)}$, compute $\hat{J}_{\text{method}}(\pi_e^{(j)})$ using various methods (IPS, SNIPS, DR, FQE, etc.)
Ground truth: Deploy each $\pi_e^{(j)}$ online, collect 1,000 episodes, compute $\hat{J}_{\text{online}}(\pi_e^{(j)}) = \frac{1}{1000}\sum_{i=1}^{1000} G_i$
Evaluate OPE quality:
Correlation: Does OPE rank policies correctly? Compute Spearman rank correlation between $\hat{J}_{\text{OPE}}$ and $\hat{J}_{\text{online}}$
Mean Squared Error: $\text{MSE} = \frac{1}{K}\sum_{j=1}^K (\hat{J}_{\text{OPE}}(\pi_e^{(j)}) - \hat{J}_{\text{online}}(\pi_e^{(j)}))^2$
Regret: Did OPE select the best policy? $\max_j \hat{J}_{\text{online}}(\pi_e^{(j)}) - \hat{J}_{\text{online}}(\pi_{\text{OPE selected}})$

Benchmark standard [@voloshin:empirical_study:2021; @fu:benchmarks_off_policy_evaluation:2021]: MSE < 5% of performance range, Spearman $\rho > 0.8$.

9.6.2 Variance Estimation and Confidence Intervals¶

OPE estimates are random variables (functions of random dataset $\mathcal{D}$). We need confidence intervals to quantify uncertainty.

For IPS/SNIPS:

Use the bootstrap [@efron:introduction_bootstrap:1994]: 1. Resample $n$ trajectories from $\mathcal{D}$ with replacement to get $\mathcal{D}^*$ 2. Compute $\hat{J}^*_{\text{IPS}}$ on $\mathcal{D}^*$ 3. Repeat $B = 1000$ times to get distribution $\{\hat{J}^{*(b)}\}_{b=1}^B$ 4. 95% CI: $[\text{percentile}_{2.5}, \text{percentile}_{97.5}]$ of bootstrap distribution

For model-based methods (DM, FQE):

Bootstrap over training data, retrain model on each bootstrap sample, compute estimate. This accounts for model uncertainty.

Proposition 9.6.0 (Concentration Bound for IPS). {#PROP-9.6.0-IPS-concentration}

Under [ASSUMP-9.1.1]–[ASSUMP-9.1.4], the IPS estimator satisfies the following concentration inequality. With probability at least $1 - \delta$: $$ \left|\hat{J}{\text{IS}} - J(\pi_e)\right| \leq \frac{\rho $$ where }^T \cdot G_{\max}}{\sqrt{n}} \cdot \sqrt{2 \log(2/\delta)$\rho_{\max} = \sup_{s,a} \frac{\pi_e(a|s)}{\pi_b(a|s)}$ is the maximum importance ratio and $G_{\max} = \frac{R_{\max}}{1-\gamma}$ is the maximum return.

Proof. Under [ASSUMP-9.1.4], the importance-weighted returns are bounded: $w(\tau) G(\tau) \in [0, \rho_{\max}^T \cdot G_{\max}]$. Applying Hoeffding's inequality to the average of $n$ i.i.d. bounded random variables yields the stated bound. $\square$

Interpretation. This bound reveals the fundamental tension in IPS: - Sample complexity: Error scales as $1/\sqrt{n}$—need $n = O(1/\epsilon^2)$ samples for error $\epsilon$ - Horizon dependence: The factor $\rho_{\max}^T$ grows exponentially in horizon $T$, explaining why IPS fails for long episodes - Distribution shift penalty: Large $\rho_{\max}$ (when $\pi_e$ differs significantly from $\pi_b$) inflates the bound

For $\rho_{\max} = 2$, $T = 10$, $G_{\max} = 100$, and $\delta = 0.05$, achieving error $\leq 5$ requires $n \geq (2^{10} \cdot 100)^2 \cdot 2 \log(40) / 25 \approx 10^{12}$ samples—impractically large. This motivates variance reduction techniques (SNIPS, DR) and shorter evaluation horizons.

Acceptance criterion (from syllabus): OPE estimates should be within 95% CI of online estimates on held-out seeds. This is tested in Lab 9.3.

9.6.3 Diagnosing OPE Failures¶

Symptoms: - High variance: CI is very wide (e.g., [5, 50] for true value 20) - Poor correlation: OPE ranks policies incorrectly - Systematic bias: OPE consistently over/underestimates

Common causes and fixes:

Symptom	Likely Cause	Fix
High variance in IPS	Poor overlap ($\pi_e$ very different from $\pi_b$)	Use SNIPS, DR, or SWITCH; clip weights; improve $\pi_b$ exploration
Systematic bias in DM/FQE	Model misspecification, distribution shift	Use richer model class; add regularization; use DR instead of pure DM
Poor correlation across methods	Insufficient data	Collect more logged data; use lower-variance estimators (SNIPS, WDR)
Weight explosion (some $w_i > 10^6$)	Heavy tails in importance weights	Clip weights: $w_i \leftarrow \min(w_i, w_{\max})$; use SWITCH/MAGIC

Weight clipping. A common heuristic: cap importance weights at some threshold $w_{\max}$ (e.g., $w_{\max} = 100$): $$ \tilde{w}i = \min(w_i, w) \tag{9.36} $$

This introduces bias (violates [THM-9.2.1]) but dramatically reduces variance. The bias-variance trade-off often favors clipping in practice.

Proposition 9.6.1 (Negative bias for nonnegative rewards) {#PROP-9.6.1}.

Clipping introduces systematic bias. Formally: assume $r_i \ge 0$ and the positivity and integrability conditions from [ASSUMP-9.1.1]--[ASSUMP-9.1.4]. Then $$ \mathbb{E}[\hat{V}_{\text{clip}}(\pi_e)] \le V(\pi_e), $$ with strict inequality whenever $\mathbb{P}\big(\tfrac{\pi_e}{\pi_b}(a\mid s) > w_{\max}, \; r>0\big) > 0$.

Proof. Since $\min\{w_{\max}, w\} \le w$ for $w \ge 0$, we have $\min\{w_{\max}, w\} \cdot r \le w \cdot r$ for $r \ge 0$. Taking expectations yields the result; strict inequality holds when clipping occurs with positive probability and $r>0$. $\square$

Interpretation. For nonnegative rewards (e.g., GMV, clicks, engagement), clipped IPS is a conservative estimator---it underestimates the true value of $\pi_e$. This conservatism is sometimes desirable (prefer to underestimate than overestimate when deploying new policies), but be aware that aggressive clipping (small $w_{\max}$) amplifies the negative bias. Lab 9.5 in the exercises provides a numerical illustration of this bias--variance trade-off.

9.7 Theory-Practice Gap: When and Why OPE Fails¶

OPE theory provides unbiasedness guarantees (IPS, DR) and consistency results (SNIPS, FQE). But production OPE systems frequently fail. This section documents the gap between theory and practice, and what frontier research (2020-2025) says about it.

9.7.1 The Effective Horizon Problem¶

Theoretical assumption: Episode length $T$ is fixed and "not too large." Importance weights grow as $w(\tau) = \prod_{t=0}^T \rho_t$, which is exponential in $T$.

Practical reality: E-commerce sessions can be long ($T = 50$ queries), and even moderate per-step ratios compound catastrophically: $$ w(\tau) = (1.2)^{50} \approx 9,100 $$

Consequence: Variance of IPS grows exponentially in $T$. Beyond $T \approx 20$, IPS becomes unusable [@liu:breaking_curse_horizon:2018].

Solutions: - PDIS (§9.2.4): Reduces growth from $O(\rho^T)$ to $O(T \rho_{\max})$ - Doubly Robust: Model-based baseline absorbs most of the return, leaving small residuals for IS correction - Truncated horizon: Evaluate policies on shorter sub-trajectories (introduces bias but manageable variance)

Open problem: No estimator achieves low bias and low variance for long horizons ($T > 50$) with significant distribution shift. This is an active research area [@uehara:minimax_optimal_ope:2022].

9.7.2 Distribution Shift and Extrapolation¶

Theoretical assumption: Common support ([ASSUMP-9.1.1]) ensures $\pi_b(a|s) > 0$ whenever $\pi_e(a|s) > 0$.

Practical reality: Even when support is technically satisfied ($\pi_b(a|s) = 0.001$), effective support may be too small. If $\pi_e$ concentrates on rare actions under $\pi_b$, data is scarce.

Example 9.7.1 (State distribution shift).

Suppose $\pi_b$ is a "low-boost" policy that keeps users on generic products. $\pi_e$ is an "aggressive-boost" policy that promotes niche products. Even if action spaces overlap, the state distributions differ:

$\pi_b$ visits states $\{s_{\text{generic}}\}$: mainstream queries, high-volume products
$\pi_e$ visits states $\{s_{\text{niche}}\}$: specialized queries, low-volume products

Logged data has few $(s_{\text{niche}}, a)$ tuples, so $\hat{Q}(s_{\text{niche}}, a)$ is poorly estimated. FQE and DR extrapolate unreliably.

Solutions: - Marginal importance sampling [@liu:marginal_importance_sampling:2018]: Weight by state distribution mismatch, not just action mismatch - Offline RL constraints (Chapter 13): Penalize policies that stray far from $\pi_b$

9.7.3 Model Misspecification in Hybrid Estimators¶

Theoretical assumption: DR is unbiased if either model or propensities are correct ([THM-9.4.1]).

Practical reality: Both are usually wrong: - Propensities: Estimated $\hat{\pi}_b \neq \pi_b$ (§9.5.3) - Model: $\hat{Q}$ has approximation error due to finite model class, regularization, limited data

Consequence: DR is biased when both components are wrong. The bias can be larger than pure IS or pure DM [@su:doubly_robust_off_policy:2020].

Mitigation: - Use ensemble models for $\hat{Q}$ to reduce misspecification - Cross-validate propensity estimator $\hat{\pi}_b$ - Diagnostic: Compare DR, IPS, DM estimates—large disagreement signals trouble

9.7.4 What Frontier Research Says (2020-2025)¶

Recent work addresses these gaps:

Pessimistic OPE [@buckman:importance_weighting_ope:2021]: Lower-confidence-bound estimates that are provably conservative (underestimate). Used for safe policy selection in offline RL (Chapter 13 preview).

Stateful OPE [@uehara:minimax_optimal_ope:2022]: Optimal minimax estimators for MDPs that exploit sequential structure beyond PDIS.

Conformal prediction for OPE [@taufiq:conformal_off_policy_bandits:2022]: Distribution-free predictive intervals for contextual bandits without bootstrap, valid under standard bandit assumptions.

OPE with function approximation [@duan:minimax_optimal_function_approximation:2020]: Minimax-optimal sample complexity bounds for fitted Q-iteration with linear function approximation, establishing when model-based methods achieve information-theoretic optimality.

Practical consensus (as of 2025): - For short horizons ($T < 20$), well-tuned IS/SNIPS is reliable - For medium horizons ($T = 20-50$), use WDR or MAGIC with learned Q-functions - For long horizons ($T > 50$), no method is fully reliable—use OPE as a filter (reject obviously bad policies), then validate top candidates online

9.8 Production Implementation¶

We now implement a complete OPE module: zoosim/evaluation/ope.py. This module provides estimators (IPS, SNIPS, PDIS, DR, FQE) that integrate with existing agents (Q-ensemble, REINFORCE) from Chapters 7-8.

9.8.1 Module Structure¶

File: zoosim/evaluation/ope.py

Dependencies: - numpy for computation - typing for type hints - zoosim.policies for loading trained agents (provides $\pi_e$) - zoosim.core.config for hyperparameters

Public API:

# Estimator functions (all return float estimate + dict of diagnostics)
def ips_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95
) -> Tuple[float, Dict[str, Any]]:
    """Importance Sampling (IPS) estimator."""

def snips_estimate(...) -> Tuple[float, Dict[str, Any]]:
    """Self-Normalized IPS (SNIPS) estimator."""

def pdis_estimate(...) -> Tuple[float, Dict[str, Any]]:
    """Per-Decision Importance Sampling (PDIS) estimator."""

def dr_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    q_function: QFunction,  # From FQE or trained Q-ensemble
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95
) -> Tuple[float, Dict[str, Any]]:
    """Doubly Robust (DR) estimator."""

def fqe(
    dataset: List[Transition],
    pi_eval: Policy,
    gamma: float = 0.95,
    num_iterations: int = 100,
    model_class: str = "ensemble"  # "ensemble", "mlp", "linear"
) -> Tuple[QFunction, Dict[str, Any]]:
    """Fitted Q Evaluation (FQE). Returns trained Q-function."""

# Utility for logging with known propensities
def epsilon_greedy_logger(
    env: ZooplusSearchEnv,
    pi_prod: Policy,
    epsilon: float = 0.05,
    num_episodes: int = 1000,
    seed: int = 42
) -> List[Trajectory]:
    """Collect logged data with ε-greedy exploration."""

Data structures:

from dataclasses import dataclass
from typing import List, Dict, Any
import numpy as np

@dataclass
class Transition:
    """Single transition (s, a, r, s')."""
    state: Dict[str, Any]  # Observation dict from env.step()
    action: np.ndarray     # Action vector
    reward: float
    next_state: Dict[str, Any]
    done: bool
    propensity: float      # π_b(a|s), logged

@dataclass
class Trajectory:
    """Complete episode trajectory."""
    transitions: List[Transition]
    return_: float         # Cached return G = Σ γ^t r_t
    length: int

    def importance_weight(self, pi_eval: Policy) -> float:
        """Compute trajectory-level IS weight w(τ) = ∏_t π_e(a_t|s_t) / π_b(a_t|s_t)."""
        w = 1.0
        for t in self.transitions:
            w *= pi_eval.prob(t.action, t.state) / t.propensity
        return w

9.8.2 Core Estimator Implementations¶

IPS Estimator:

def ips_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95,
    clip_weights: Optional[float] = None
) -> Tuple[float, Dict[str, Any]]:
    """
    Importance Sampling (IPS) estimator.

    Implements [EQ-9.10] from §9.2.2.

    Args:
        dataset: List of logged trajectories from π_b
        pi_eval: Evaluation policy π_e
        pi_behavior: Behavior policy π_b (optional, if propensities not logged)
        gamma: Discount factor
        clip_weights: Optional max weight (to reduce variance)

    Returns:
        estimate: J(π_e) estimate
        diagnostics: Dict with weights, variance, effective_n
    """
    n = len(dataset)
    weights = []
    returns = []

    for traj in dataset:
        # Compute importance weight w(τ) = ∏_t π_e(a_t|s_t) / π_b(a_t|s_t)
        w = 1.0
        for trans in traj.transitions:
            if pi_behavior is not None:
                prob_b = pi_behavior.prob(trans.action, trans.state)
            else:
                prob_b = trans.propensity  # Use logged propensity

            prob_e = pi_eval.prob(trans.action, trans.state)
            w *= prob_e / prob_b

        if clip_weights is not None:
            w = min(w, clip_weights)

        weights.append(w)
        returns.append(traj.return_)

    weights = np.array(weights)
    returns = np.array(returns)

    # IPS estimate: E[w G]
    estimate = np.mean(weights * returns)

    # Diagnostics
    effective_n = (np.sum(weights) ** 2) / np.sum(weights ** 2)  # Kish's effective sample size
    diagnostics = {
        "weights_mean": np.mean(weights),
        "weights_std": np.std(weights),
        "weights_max": np.max(weights),
        "effective_sample_size": effective_n,
        "variance_estimate": np.var(weights * returns) / n
    }

    return estimate, diagnostics

SNIPS Estimator:

def snips_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95,
    clip_weights: Optional[float] = None
) -> Tuple[float, Dict[str, Any]]:
    """
    Self-Normalized Importance Sampling (SNIPS).

    Implements [EQ-9.13] from §9.2.3.
    """
    # Compute weights and returns (same as IPS)
    n = len(dataset)
    weights = []
    returns = []

    for traj in dataset:
        w = 1.0
        for trans in traj.transitions:
            prob_b = trans.propensity if pi_behavior is None else pi_behavior.prob(trans.action, trans.state)
            prob_e = pi_eval.prob(trans.action, trans.state)
            w *= prob_e / prob_b

        if clip_weights is not None:
            w = min(w, clip_weights)

        weights.append(w)
        returns.append(traj.return_)

    weights = np.array(weights)
    returns = np.array(returns)

    # SNIPS: (Σ w_i G_i) / (Σ w_i)
    estimate = np.sum(weights * returns) / np.sum(weights)

    diagnostics = {
        "weights_sum": np.sum(weights),
        "weights_mean": np.mean(weights),
        "weights_max": np.max(weights),
        "effective_sample_size": (np.sum(weights) ** 2) / np.sum(weights ** 2)
    }

    return estimate, diagnostics

PDIS Estimator:

def pdis_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95
) -> Tuple[float, Dict[str, Any]]:
    """
    Per-Decision Importance Sampling (PDIS).

    Implements [EQ-9.16] from §9.2.4.
    Lower variance than IPS for long trajectories.
    """
    total = 0.0
    n = len(dataset)

    for traj in dataset:
        # For each timestep t, compute ∏_{k=0}^t ρ_k
        rho_product = 1.0
        for t, trans in enumerate(traj.transitions):
            prob_b = trans.propensity if pi_behavior is None else pi_behavior.prob(trans.action, trans.state)
            prob_e = pi_eval.prob(trans.action, trans.state)
            rho_t = prob_e / prob_b
            rho_product *= rho_t

            # Discounted reward at timestep t
            total += (gamma ** t) * rho_product * trans.reward

    estimate = total / n

    diagnostics = {
        "num_trajectories": n,
        "avg_traj_length": np.mean([len(traj.transitions) for traj in dataset])
    }

    return estimate, diagnostics

Doubly Robust Estimator:

def dr_estimate(
    dataset: List[Trajectory],
    pi_eval: Policy,
    q_function: QFunction,
    pi_behavior: Optional[Policy] = None,
    gamma: float = 0.95
) -> Tuple[float, Dict[str, Any]]:
    """
    Doubly Robust (DR) estimator.

    Implements [EQ-9.25] from §9.4.1.
    Combines model-based baseline with IS correction.
    """
    n = len(dataset)
    estimates = []

    for traj in dataset:
        # Model-based baseline: V(s_0)
        s_0 = traj.transitions[0].state
        v_0 = q_function.value(s_0, pi_eval)  # V(s) = Σ_a π(a|s) Q(s,a)

        # IS correction: Σ_t γ^t (∏_{k≤t} ρ_k) δ_t
        correction = 0.0
        rho_product = 1.0

        for t, trans in enumerate(traj.transitions):
            prob_b = trans.propensity if pi_behavior is None else pi_behavior.prob(trans.action, trans.state)
            prob_e = pi_eval.prob(trans.action, trans.state)
            rho_t = prob_e / prob_b
            rho_product *= rho_t

            # TD error: δ_t = r_t + γ V(s_{t+1}) - Q(s_t, a_t)
            q_val = q_function.q(trans.state, trans.action)
            if not trans.done:
                v_next = q_function.value(trans.next_state, pi_eval)
            else:
                v_next = 0.0

            td_error = trans.reward + gamma * v_next - q_val
            correction += (gamma ** t) * rho_product * td_error

        estimates.append(v_0 + correction)

    estimate = np.mean(estimates)

    diagnostics = {
        "model_baseline_mean": np.mean([q_function.value(traj.transitions[0].state, pi_eval) for traj in dataset]),
        "correction_mean": np.mean([est - q_function.value(traj.transitions[0].state, pi_eval) for traj, est in zip(dataset, estimates)]),
        "estimates_std": np.std(estimates)
    }

    return estimate, diagnostics

FQE Implementation:

def fqe(
    dataset: List[Transition],  # Flattened transitions, not trajectories
    pi_eval: Policy,
    gamma: float = 0.95,
    num_iterations: int = 100,
    model_class: str = "ensemble",
    hidden_dims: List[int] = [128, 128],
    learning_rate: float = 1e-3
) -> Tuple[QFunction, Dict[str, Any]]:
    """
    Fitted Q Evaluation (FQE).

    Implements §9.3.2 algorithm. Fits Q^{π_e}(s,a) via Bellman iterations.

    Returns:
        q_function: Trained Q-function (QEnsemble or MLP)
        diagnostics: Training losses, convergence info
    """
    from zoosim.policies.q_ensemble import QEnsemble
    import torch

    # Initialize Q-function
    state_dim = dataset[0].state["features"].shape[0]  # Assuming state has 'features' key
    action_dim = dataset[0].action.shape[0]

    if model_class == "ensemble":
        q_func = QEnsemble(state_dim, action_dim, num_ensemble=5, hidden_dims=hidden_dims)
    elif model_class == "mlp":
        # Single MLP (simpler, but no uncertainty estimates)
        q_func = SimpleMLP(state_dim, action_dim, hidden_dims)
    else:
        raise ValueError(f"Unknown model_class: {model_class}")

    optimizer = torch.optim.Adam(q_func.parameters(), lr=learning_rate)
    losses = []

    # Convert dataset to tensors
    states = torch.FloatTensor(np.array([t.state["features"] for t in dataset]))
    actions = torch.FloatTensor(np.array([t.action for t in dataset]))
    rewards = torch.FloatTensor(np.array([t.reward for t in dataset]))
    next_states = torch.FloatTensor(np.array([t.next_state["features"] for t in dataset]))
    dones = torch.FloatTensor(np.array([float(t.done) for t in dataset]))

    # FQE iterations
    for k in range(num_iterations):
        # Compute targets: y_i = r_i + γ Σ_a π_e(a|s') Q_{k-1}(s', a)
        with torch.no_grad():
            # Sample actions from π_e for next states (Monte Carlo approximation of Σ_a)
            # For simplicity, use deterministic policy or top-k sample
            next_actions = torch.stack([torch.FloatTensor(pi_eval.act(next_states[i].numpy())) for i in range(len(next_states))])
            next_q_values = q_func.forward(next_states, next_actions)  # Q(s', π_e(s'))
            targets = rewards + gamma * (1 - dones) * next_q_values

        # Regression step: minimize (Q_k(s,a) - targets)^2
        q_pred = q_func.forward(states, actions)
        loss = torch.mean((q_pred - targets) ** 2)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        if k % 10 == 0:
            print(f"FQE iteration {k}/{num_iterations}, loss: {loss.item():.4f}")

    diagnostics = {
        "final_loss": losses[-1],
        "losses": losses,
        "num_iterations": num_iterations
    }

    return q_func, diagnostics

9.8.3 Integration with Existing Agents¶

Our Q-ensemble from Chapter 7 and REINFORCE from Chapter 8 can be used as $\pi_e$ for OPE. We just need to implement the Policy interface.

Policy Interface:

# zoosim/evaluation/ope.py (continued)

from abc import ABC, abstractmethod

class Policy(ABC):
    """Abstract policy interface for OPE."""

    @abstractmethod
    def act(self, state: Dict[str, Any]) -> np.ndarray:
        """Select action for given state."""
        pass

    @abstractmethod
    def prob(self, action: np.ndarray, state: Dict[str, Any]) -> float:
        """Return probability π(a|s)."""
        pass

class QEnsemblePolicy(Policy):
    """Wrapper for Q-ensemble agent from Chapter 7."""

    def __init__(self, q_ensemble_agent, exploration_noise: float = 0.0):
        self.agent = q_ensemble_agent
        self.exploration_noise = exploration_noise

    def act(self, state: Dict[str, Any]) -> np.ndarray:
        """Use CEM to optimize Q(s, a)."""
        return self.agent.act(state)  # Returns optimized action via CEM

    def prob(self, action: np.ndarray, state: Dict[str, Any]) -> float:
        """
        For deterministic policy + Gaussian noise, return density.
        π(a|s) = N(a; μ_CEM(s), σ^2 I) where μ_CEM is CEM-optimized action.
        """
        if self.exploration_noise == 0.0:
            # Deterministic: prob is Dirac delta (approximate as 1.0 if action matches)
            optimal_action = self.act(state)
            if np.allclose(action, optimal_action, atol=1e-3):
                return 1.0
            else:
                return 1e-10  # Numerical stability
        else:
            # Gaussian noise around optimal action
            optimal_action = self.act(state)
            diff = action - optimal_action
            log_prob = -0.5 * np.sum((diff / self.exploration_noise) ** 2)
            return np.exp(log_prob) / ((2 * np.pi * self.exploration_noise ** 2) ** (len(action) / 2))

class REINFORCEPolicy(Policy):
    """Wrapper for REINFORCE agent from Chapter 8."""

    def __init__(self, reinforce_agent):
        self.agent = reinforce_agent

    def act(self, state: Dict[str, Any]) -> np.ndarray:
        """Sample from Gaussian policy."""
        return self.agent.act(state, deterministic=False)

    def prob(self, action: np.ndarray, state: Dict[str, Any]) -> float:
        """Gaussian density π(a|s) = N(a; μ_θ(s), σ_θ(s))."""
        mean, log_std = self.agent.policy_net(torch.FloatTensor(state["features"]))
        std = torch.exp(log_std)

        # Compute log probability
        action_tensor = torch.FloatTensor(action)
        log_prob = -0.5 * torch.sum(((action_tensor - mean) / std) ** 2) - torch.sum(log_std) - 0.5 * len(action) * np.log(2 * np.pi)
        return torch.exp(log_prob).item()

9.8.4 Example Usage¶

Complete OPE workflow:

# scripts/ch09/ope_demo.py

import numpy as np
from zoosim.core.config import SimulatorConfig
from zoosim.envs.gym_env import GymSearchEnv
from zoosim.policies.q_ensemble import QEnsembleAgent
from zoosim.policies.reinforce import REINFORCEAgent
from zoosim.evaluation.ope import (
    epsilon_greedy_logger, ips_estimate, snips_estimate, pdis_estimate, dr_estimate, fqe,
    QEnsemblePolicy, REINFORCEPolicy
)

def main():
    cfg = SimulatorConfig()
    env = GymSearchEnv(cfg, seed=42)

    # 1. Train behavior policy (Q-ensemble from Ch7)
    print("Training behavior policy...")
    pi_b_agent = QEnsembleAgent(state_dim=17, action_dim=10, num_ensemble=5)
    # ... train pi_b on env for 5000 episodes (code from Ch7)

    # 2. Collect logged data with ε-greedy exploration
    print("Collecting logged data...")
    pi_b_policy = QEnsemblePolicy(pi_b_agent, exploration_noise=0.05)
    logged_data = epsilon_greedy_logger(env, pi_b_policy, epsilon=0.1, num_episodes=10000, seed=42)

    print(f"Collected {len(logged_data)} trajectories")
    print(f"Average return: {np.mean([traj.return_ for traj in logged_data]):.2f}")

    # 3. Train evaluation policy (REINFORCE from Ch8)
    print("Training evaluation policy (REINFORCE)...")
    pi_e_agent = REINFORCEAgent(state_dim=17, action_dim=10)
    # ... train pi_e (code from Ch8)

    pi_e_policy = REINFORCEPolicy(pi_e_agent)

    # 4. OPE estimates
    print("\n=== Off-Policy Evaluation ===")

    # IPS
    ips_est, ips_diag = ips_estimate(logged_data, pi_e_policy, pi_behavior=pi_b_policy)
    print(f"IPS estimate: {ips_est:.2f}")
    print(f"  - Effective sample size: {ips_diag['effective_sample_size']:.1f} / {len(logged_data)}")
    print(f"  - Max weight: {ips_diag['weights_max']:.2f}")

    # SNIPS (lower variance)
    snips_est, snips_diag = snips_estimate(logged_data, pi_e_policy, pi_behavior=pi_b_policy)
    print(f"SNIPS estimate: {snips_est:.2f}")

    # PDIS (exploits sequential structure)
    pdis_est, pdis_diag = pdis_estimate(logged_data, pi_e_policy, pi_behavior=pi_b_policy)
    print(f"PDIS estimate: {pdis_est:.2f}")

    # FQE (model-based)
    print("\nFitting Q-function via FQE...")
    transitions = [t for traj in logged_data for t in traj.transitions]  # Flatten
    q_func, fqe_diag = fqe(transitions, pi_e_policy, num_iterations=100, model_class="ensemble")
    fqe_est = np.mean([q_func.value(traj.transitions[0].state, pi_e_policy) for traj in logged_data])
    print(f"FQE estimate: {fqe_est:.2f}")
    print(f"  - Final loss: {fqe_diag['final_loss']:.4f}")

    # DR (hybrid)
    dr_est, dr_diag = dr_estimate(logged_data, pi_e_policy, q_func, pi_behavior=pi_b_policy)
    print(f"DR estimate: {dr_est:.2f}")
    print(f"  - Model baseline: {dr_diag['model_baseline_mean']:.2f}")
    print(f"  - IS correction: {dr_diag['correction_mean']:.2f}")

    # 5. Ground truth (online evaluation)
    print("\n=== Ground Truth (Online) ===")
    online_returns = []
    for ep in range(1000):
        env.reset()
        done = False
        ep_return = 0.0
        t = 0
        while not done:
            obs = env.get_obs()
            action = pi_e_policy.act(obs)
            _, reward, done, _ = env.step(action)
            ep_return += (0.95 ** t) * reward
            t += 1
        online_returns.append(ep_return)

    online_mean = np.mean(online_returns)
    online_ci = 1.96 * np.std(online_returns) / np.sqrt(len(online_returns))
    print(f"Online return: {online_mean:.2f} ± {online_ci:.2f} (95% CI)")

    # Compare
    print("\n=== Comparison ===")
    print(f"IPS error:   {abs(ips_est - online_mean):.2f}")
    print(f"SNIPS error: {abs(snips_est - online_mean):.2f}")
    print(f"PDIS error:  {abs(pdis_est - online_mean):.2f}")
    print(f"FQE error:   {abs(fqe_est - online_mean):.2f}")
    print(f"DR error:    {abs(dr_est - online_mean):.2f}")

    # Best estimator
    errors = [
        ("IPS", abs(ips_est - online_mean)),
        ("SNIPS", abs(snips_est - online_mean)),
        ("PDIS", abs(pdis_est - online_mean)),
        ("FQE", abs(fqe_est - online_mean)),
        ("DR", abs(dr_est - online_mean))
    ]
    best = min(errors, key=lambda x: x[1])
    print(f"\nBest estimator: {best[0]} (error {best[1]:.2f})")

if __name__ == "__main__":
    main()

Expected output:

Training behavior policy...
Collecting logged data...
Collected 10000 trajectories
Average return: 18.73

Training evaluation policy (REINFORCE)...

=== Off-Policy Evaluation ===
IPS estimate: 21.45
  - Effective sample size: 1247.3 / 10000
  - Max weight: 47.21
SNIPS estimate: 20.82
PDIS estimate: 20.91
Fitting Q-function via FQE...
FQE estimate: 20.34
  - Final loss: 1.2341
DR estimate: 20.78
  - Model baseline: 19.82
  - IS correction: 0.96

=== Ground Truth (Online) ===
Online return: 20.67 ± 0.31 (95% CI)

=== Comparison ===
IPS error:   0.78
SNIPS error: 0.15
PDIS error:  0.24
FQE error:   0.33
DR error:    0.11

Best estimator: DR (error 0.11)

In this example, Doubly Robust achieves the lowest error, followed closely by SNIPS. IPS has higher error due to variance from heavy-tailed weights.

9.9 Connections to Previous and Future Chapters¶

Backward connections:

Chapter 2 (Position Bias and Click Models): OPE propensities $\pi_b(a|s)$ are analogous to position bias propensities $P(\text{exam}|k)$ in click models. Both require careful estimation from observational data. [DEF-2.3.1] introduced propensity weighting for unbiased CTR estimates—OPE generalizes this to sequential decision-making.
Chapter 3 (Bellman Operators): FQE ([EQ-9.21]) applies the policy evaluation Bellman operator $\mathcal{T}^\pi$ from [THM-3.5.1-Bellman] (Bellman Expectation Equation) to logged data. Under exact iteration and correct model specification, the contraction property [THM-3.7.1] and Banach fixed-point theorem [THM-3.6.2-Banach] yield convergence of the corresponding Bellman iterations. Regret theory is treated separately in Chapter 6, while CMDP duality and primal-dual methods are developed in Appendix C.
Chapter 7 (Q-Ensemble): The Q-function $\hat{Q}(s,a)$ trained in Ch7 can be used directly in DR and FQE. Our production DR implementation (§9.8.2) reuses QEnsemble from zoosim/policies/q_ensemble.py.
Chapter 8 (REINFORCE): Policy gradient trajectories $\tau \sim \pi_\theta$ can be evaluated off-policy via IPS/DR. This enables hyperparameter tuning without repeated online rollouts—fit multiple REINFORCE agents offline, evaluate with OPE, select best.

Forward connections:

Chapter 11 (Multi-Episode MDPs): OPE becomes more challenging in multi-episode settings where actions affect future session returns. Importance weights compound across sessions, requiring marginal IS [@liu:marginal_importance_sampling:2018] or other advanced techniques.
Chapter 13 (Offline RL): OPE is the safety filter for offline RL. We use OPE estimates to select among candidate policies learned from logged data, then validate top candidates online. Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) incorporate OPE-style pessimism directly into training.

Exercises & Labs¶

Enhanced labs with critical improvements are in exercises_labs.md. Key additions:

Lab 9.1 (IPS from simulator logs): Now includes variance analysis across different ε values and propensity sensitivity
Lab 9.2 (FQE integration): Connects to Chapter 3 Bellman operator explicitly, uses real Q-ensemble from Ch7
Lab 9.3 (Estimator comparison): Compare IPS, SNIPS, PDIS, DR, FQE against ground truth, reproduce MSE < 5% and Spearman ρ > 0.8 benchmarks
Lab 9.4 (Distribution shift stress test): Evaluate OPE when π_e is very different from π_b, document failure modes
Lab 9.5 (Clipped IPS and SNIPS bias--variance): Numerically illustrate [PROP-9.6.1] (negative bias under clipping) and compare IPS/SNIPS variance

See Lab 9.1, Lab 9.2, Lab 9.3, Lab 9.4, Lab 9.5.

Production Checklist¶

Production Checklist (Chapter 9)

Configuration: - [ ] Set SimulatorConfig.logging_epsilon for ε-greedy exploration (recommend 0.05-0.10) - [ ] Choose OPE estimator based on trajectory length: - Short ($T < 20$): IPS or SNIPS - Medium ($T = 20-50$): WDR or DR with FQE Q-function - Long ($T > 50$): DR with aggressive weight clipping - [ ] Set weight clipping threshold (clip_weights parameter) based on effective sample size target (e.g., $n_{\text{eff}} > 1000$)

Logging: - [ ] Record propensities $\pi_b(a|s)$ alongside $(s,a,r,s')$ in production logs - [ ] Store initial state $s_0$ for each episode (needed for DM/FQE value estimates) - [ ] Log episode returns $G$ for quick sanity checks

Validation: - [ ] Compute OPE estimates on historical data with multiple estimators (IPS, SNIPS, DR) - [ ] Compare against held-out online evaluation (1000+ episodes) - [ ] Target: MSE < 5% of performance range, CI overlap with ground truth - [ ] If estimators disagree by >20%, investigate: poor overlap, model misspecification, or propensity errors

Deployment: - [ ] Use OPE for initial policy screening (filter out obviously bad policies) - [ ] Validate top 3-5 candidates online before full production rollout - [ ] Monitor OPE vs. online performance over time (detect distribution drift) - [ ] Document OPE failures (e.g., when DR/IPS differ by >2σ) for future improvements

References¶

Key papers and resources for off-policy evaluation:

Foundations: [@precup:eligibility_traces:2000; @thomas:high_confidence:2015; @jiang:doubly_robust:2016]
Doubly Robust: [@dudik:doubly_robust:2014; @jiang:doubly_robust:2016]
SWITCH/MAGIC: [@thomas:data_efficient_policy_evaluation:2016]
FQE: [@le:batch_policy_learning:2019; @paine:hyperparameter_selection:2020]
Recent advances: [@uehara:minimax_optimal_ope:2022; @taufiq:conformal_off_policy_bandits:2022]
Benchmarks: [@voloshin:empirical_study:2021; @fu:benchmarks_off_policy_evaluation:2021]

End of Chapter 9

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search

Symptom	Likely Cause	Fix
High variance in IPS	Poor overlap (\(\pi_e\) very different from \(\pi_b\))	Use SNIPS, DR, or SWITCH; clip weights; improve \(\pi_b\) exploration
Systematic bias in DM/FQE	Model misspecification, distribution shift	Use richer model class; add regularization; use DR instead of pure DM
Poor correlation across methods	Insufficient data	Collect more logged data; use lower-variance estimators (SNIPS, WDR)
Weight explosion (some \(w_i > 10^6\))	Heavy tails in importance weights	Clip weights: \(w_i \leftarrow \min(w_i, w_{\max})\); use SWITCH/MAGIC