Chapter 1 — Exercises & Labs (Application Mode)

Reward design is now backed both by the closed-form objective (Chapter 1, #EQ-1.2) and by executable checks. The following labs keep theory and implementation coupled.

Lab 1.1 — Reward Aggregation in the Simulator

Goal: inspect a real simulator step, record the GMV/CM2/STRAT/CLICKS decomposition, and verify that it matches the derivation of #EQ-1.2.

Chapter 1 labs use the self-contained reference implementation in scripts/ch01/lab_solutions.py. The main chapter includes an optional end-to-end environment smoke test; the full ZooplusSearchEnv integration narrative begins in Chapter 5.

from scripts.ch01.lab_solutions import lab_1_1_reward_aggregation

_ = lab_1_1_reward_aggregation(seed=11, verbose=True)

Output (actual):

======================================================================
Lab 1.1: Reward Aggregation in the Simulator
======================================================================

Session simulation (seed=11):
  User segment: price_hunter
  Query: "cat food"

Outcome breakdown:
  GMV:    €124.46 (gross merchandise value)
  CM2:    € 18.67 (contribution margin 2)
  STRAT:  0 purchases  (strategic purchases in session)
  CLICKS: 3        (total clicks)

Reward weights (from RewardConfig):
  alpha (alpha_gmv):     1.00
  beta (beta_cm2):       0.50
  gamma (gamma_strat):   0.20
  delta (delta_clicks):  0.10

Manual computation of R = alpha*GMV + beta*CM2 + gamma*STRAT + delta*CLICKS:
  = 1.00 x 124.46 + 0.50 x 18.67 + 0.20 x 0 + 0.10 x 3
  = 124.46 + 9.34 + 0.00 + 0.30
  = 134.09

Simulator-reported reward: 134.09

Verification: |computed - reported| = 0.00 < 0.01 [OK]

The simulator correctly implements [EQ-1.2].

Tasks 1. Recompute \(R = \alpha \text{GMV} + \beta \text{CM2} + \gamma \text{STRAT} + \delta \text{CLICKS}\) from the printed outcome and confirm agreement with the reported value. 2. Run the bound validator validate_delta_alpha_bound() (or lab_1_1_delta_alpha_violation()) and record the smallest \(\delta/\alpha\) violation. (Optional extension: reproduce the same failure via the production assertion in zoosim/dynamics/reward.py:56 by calling the production compute_reward path.) 3. Push the findings back into the Chapter 1 text—this lab explains why the implementation enforces the same bounds as Remark 1.2.1.

Lab 1.2 — Delta/Alpha Bound Regression Test

Goal: keep the published examples executable via pytest so every edit to Chapter 1 remains tethered to code.

pytest tests/ch01/test_reward_examples.py -v

Output (actual):

============================= test session starts =============================
platform darwin -- Python 3.12.12, pytest-9.0.0, pluggy-1.6.0
rootdir: /Volumes/Lexar2T/src/reinforcement_learning_search_from_scratch
configfile: pyproject.toml
collecting ... collected 5 items

tests/ch01/test_reward_examples.py::test_basic_reward_comparison PASSED  [ 20%]
tests/ch01/test_reward_examples.py::test_profitability_weighting PASSED  [ 40%]
tests/ch01/test_reward_examples.py::test_rpc_diagnostic PASSED           [ 60%]
tests/ch01/test_reward_examples.py::test_delta_alpha_bounds PASSED       [ 80%]
tests/ch01/test_reward_examples.py::test_rpc_edge_cases PASSED           [100%]

============================== 5 passed in 0.15s ===============================

Tasks 1. Identify which lines in the tests correspond to the worked examples in §1.2 and to the guardrail in [REM-1.2.1]. 2. Use the test names as an index: every time Chapter 1 changes a numerical claim, one of these tests should be updated in lockstep.


Lab 1.3 --- Reward Function Implementation

Goal: implement the full reward aggregation from #EQ-1.2 with data structures for session outcomes and business weights. This lab provides the complete implementation referenced in Section 1.2.

from dataclasses import dataclass
from typing import NamedTuple

class SessionOutcome(NamedTuple):
    """Outcomes from a single search session.

    Mathematical correspondence: realization omega in Omega of random variables
    (GMV, CM2, STRAT, CLICKS).
    """
    gmv: float          # Gross merchandise value (EUR)
    cm2: float          # Contribution margin 2 (EUR)
    strat_purchases: int # Number of strategic purchases in session
    clicks: int         # Total clicks

@dataclass
class BusinessWeights:
    """Business priority coefficients (alpha, beta, gamma, delta) in #EQ-1.2."""
    alpha_gmv: float = 1.0
    beta_cm2: float = 0.5
    gamma_strat: float = 0.2
    delta_clicks: float = 0.1

def compute_reward(outcome: SessionOutcome, weights: BusinessWeights) -> float:
    """Implements #EQ-1.2: R = alpha*GMV + beta*CM2 + gamma*STRAT + delta*CLICKS.

    This is the **scalar objective** we will maximize via RL.

    See `zoosim/dynamics/reward.py:42-66` for the production implementation that
    aggregates GMV/CM2/strategic purchases/clicks using `RewardConfig`
    parameters defined in `zoosim/core/config.py:195`.
    """
    return (weights.alpha_gmv * outcome.gmv +
            weights.beta_cm2 * outcome.cm2 +
            weights.gamma_strat * outcome.strat_purchases +
            weights.delta_clicks * outcome.clicks)

# Example: Compare two strategies
# Strategy A: Maximize GMV (show expensive products)
outcome_A = SessionOutcome(gmv=120.0, cm2=15.0, strat_purchases=1, clicks=3)

# Strategy B: Balance GMV and CM2 (show profitable products)
outcome_B = SessionOutcome(gmv=100.0, cm2=35.0, strat_purchases=3, clicks=4)

weights = BusinessWeights(alpha_gmv=1.0, beta_cm2=0.5, gamma_strat=0.2, delta_clicks=0.1)

R_A = compute_reward(outcome_A, weights)
R_B = compute_reward(outcome_B, weights)

print(f"Strategy A (GMV-focused): R = {R_A:.2f}")
print(f"Strategy B (Balanced):    R = {R_B:.2f}")
print(f"Delta = {R_B - R_A:.2f} (Strategy {'B' if R_B > R_A else 'A'} wins!)")

Output:

Strategy A (GMV-focused): R = 128.00
Strategy B (Balanced):    R = 118.50
Delta = -9.50 (Strategy A wins!)

Tasks 1. Verify compute_reward matches #EQ-1.2 exactly by hand-calculating \(R_A\) and \(R_B\). 2. Test with boundary cases: zero GMV, negative CM2 (loss-leader scenario), zero clicks. 3. What happens when alpha_gmv = 0? Is the function still meaningful?


Lab 1.4 --- Weight Sensitivity Analysis

Goal: explore how different business weight configurations change optimal strategy selection. This lab extends Lab 1.3 with weight recalibration.

from dataclasses import dataclass
from typing import NamedTuple

class SessionOutcome(NamedTuple):
    gmv: float
    cm2: float
    strat_purchases: int
    clicks: int

@dataclass
class BusinessWeights:
    alpha_gmv: float = 1.0
    beta_cm2: float = 0.5
    gamma_strat: float = 0.2
    delta_clicks: float = 0.1

def compute_reward(outcome: SessionOutcome, weights: BusinessWeights) -> float:
    return (weights.alpha_gmv * outcome.gmv +
            weights.beta_cm2 * outcome.cm2 +
            weights.gamma_strat * outcome.strat_purchases +
            weights.delta_clicks * outcome.clicks)

# Same outcomes as Lab 1.3
outcome_A = SessionOutcome(gmv=120.0, cm2=15.0, strat_purchases=1, clicks=3)
outcome_B = SessionOutcome(gmv=100.0, cm2=35.0, strat_purchases=3, clicks=4)

# Original weights: Strategy A wins
weights_gmv = BusinessWeights(alpha_gmv=1.0, beta_cm2=0.5, gamma_strat=0.2, delta_clicks=0.1)
print("With GMV-focused weights:")
print(f"  Strategy A: R = {compute_reward(outcome_A, weights_gmv):.2f}")
print(f"  Strategy B: R = {compute_reward(outcome_B, weights_gmv):.2f}")

# Profitability weights: Strategy B wins
weights_profit = BusinessWeights(alpha_gmv=0.5, beta_cm2=1.0, gamma_strat=0.5, delta_clicks=0.1)
print("\nWith profitability-focused weights:")
print(f"  Strategy A: R = {compute_reward(outcome_A, weights_profit):.2f}")
print(f"  Strategy B: R = {compute_reward(outcome_B, weights_profit):.2f}")

Output:

With GMV-focused weights:
  Strategy A: R = 128.00
  Strategy B: R = 118.50

With profitability-focused weights:
  Strategy A: R = 75.80
  Strategy B: R = 86.90

Tasks 1. Find weights where Strategy A and Strategy B achieve exactly equal reward. 2. Plot reward as a function of beta_cm2 / alpha_gmv ratio (from 0 to 2). At what ratio does the optimal strategy flip? 3. Identify real business scenarios where each weight configuration is appropriate (e.g., clearance sale vs. brand-building campaign).


Lab 1.5 --- RPC (Revenue per Click) Monitoring (Clickbait Detection)

Goal: implement the RPC diagnostic from Section 1.2.1 to detect clickbait strategies. A healthy system has high GMV per click; clickbait produces high CTR with low revenue per click.

from typing import NamedTuple

class SessionOutcome(NamedTuple):
    gmv: float
    cm2: float
    strat_purchases: int
    clicks: int

def compute_rpc(outcome: SessionOutcome) -> float:
    """GMV per click (revenue per click, RPC).

    Diagnostic for clickbait detection: high CTR with low RPC indicates
    the agent is optimizing delta*CLICKS at expense of alpha*GMV.
    See Section 1.2.1 for theory.
    """
    return outcome.gmv / outcome.clicks if outcome.clicks > 0 else 0.0

def validate_engagement_bound(delta: float, alpha: float, bound: float = 0.10) -> bool:
    """Check delta/alpha <= bound (Section 1.2.1 clickbait prevention)."""
    ratio = delta / alpha if alpha > 0 else float('inf')
    return ratio <= bound

# Compare revenue per click
outcome_A = SessionOutcome(gmv=120.0, cm2=15.0, strat_purchases=1, clicks=3)
outcome_B = SessionOutcome(gmv=100.0, cm2=35.0, strat_purchases=3, clicks=4)

rpc_A = compute_rpc(outcome_A)
rpc_B = compute_rpc(outcome_B)

print("Revenue per click (GMV per click):")
print(f"Strategy A: EUR {rpc_A:.2f}/click ({outcome_A.clicks} clicks -> EUR {outcome_A.gmv:.0f} GMV)")
print(f"Strategy B: EUR {rpc_B:.2f}/click ({outcome_B.clicks} clicks -> EUR {outcome_B.gmv:.0f} GMV)")
print(f"-> Strategy {'A' if rpc_A > rpc_B else 'B'} has higher-quality engagement")

# Verify delta/alpha bound
delta, alpha = 0.1, 1.0
print(f"\n[Validation] delta/alpha = {delta/alpha:.3f}")
print(f"             Bound check: {'PASS' if validate_engagement_bound(delta, alpha) else 'FAIL'} (must be <= 0.10)")

# Simulate clickbait scenario
clickbait_outcome = SessionOutcome(gmv=30.0, cm2=5.0, strat_purchases=0, clicks=15)
print(f"\n[Clickbait scenario] GMV={clickbait_outcome.gmv}, clicks={clickbait_outcome.clicks}")
print(f"  RPC = EUR {compute_rpc(clickbait_outcome):.2f}/click <- RED FLAG: very low!")

Output:

Revenue per click (GMV per click):
Strategy A: EUR 40.00/click (3 clicks -> EUR 120 GMV)
Strategy B: EUR 25.00/click (4 clicks -> EUR 100 GMV)
-> Strategy A has higher-quality engagement

[Validation] delta/alpha = 0.100
             Bound check: PASS (must be <= 0.10)

[Clickbait scenario] GMV=30, clicks=15
  RPC = EUR 2.00/click <- RED FLAG: very low!

Tasks 1. Generate 100 synthetic outcomes with varying click/GMV ratios. Plot the RPC distribution. 2. Define an alerting threshold: if RPC drops \(>10\%\) below baseline, flag for review. 3. Implement a running RPC tracker: \(\text{RPC}_t = \sum_{i=1}^t \text{GMV}_i / \sum_{i=1}^t \text{CLICKS}_i\). 4. What happens if delta/alpha = 0.20 (above bound)? Simulate and observe RPC degradation.


Lab 1.6 --- User Heterogeneity Simulation

Goal: demonstrate why static boost weights fail across different user segments. This lab implements the heterogeneity experiment from Section 1.3.

def simulate_click_probability(product_score: float, position: int,
                                user_type: str) -> float:
    """Probability of click given score and position.

    Models position bias: P(click | position k) is proportional to 1/k.
    User types have different sensitivities to boost features.

    Note: This is a simplified model for exposition. Production uses
    sigmoid utilities and calibrated position bias from BehaviorConfig.
    See zoosim/dynamics/behavior.py for the full implementation.
    """
    position_bias = 1.0 / position  # Top positions get more attention

    if user_type == "price_hunter":
        # Highly responsive to discount boosts
        relevance_weight = 0.3
        boost_weight = 0.7
    elif user_type == "premium":
        # Prioritizes base relevance, ignores discounts
        relevance_weight = 0.8
        boost_weight = 0.2
    else:
        # Default: balanced
        relevance_weight = 0.5
        boost_weight = 0.5

    # Simplified: score = relevance + boost_features
    base_relevance = product_score * 0.6  # Assume fixed base
    boost_effect = product_score * 0.4    # Boost contribution

    utility = relevance_weight * base_relevance + boost_weight * boost_effect
    return position_bias * utility

# Static boost weights: w_discount = 2.0 (aggressive discounting)
product_scores = [8.5, 8.0, 7.8, 7.5, 7.2]  # After applying w_discount=2.0

# User 1: Price hunter clicks aggressively on boosted items
clicks_hunter = [simulate_click_probability(s, i+1, "price_hunter")
                 for i, s in enumerate(product_scores)]

# User 2: Premium shopper is less responsive to discount boosts
clicks_premium = [simulate_click_probability(s, i+1, "premium")
                  for i, s in enumerate(product_scores)]

print("Click probabilities with static discount boost (w=2.0):")
print(f"Price hunter:    {[f'{p:.3f}' for p in clicks_hunter]}")
print(f"Premium shopper: {[f'{p:.3f}' for p in clicks_premium]}")
print(f"\nExpected clicks (price hunter):    {sum(clicks_hunter):.2f}")
print(f"Expected clicks (premium shopper): {sum(clicks_premium):.2f}")

# Compute efficiency loss
loss_ratio = sum(clicks_premium) / sum(clicks_hunter)
print(f"\nPremium shoppers get {(1 - loss_ratio)*100:.0f}% fewer expected clicks")
print("-> Static weights over-index on price sensitivity!")

Output:

Click probabilities with static discount boost (w=2.0):
Price hunter:    ['0.476', '0.214', '0.131', '0.100', '0.076']
Premium shopper: ['0.204', '0.092', '0.056', '0.043', '0.033']

Expected clicks (price hunter):    0.997
Expected clicks (premium shopper): 0.428

Premium shoppers get 57% fewer expected clicks
-> Static weights over-index on price sensitivity!

Tasks 1. Add a third user segment: "brand_loyalist" (80% relevance, 20% boost, but only for specific brands). How does the static weight perform? 2. Find the optimal static weight as a compromise across all three segments. What is the average loss vs. per-segment optimal? 3. Implement a simple context-aware policy: if user_type == "price_hunter": return 2.0 else: return 0.5. Measure improvement over static. 4. Plot expected clicks as a function of w_discount for each segment. Where do the curves intersect?


Lab 1.7 --- Action Space Implementation

Goal: implement the bounded continuous action space from #EQ-1.11. This lab provides the complete ActionSpace class referenced in Section 1.4.

from dataclasses import dataclass
import numpy as np

@dataclass
class ActionSpace:
    """Continuous bounded action space: [-a_max, +a_max]^K.

    Mathematical correspondence: action space A = [-a_max, +a_max]^K, a subset of R^K.
    See #EQ-1.11 for the bound constraint.
    """
    K: int           # Dimensionality (number of boost features)
    a_max: float     # Bound on each coordinate

    def sample(self, rng: np.random.Generator) -> np.ndarray:
        """Sample uniformly from A (for exploration)."""
        return rng.uniform(-self.a_max, self.a_max, size=self.K)

    def clip(self, a: np.ndarray) -> np.ndarray:
        """Project action onto A (enforces bounds).

        This is crucial: if a policy network outputs unbounded logits,
        we must clip to ensure a in A.
        """
        return np.clip(a, -self.a_max, self.a_max)

    def contains(self, a: np.ndarray) -> bool:
        """Check if a in A."""
        return np.all(np.abs(a) <= self.a_max)

    def volume(self) -> float:
        """Lebesgue measure of A = (2 * a_max)^K."""
        return (2 * self.a_max) ** self.K

# Example: K=5 boost features (discount, margin, PL, bestseller, recency)
action_space = ActionSpace(K=5, a_max=0.5)

# Sample random action
rng = np.random.default_rng(seed=42)
a_random = action_space.sample(rng)
print(f"Random action: {a_random}")
print(f"In bounds? {action_space.contains(a_random)}")

# Try an out-of-bounds action (e.g., from an uncalibrated policy)
a_bad = np.array([1.2, -0.3, 0.8, -1.5, 0.4])
print(f"\nBad action: {a_bad}")
print(f"In bounds? {action_space.contains(a_bad)}")

# Clip to enforce bounds
a_clipped = action_space.clip(a_bad)
print(f"Clipped:    {a_clipped}")
print(f"In bounds? {action_space.contains(a_clipped)}")

print(f"\nAction space volume: {action_space.volume():.4f}")

Output:

Random action: [-0.14 -0.36  0.47 -0.03  0.21]
In bounds? True

Bad action: [ 1.2 -0.3  0.8 -1.5  0.4]
In bounds? False
Clipped:    [ 0.5 -0.3  0.5 -0.5  0.4]
In bounds? True

Action space volume: 0.0312

Tasks 1. Extend ActionSpace to support different norms: L2 ball (\(\|a\|_2 \leq r\)) vs. Linf box (current). 2. For \(K=2\) and \(a_{\max}=1\), plot the action space. Sample 1000 points uniformly---how many fall within the L2 ball \(\|a\|_2 \leq 1\)? 3. Implement action discretization: divide each dimension into \(n\) bins and return the \(n^K\) grid points. For \(K=5, n=10\), how many discrete actions? 4. Verify clipping behavior matches zoosim/envs/search_env.py:85 by reading the production code.


Lab 1.8 — Rank-Stability Preview (Delta-Rank@k)

Goal: connect the stability constraint [EQ-1.3c] to the production stability metric Delta-Rank@k (set churn), and verify what is (and is not) wired in the simulator at this stage.

from zoosim.core import config as cfg_module
from zoosim.monitoring.metrics import compute_delta_rank_at_k

cfg = cfg_module.load_default_config()
print("lambda_rank:", cfg.action.lambda_rank)

# A pure swap within the top-10 changes order but not set membership.
ranking_prev = list(range(10))
ranking_curr = [1, 0, 2, 3, 4, 5, 6, 7, 8, 9]
print("Delta-Rank@10:", compute_delta_rank_at_k(ranking_prev, ranking_curr, k=10))

Output:

lambda_rank: 0.0
Delta-Rank@10: 0.0

Tasks 1. Verify that the Delta-Rank implementation matches the set-based definition in Chapter 10 [DEF-10.4] by constructing examples where two top-\(k\) sets differ by exactly \(m\) items (expect \(\Delta\text{-rank}@k = m/k\)). 2. Confirm that lambda_rank exists as a configuration knob (zoosim/core/config.py:230) but is not used by the simulator in Chapter 1; it is reserved for the soft-constraint (Lagrange multiplier) formulation introduced in Chapter 14 (theory in Appendix C).

Status: guardrail wiring

The configuration exposes ActionConfig.lambda_rank (zoosim/core/config.py:230), ActionConfig.cm2_floor (zoosim/core/config.py:232), and ActionConfig.exposure_floors (zoosim/core/config.py:233) so experiments remain reproducible and auditable. Chapter 10 focuses on production guardrails (monitoring, fallback, and hard feasibility filters); Chapter 14 introduces primal--dual constrained RL where multipliers such as lambda_rank become operational in the optimization formulation (implementation status: Chapter 14 §14.6).