Chapter 4 — Exercises & Labs
Total estimated time: 2.5 hours
These exercises provide hands-on practice with the generative world model from Chapter 4. All code should be runnable in a Jupyter notebook or Python script with the zoosim package installed.
Exercise 1: Catalog Statistics (30 minutes)
Objective: Generate a synthetic catalog and verify distributional properties.
In a real retailer, the first thing analysts do with a new dataset is not train a model; it is to look at basic distributions. Are dog‑food prices where merchandising expects them to be? Is litter really being run as a loss‑leader, or did something drift? This exercise places us in that role for our simulator: we sanity‑check that the synthetic catalog behaves like a plausible e‑commerce assortment before trusting downstream RL experiments.
Setup:
import numpy as np
import matplotlib.pyplot as plt
from zoosim.core.config import SimulatorConfig
from zoosim.world.catalog import generate_catalog
cfg = SimulatorConfig(seed=42)
rng = np.random.default_rng(cfg.seed)
catalog = generate_catalog(cfg.catalog, rng)
Tasks:
- Price percentiles (10 min)
- Compute 25th, 50th (median), 75th percentiles for each category
- Compare median to theoretical value \(e^{\mu}\) from config
- Print results in a formatted table
python
# Your code here
Expected output:
Category | P25 | Median | P75 | Theory (e^μ)
------------------------------------------------------------
dog_food | $10.27 | $13.41 | $17.35 | $13.46
cat_food | $9.36 | $12.17 | $16.06 | $12.18
litter | $7.07 | $8.81 | $11.18 | $9.03
toys | $4.00 | $5.97 | $9.00 | $6.05
- Margin verification (10 min)
- Compute mean CM2 for each category
- Verify litter has negative average margin
- Verify toys have highest average margin
python
# Your code here
Expected results:
- Litter: mean CM2 < 0
- Toys: mean CM2 > all other categories
- Price vs. CM2 scatter plot (10 min)
- Create 2x2 subplot grid (one subplot per category)
- Scatter plot of price (x-axis) vs. CM2 (y-axis)
- Overlay theoretical line: CM2 = β·price (from config)
- Add title, axis labels, legend
python
# Your code here
# Hint: Use cfg.catalog.margin_slope[category] for slope β
**Expected visualization:** Four panels with clear linear trends; slopes match `cfg.catalog.margin_slope` up to noise.
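The percentile check in the first task can be rehearsed without zoosim. The sketch below is a stand-in, not the simulator's generator: it draws prices with the standard library's `lognormvariate`, assumes the dog_food parameters μ = 2.6 and σ = 0.4 quoted later in this chapter's exercises, and compares sample quartiles with the closed-form quartiles \(e^{\mu \pm z_{0.75}\sigma}\) and the median \(e^{\mu}\). The helper name `lognormal_quartiles` is hypothetical.

```python
import math
import random
import statistics

# Assumed dog_food parameters (mu=2.6, sigma=0.4); illustration only.
MU, SIGMA = 2.6, 0.4


def lognormal_quartiles(mu: float, sigma: float) -> tuple[float, float, float]:
    """Theoretical P25, median, and P75 of LogNormal(mu, sigma)."""
    z75 = statistics.NormalDist().inv_cdf(0.75)  # roughly 0.674
    return (
        math.exp(mu - z75 * sigma),
        math.exp(mu),
        math.exp(mu + z75 * sigma),
    )


# Stand-in for catalog prices: plain lognormal draws, not zoosim output.
rng = random.Random(42)
prices = [rng.lognormvariate(MU, SIGMA) for _ in range(20_000)]

# statistics.quantiles with n=4 returns the three quartile cut points.
p25, median, p75 = statistics.quantiles(prices, n=4)
t25, t50, t75 = lognormal_quartiles(MU, SIGMA)

print(f"{'':8s} | {'P25':>7s} | {'Median':>7s} | {'P75':>7s}")
print(f"{'sample':8s} | {p25:7.2f} | {median:7.2f} | {p75:7.2f}")
print(f"{'theory':8s} | {t25:7.2f} | {t50:7.2f} | {t75:7.2f}")
```

With a real catalog, the same comparison would use `np.percentile` on the per-category prices, as suggested in the solution hints.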
Exercise 2: User Segment Analysis (30 minutes)
Objective: Sample users and analyze segment-specific preferences.
Personalization only makes sense if different users truly want different things. In production, teams maintain audience definitions (“value shoppers”, “premium”, “private‑label loyalists”) and routinely inspect how those segments behave. Here we do the same with our simulated users: we verify that the segment mix matches the configuration and that each segment occupies a distinct region in preference space.
Setup:
from zoosim.world.users import sample_user
cfg = SimulatorConfig(seed=2025_1108)
rng = np.random.default_rng(cfg.seed)
# Generate 10,000 users
users = [sample_user(config=cfg, rng=rng) for _ in range(10_000)]
Tasks:
- Segment distribution (5 min)
- Count users per segment
- Compute empirical probabilities
- Compare to cfg.users.segment_mix
python
# Your code here
Expected output:
Segment | Count | Empirical | Expected
-----------------------------------------------
price_hunter | 3,445 | 0.345 | 0.350
pl_lover | 2,596 | 0.260 | 0.250
premium | 1,488 | 0.149 | 0.150
litter_heavy | 2,471 | 0.247 | 0.250
- Preference scatter plots (15 min)
- Create 2x2 subplot grid (one subplot per segment)
- Each subplot: scatter plot of θ_price (x-axis) vs. θ_pl (y-axis)
- Mark segment mean with large red star
- Add horizontal/vertical lines at zero
- Add title with segment name
python
# Your code here
**Expected pattern:**
- Price hunters: Strong negative θ_price, negative θ_pl (avoid PL)
- Premium: Positive θ_price, negative θ_pl (avoid PL)
- PL lovers: Moderate negative θ_price, strong positive θ_pl
- Litter heavy: Moderate negative θ_price, positive θ_pl
- Category affinity validation (10 min)
- For litter-heavy segment, compute mean category affinity vector
- Print probabilities for each category
- Verify litter affinity ≈ 0.85 (85%)
python
litter_heavy_users = [u for u in users if u.segment == "litter_heavy"]
# Your code here
Expected output:
Litter-heavy segment category affinities:
dog_food: 0.053
cat_food: 0.047
litter: 0.851 ← Should be ~85%
toys: 0.049
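The segment-distribution task boils down to counting labels and dividing by the sample size. As a rough sketch that does not depend on zoosim, the block below replaces `sample_user` with a plain categorical draw over the four segment names, using the probabilities from the "Expected" column of the table above. Only the counting pattern carries over to the real exercise; the sampling itself is an assumption.

```python
import random
from collections import Counter

# Expected mix copied from the "Expected" column of the segment table.
segment_mix = {
    "price_hunter": 0.35,
    "pl_lover": 0.25,
    "premium": 0.15,
    "litter_heavy": 0.25,
}

rng = random.Random(2025)
n_users = 10_000

# Stand-in for sample_user: draw only the segment label.
segments = rng.choices(
    list(segment_mix), weights=list(segment_mix.values()), k=n_users
)
counts = Counter(segments)
empirical = {seg: counts[seg] / n_users for seg in segment_mix}

print(f"{'Segment':13s} | {'Count':>6s} | Empirical | Expected")
for seg, expected in segment_mix.items():
    print(f"{seg:13s} | {counts[seg]:6,d} | {empirical[seg]:9.3f} | {expected:8.3f}")
```

With 10,000 draws, the standard error of each empirical share is about 0.005 or less, so deviations of a percentage point, as in the expected output, are unremarkable.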
Exercise 3: Query Intent Coupling (30 minutes)
Objective: Verify query intents align with user category affinities.
Search logs are full of hints about what customers actually want: some queries scream “litter refill”, others quietly suggest “browse toys while I’m here”. In a healthy system, the distribution of query intents should line up with who is visiting the site. This exercise checks that our simulator respects that principle: litter‑heavy users should fire more litter queries, premium users should lean toward food queries, and the overall query‑type mix should look like a real e‑commerce search bar.
Setup:
from zoosim.world.queries import sample_query
cfg = SimulatorConfig(seed=2025_1108)
rng = np.random.default_rng(cfg.seed)
# Generate 5,000 (user, query) pairs
user_query_pairs = []
for _ in range(5_000):
    user = sample_user(config=cfg, rng=rng)
    query = sample_query(user=user, config=cfg, rng=rng)
    user_query_pairs.append((user, query))
Tasks:
- Query type distribution (10 min)
- Count queries by type (category, brand, generic)
- Compute empirical probabilities
- Verify the mix matches cfg.queries.query_type_mix (60% category, 20% brand, 20% generic)
python
# Your code here
Expected output:
Query Type | Count | Empirical | Expected
------------------------------------------
category | 3,003 | 0.601 | 0.600
brand | 1,002 | 0.200 | 0.200
generic | 995 | 0.199 | 0.200
- Intent rate by segment (15 min)
- For each segment, compute percentage of queries that have intent_category = "litter"
- Compare to expected litter affinity from Dirichlet concentration parameters
python
# Your code here
# Hint: Dirichlet mean for dimension k is α_k / Σ_j α_j
Expected output:
Segment | Litter Query Rate | Expected Affinity
-------------------------------------------------------
price_hunter | 0.213 | ~0.200
pl_lover | 0.212 | ~0.200
premium | 0.045 | ~0.050
litter_heavy | 0.860 | ~0.850
- Embedding similarity (5 min)
- For first 100 (user, query) pairs, compute cosine similarity between user.theta_emb and query.phi_emb
- Compute mean similarity
- Verify high similarity (> 0.8) since query embedding is user embedding + small noise
```python
import torch.nn.functional as F

# Your code here
# Hint: Use F.cosine_similarity() or manual dot product / norms
```
Expected result: Mean similarity ≈ 0.95-0.99 (query ≈ user preference)
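For the intent-rate task, the hint that a Dirichlet mean is \(\alpha_k / \sum_j \alpha_j\) can be checked numerically. The sketch below uses only the standard library and draws Dirichlet samples by normalizing independent Gamma variates. The concentration vector `alpha` is hypothetical: it was chosen so that the litter mean is 0.85 and is not read from the zoosim configuration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Closed-form mean: alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


categories = ["dog_food", "cat_food", "litter", "toys"]
# HYPOTHETICAL concentration parameters, picked so litter has mean 17/20.
alpha = [1.0, 1.0, 17.0, 1.0]

rng = random.Random(0)
n = 5_000
samples = [sample_dirichlet(alpha, rng) for _ in range(n)]
empirical_mean = [sum(s[k] for s in samples) / n for k in range(len(alpha))]
theory = dirichlet_mean(alpha)

for cat, emp, th in zip(categories, empirical_mean, theory):
    print(f"{cat:9s} empirical={emp:.3f} theory={th:.3f}")
```

In the exercise itself, the theoretical column would come from each segment's concentration parameters in the config, and the empirical column from the fraction of that segment's queries with a litter intent.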
Exercise 4: Determinism Verification (15 minutes)
Objective: Verify that the same seed produces identical worlds.
In a production A/B test, traffic cannot be replayed; we only see it once. Reproducible simulators are the opposite: they should allow us to rewind and replay exactly the same synthetic experiment to debug a policy change or a subtle regression. This exercise formalizes that discipline: with the same seed we must recover the same catalog and users, bit‑for‑bit, so that any change in results can be traced back to code or configuration—not to random noise.
Tasks:
- Identical catalogs (5 min)
- Generate two catalogs with seed 42
- Assert products are identical at indices [0, 10, 100, 999]
- Check: price, cm2, category, is_pl, embedding
```python
from zoosim.core.config import CatalogConfig

cfg = CatalogConfig()

catalog1 = generate_catalog(cfg, np.random.default_rng(42))
catalog2 = generate_catalog(cfg, np.random.default_rng(42))

# Your assertions here
# For embeddings: use torch.equal(emb1, emb2)
```
- Different catalogs with different seeds (5 min)
- Generate two catalogs with seeds 42 and 123
- Assert products differ at index 0
- Verify at least one attribute is different
python
# Your code here
- Full world determinism (5 min)
- Generate 100 users with seed 2025
- Generate 100 users with seed 2025 again
- Assert all users have identical segments and preferences
python
# Your code here
Success criteria: All assertions pass, demonstrating [EQ-4.10] from the chapter.
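The assertion pattern for the determinism tasks can be tried on a toy generator first. The sketch below is an illustration under stated assumptions: `make_world` is a hypothetical stand-in for the zoosim generators, and it uses `random.Random` instead of `np.random.default_rng`. What matters is the pattern: build a fresh generator from the seed inside the function, then compare outputs exactly.

```python
import random

CATEGORIES = ["dog_food", "cat_food", "litter", "toys"]


def make_world(seed: int, n: int = 5) -> list[tuple[float, str]]:
    """Hypothetical toy world: n (price, category) pairs from a fresh RNG."""
    rng = random.Random(seed)  # fresh generator per call, never shared state
    return [
        (rng.lognormvariate(2.6, 0.4), rng.choice(CATEGORIES))
        for _ in range(n)
    ]


world_a = make_world(42)
world_b = make_world(42)
world_c = make_world(123)

# Same seed: exact equality, no tolerance needed.
assert world_a == world_b
# Different seed: at least one attribute should differ.
assert world_a != world_c
print("determinism checks passed")
```

If the generator were created once and shared across calls, the second call would continue the stream rather than restart it, and the first assertion would fail.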
Exercise 5: Domain Randomization (45 minutes, Advanced)
Objective: Implement domain randomization for robust policy training.
Background: Policies robust to simulator variability often transfer better to production (sim-to-real transfer). We randomize parameters to create diverse training environments.
Think of launching the same ranking policy in ten different countries or seasons: prices, margins, and customer mixes all shift, sometimes dramatically. If we tune an agent to a single “average” configuration, it will often break in at least one of those markets. Domain randomization is the simulator analogue of that reality check: by sampling slightly different but plausible worlds, we force the policy to learn behaviors that survive small changes in catalog economics and audience composition.
Tasks:
- Randomization function (15 min)
- Implement randomize_config(base_cfg, rng, perturbation=0.1)
- Perturb price distribution parameters: μ ± perturbation, σ ± perturbation
- Perturb margin slopes: β ± perturbation
- Perturb segment mix probabilities (renormalize to sum to 1)
- Return new SimulatorConfig
```python
def randomize_config(
    base_cfg: SimulatorConfig,
    rng: np.random.Generator,
    perturbation: float = 0.1,
) -> SimulatorConfig:
    """Create randomized configuration for domain randomization.

    Perturbs:
    - Catalog price parameters (μ, σ)
    - Margin slopes (β)
    - Segment mix probabilities

    Args:
        base_cfg: Base configuration
        rng: Random generator
        perturbation: Relative perturbation magnitude (default 0.1 = ±10%)

    Returns:
        Randomized configuration
    """
    # Your implementation here
    # Hint: Use copy.deepcopy(base_cfg) and modify in place
    pass
```
- Generate ensemble (10 min)
- Create 10 randomized configurations
- For each, generate a 1,000-product catalog
- Compute mean litter CM2 for each configuration
- Print distribution of mean litter CM2 across configurations
python
# Your code here
Expected output: Range of mean litter CM2 values (e.g., [-0.35, -0.15])
- Robustness experiment (optional, 20 min, requires Chapter 6)
- Implement simple LinUCB bandit (preview of Chapter 6)
- Train on base configuration for 1,000 episodes
- Evaluate on 10 randomized configurations
- Measure GMV degradation: (GMV_base - GMV_randomized) / GMV_base
- Plot histogram of degradation across configurations
python
# This requires Chapter 6 LinUCB implementation
# Skip if not yet covered
Conceptual questions:
- Why does training on randomized configurations improve robustness?
- What is the trade-off between realism and randomization?
Answer (brief): Randomization forces the policy to learn behaviors that hold up under distribution shift instead of overfitting to specific parameter values. The trade-off: too much randomization creates unrealistic scenarios the policy will never encounter, which wastes training data.
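The one step of `randomize_config` that is easy to get wrong is the segment mix: after perturbing each probability independently, the values no longer sum to 1 and must be renormalized. A minimal sketch of just that step follows, on a plain dict rather than a `SimulatorConfig`. The function name `perturb_and_renormalize` is hypothetical, and the multiplicative perturbation follows the solution hint (`p * (1 + U(-perturbation, perturbation))`).

```python
import random


def perturb_and_renormalize(
    mix: dict[str, float],
    rng: random.Random,
    perturbation: float = 0.1,
) -> dict[str, float]:
    """Scale each probability by a random relative factor, then renormalize."""
    scaled = {
        name: p * (1.0 + rng.uniform(-perturbation, perturbation))
        for name, p in mix.items()
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


base_mix = {
    "price_hunter": 0.35,
    "pl_lover": 0.25,
    "premium": 0.15,
    "litter_heavy": 0.25,
}

rng = random.Random(7)
for i in range(3):
    new_mix = perturb_and_renormalize(base_mix, rng, perturbation=0.1)
    shares = ", ".join(f"{k}={v:.3f}" for k, v in new_mix.items())
    print(f"config {i}: {shares} (sum={sum(new_mix.values()):.6f})")
```

Returning a new dict leaves the base mix untouched, which mirrors the `copy.deepcopy(base_cfg)` hint: every randomized world should start from the same unmodified baseline.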
Exercise 6: Statistical Tests (20 minutes, Optional)
Objective: Apply formal statistical tests to generated data.
Data scientists in large e‑commerce companies do not stop at eyeballing histograms; they routinely run goodness‑of‑fit tests to catch subtle drifts and modeling mistakes. If simulated dog‑food prices stop looking lognormal, or the segment mix no longer matches the business definition, any conclusions drawn by RL agents become suspect. This exercise gives us a light‑weight version of that toolkit: formal tests that say “this looks consistent with our assumptions” rather than relying on visual judgment alone.
Tasks:
- Goodness-of-fit test for lognormal prices (10 min)
- Generate 1,000 dog food products
- Extract prices
- Apply Kolmogorov-Smirnov test: Is data consistent with LogNormal(2.6, 0.4)?
```python
from scipy.stats import lognorm, kstest
from zoosim.core.config import CatalogConfig

cfg = CatalogConfig()
rng = np.random.default_rng(42)

# Generate products and extract dog_food prices
# ...

# KS test
# lognorm parameters: s=sigma, scale=exp(mu)
result = kstest(prices, lambda x: lognorm.cdf(x, s=0.4, scale=np.exp(2.6)))
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")

# If p-value > 0.05, fail to reject the null hypothesis
# (the data is consistent with the lognormal; this does not prove it)
```
- Chi-square test for segment distribution (10 min)
- Sample 10,000 users
- Count users per segment
- Apply chi-square goodness-of-fit test
- Null hypothesis: Observed counts match expected probabilities from segment_mix
```python
from scipy.stats import chisquare

# Your code here
# Hint: chisquare(observed_counts, expected_counts),
# where expected_counts = n_users * segment probabilities
```
Expected results: Both tests should fail to reject their null hypotheses (p > 0.05), which is consistent with the generator matching the specified distributions.
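To see what `chisquare` computes, the statistic \(\sum_k (O_k - E_k)^2 / E_k\) can be worked by hand. The sketch below uses the observed counts from the expected-output table in Exercise 2 and the segment probabilities from its "Expected" column. It compares the statistic with 7.815, the standard 5% critical value of the chi-square distribution with 3 degrees of freedom. It is a cross-check, not a replacement for the SciPy call.

```python
# Observed counts from the Exercise 2 expected-output table.
observed = {
    "price_hunter": 3445,
    "pl_lover": 2596,
    "premium": 1488,
    "litter_heavy": 2471,
}
# Segment probabilities from the same table's "Expected" column.
expected_probs = {
    "price_hunter": 0.35,
    "pl_lover": 0.25,
    "premium": 0.15,
    "litter_heavy": 0.25,
}

n_users = sum(observed.values())
expected_counts = {seg: n_users * p for seg, p in expected_probs.items()}

# Pearson chi-square statistic: sum over segments of (O - E)^2 / E.
statistic = sum(
    (observed[seg] - expected_counts[seg]) ** 2 / expected_counts[seg]
    for seg in observed
)

# 5% critical value for df = (4 segments - 1) = 3.
CRITICAL_5PCT_DF3 = 7.815

print(f"chi-square statistic: {statistic:.3f}")
print(f"reject at 5%: {statistic > CRITICAL_5PCT_DF3}")
```

The statistic comes out near 4.98, below the critical value, so the counts in that table do not contradict the configured segment mix.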
Exercise 7: Convergence of Catalog Statistics (20 minutes, Optional)
Objective: Verify law of large numbers for lognormal price means.
In Chapter 1 we treated expectations as mathematical objects. Here we get to see one of those expectations—the mean of a lognormal price distribution—emerge empirically as we increase catalog size. This is exactly the kind of sanity check production teams run before trusting summary dashboards or offline simulations.
Tasks:
- Mean price vs. catalog size (15 min)
- Consider dog-food prices, which follow \(\text{LogNormal}(\mu=2.6, \sigma=0.4)\)
- Recall from Chapter 1: \(\mathbb{E}[\text{price}] = e^{\mu + \sigma^2/2}\)
- For \(N \in \{100, 500, 1000, 5000, 10{,}000, 50{,}000\}\):
- Create a CatalogConfig with n_products = N
- Generate a catalog with fixed seed 42
- Compute mean price for the dog_food category
- Create a
```python
import numpy as np
import matplotlib.pyplot as plt
from zoosim.core.config import CatalogConfig
from zoosim.world.catalog import generate_catalog

Ns = [100, 500, 1000, 5000, 10_000, 50_000]
mu, sigma = 2.6, 0.4
true_mean = np.exp(mu + sigma**2 / 2)

mean_prices = []
for N in Ns:
    cfg = CatalogConfig(n_products=N)
    rng = np.random.default_rng(42)
    catalog = generate_catalog(cfg, rng)
    dog_prices = [p.price for p in catalog if p.category == "dog_food"]
    mean_prices.append(np.mean(dog_prices))

plt.figure(figsize=(8, 4))
plt.plot(Ns, mean_prices, marker="o", label="Empirical mean (dog_food)")
plt.axhline(true_mean, color="red", linestyle="--",
            label=f"Theoretical mean = e^(μ+σ²/2) ≈ ${true_mean:.2f}")
plt.xscale("log")
plt.xlabel("Catalog size N (log scale)")
plt.ylabel("Mean price ($)")
plt.title("Convergence of Dog-Food Mean Price")
plt.legend()
plt.grid(alpha=0.3)
plt.savefig("catalog_mean_price_convergence.png", dpi=150)
print("Saved plot to catalog_mean_price_convergence.png")

print("\nMean dog-food prices by N:")
for N, m in zip(Ns, mean_prices):
    print(f" N={N:6d}: mean=${m:6.2f}")
```
**Output (representative):**
```
Mean dog-food prices by N:
N= 100: mean=$13.83
N= 500: mean=$14.02
N= 1000: mean=$14.54
N= 5000: mean=$14.62
N= 10000: mean=$14.54
N= 50000: mean=$14.55
Theoretical mean (e^{2.6 + 0.4^2/2}) ≈ $14.59
```
As \(N\) grows, the empirical mean converges to the theoretical value, illustrating the law of large numbers in the concrete setting of catalog statistics.
- Conceptual reflection (5 min)
- Why does the simulator default to \(N = 10{,}000\) products?
- What breaks if we only use \(N = 100\) products for RL training?
Hint: Think about variance of estimates, coverage of rare but important products, and the stability of downstream policy gradients.
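The "variance of estimates" part of the hint can be quantified. For LogNormal(μ, σ), the standard deviation is \(e^{\mu + \sigma^2/2}\sqrt{e^{\sigma^2} - 1}\), so the standard error of a sample mean over n products is that value divided by \(\sqrt{n}\). The sketch below evaluates this for the dog-food parameters. Note that n is the number of dog-food products, not the catalog size N; the category share is not stated in this section, so the values of n are purely illustrative.

```python
import math

MU, SIGMA = 2.6, 0.4  # dog-food price parameters from this exercise

# Closed-form mean and standard deviation of LogNormal(MU, SIGMA).
mean = math.exp(MU + SIGMA**2 / 2)
sd = mean * math.sqrt(math.exp(SIGMA**2) - 1.0)


def standard_error(n: int) -> float:
    """Standard error of the sample mean over n independent prices."""
    return sd / math.sqrt(n)


print(f"mean = {mean:.2f}, sd = {sd:.2f}")
# Illustrative per-category sample sizes (assumed, not from the config).
for n in (25, 250, 2_500, 25_000):
    se = standard_error(n)
    print(f"n={n:6d}: standard error = {se:5.2f} ({100 * se / mean:4.1f}% of mean)")
```

Each tenfold increase in n shrinks the standard error by a factor of about 3.16, which is why small catalogs give visibly noisy category statistics.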
Lab: Complete World Generation Pipeline (30 minutes)
Objective: Integrate catalog, users, and queries into a complete world generation workflow.
In production settings, teams often maintain nightly “world snapshots”: rolled‑up statistics and JSON dumps that downstream dashboards, notebooks, and training jobs consume. The goal is not to store every click, but to have a coherent view of “what the world looked like” on a given day. This lab has the same flavor. We wire together catalog, user, and query generation and materialize a small, self‑contained snapshot that other chapters—and future experiments—can reuse without regenerating everything from scratch.
Task:
Write a script that:
1. Loads configuration from file or default
2. Generates catalog (10,000 products)
3. Samples 1,000 users
4. For each user, samples 5 queries
5. Saves results to disk (CSV or JSON)
6. Prints summary statistics
Starter code:
import json
from pathlib import Path

import numpy as np

from zoosim.core.config import SimulatorConfig
from zoosim.world.catalog import generate_catalog


def generate_world(config: SimulatorConfig, output_dir: Path):
    """Generate complete world and save to disk.

    Args:
        config: Simulator configuration
        output_dir: Directory to save results
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng(config.seed)

    # 1. Generate catalog
    print("Generating catalog...")
    catalog = generate_catalog(config.catalog, rng)

    # Save catalog statistics (not full catalog, too large)
    catalog_stats = {
        "n_products": len(catalog),
        "price_mean_by_category": {
            cat: float(np.mean([p.price for p in catalog if p.category == cat]))
            for cat in config.catalog.categories
        },
        # Add more statistics here
    }
    with open(output_dir / "catalog_stats.json", "w") as f:
        json.dump(catalog_stats, f, indent=2)

    # 2. Generate users
    print("Generating users...")
    # Your code here

    # 3. Generate queries
    print("Generating queries...")
    # Your code here

    # 4. Save and print summary
    print("\nSummary:")
    print(f" Catalog: {len(catalog)} products")
    # Your summary here


if __name__ == "__main__":
    cfg = SimulatorConfig(seed=2025_1108)
    generate_world(cfg, Path("./world_output"))
Deliverables:
- catalog_stats.json: Price/margin statistics by category
- users.json: User segments and aggregate statistics
- queries.json: Query type distribution and intent coupling metrics
- Console output with summary statistics
Success criteria:
- All files generated
- Statistics match expectations from chapter
- Script runs in < 30 seconds on standard laptop
Bonus Challenge: Catalog Embeddings Visualization (Optional)
Objective: Visualize product embeddings in 2D using dimensionality reduction.
Modern search teams routinely project high‑dimensional embeddings down to two or three dimensions to debug models and explain behavior to stakeholders. If “dog food” and “litter” products are hopelessly entangled in embedding space, no amount of clever ranking logic will fully fix relevance. This bonus challenge gives us the simulator version of that diagnostic: we look at the learned‑by‑construction clusters and convince ourselves that categories are separated the way a human merchandiser would expect.
Tasks:
- Generate catalog with 10,000 products
- Extract embeddings (16D) for all products
- Apply UMAP or t-SNE to reduce to 2D
- Create scatter plot colored by category
- Verify products cluster by category around shared centroids
Starter code:
from umap import UMAP # pip install umap-learn
# or: from sklearn.manifold import TSNE
import torch
# Generate catalog
cfg = CatalogConfig(n_products=10_000)
rng = np.random.default_rng(42)
catalog = generate_catalog(cfg, rng)
# Extract embeddings
embeddings = torch.stack([p.embedding for p in catalog]).numpy() # (10000, 16)
# Reduce to 2D
reducer = UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
for cat in cfg.categories:
    mask = [p.category == cat for p in catalog]
    plt.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
                label=cat, alpha=0.5, s=10)
plt.legend()
plt.title("Product Embeddings (UMAP projection)")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.savefig("embeddings_umap.png", dpi=150)
plt.show()
Expected result: Four distinct clusters (one per category), each concentrated around a distinct category centroid with varying tightness based on emb_cluster_std from config.
Solution Hints
Exercise 1:
- Use np.percentile(prices, [25, 50, 75])
- Theoretical median: np.exp(cfg.catalog.price_params[cat]["mu"])
- Margin verification: litter_cm2 = [p.cm2 for p in catalog if p.category == "litter"]
Exercise 2:
- Segment counting: collections.Counter([u.segment for u in users])
- Category affinity: np.mean([u.theta_cat for u in litter_heavy_users], axis=0)
Exercise 3:
- Query type: collections.Counter([q.query_type for _, q in user_query_pairs])
- Intent coupling: Group by segment, compute fraction where q.intent_category == "litter"
Exercise 4:
- Embedding comparison: torch.equal(catalog1[idx].embedding, catalog2[idx].embedding)
Exercise 5:
- Deep copy config: import copy; new_cfg = copy.deepcopy(base_cfg)
- Perturb: new_mu = mu * (1 + rng.uniform(-perturbation, perturbation))
- Renormalize simplex: new_probs / new_probs.sum()
Testing Solutions
Run all exercises in a Jupyter notebook or as Python scripts. Expected total runtime: ~20 minutes (excluding optional exercises).
Validation:
- All numerical results should match expected outputs within ±5% (stochastic variation)
- Plots should show expected patterns (clusters, correlations, distributions)
- Determinism tests should pass exactly (no variation allowed)
Common issues:
- RNG state: Always create new rng = np.random.default_rng(seed) before each exercise
- Tensor comparisons: Use torch.equal() for exact equality, not ==
- Floating-point precision: Use np.allclose(a, b, rtol=1e-5) instead of a == b
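The floating-point item above is easy to demonstrate. The sketch below uses `math.isclose` from the standard library as a scalar analogue of `np.allclose`; the helper `all_close` is hypothetical and only mimics the relative-tolerance part of that function.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3 in binary.
print(a == b)                             # False
# Tolerance-based comparison succeeds.
print(math.isclose(a, b, rel_tol=1e-5))   # True


def all_close(xs, ys, rel_tol=1e-5):
    """Hypothetical helper: element-wise isclose over two equal-length lists."""
    return len(xs) == len(ys) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(xs, ys)
    )


prices_a = [13.41, 12.17, 8.81]
prices_b = [p * (1 + 1e-9) for p in prices_a]  # tiny relative drift
print(all_close(prices_a, prices_b))      # True
```

Tolerances are for comparing against theoretical values or across platforms. The determinism checks in Exercise 4 should still use exact equality, since the same seed is expected to reproduce results without any variation.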
Discussion Questions
- Realism vs. Simplicity: Our simulator uses lognormal prices and linear margins. What real-world phenomena do we miss? (seasonality, promotions, competitor pricing)
- Segment heterogeneity: We have 4 segments. Production might have 100+. How should we:
  - Learn segments from data (clustering, mixture models)?
  - Handle continuous preference distributions instead of discrete segments?
- Sim-to-real gap: If production transfer fails (sim-trained policy performs poorly), what debugging steps should we take?
  - Compare distributions (price, CTR, query types)
  - Check feature coverage (are production features in simulator?)
  - Evaluate on randomized configurations (domain randomization)
  - Fine-tune with offline RL on production logs (Chapter 13)
- Embedding generation: We use Gaussian clusters. In production, embeddings come from learned models (Word2Vec, transformers). What properties must these embeddings have for our simulator to be realistic?
  - Smooth: Similar products → similar embeddings
  - Separable: Different categories → distinguishable
  - Aligned with user preferences: User query embedding → high similarity with relevant products
- Scalability: Our simulator generates 10K products, 10K users. Production has 100M+ products, billions of users. What computational bottlenecks arise?
  - Catalog generation: Vectorize with NumPy instead of Python loops
  - Embedding storage: Use approximate nearest neighbors (FAISS, Annoy)
  - User sampling: Pre-generate user population, sample from cache
End of Chapter 4 Exercises & Labs
These exercises reinforce the generative world model concepts from Chapter 4. By completing them, we will have hands-on experience with:
- Catalog generation and statistical validation
- User segment modeling and preference distributions
- Query intent coupling and embedding similarity
- Deterministic reproducibility (critical for RL experiments)
- Domain randomization for robust policy learning
Next: Chapter 5 — Relevance, Features, and Counterfactual Ranking