Book Syllabus — RL Search From Scratch (Vlad Prytula)¶

This syllabus lays out a chapter-by-chapter path from foundations and a working simulator (zoosim) to frontier methods for ranking and reinforcement learning. Each chapter lists objectives, theory, implementation hooks, labs, and acceptance criteria so progress is estimable and auditable.

Part I — Foundations¶

0) Chapter 0 — Motivation: A Tiny Search Engine That Learns (docs/book/ch00/ch00_motivation_first_experiment_revised.md) - Objectives: Run a complete small RL experiment (toy bandit with boosts); build intuition for exploration, reward design, and regret. - Theory: Minimal and informal in Ch0; rigor deferred to Ch1–Ch3. - Implementation: scripts/ch00/toy_problem_solution.py; sanity in tests/ch00/test_toy_example.py. - Labs: Plot learning curves; compare to baselines; simple CM2 floor; action‑space geometry. - Acceptance: Reproduce representative outputs in Ch0; explain limitations and the need for formalization.

1) Chapter 1 — Search Ranking as Optimization (docs/book/ch01/ch01_foundations_revised_math+pedagogy_v3.md) - Objectives: Formalize multi-objective reward and the contextual bandit formulation for the Chapter 0 toy and real search; introduce #EQ-1.2 and the multi-episode preview #EQ-1.2-prime. - Theory: Contextual bandit vs MDP; constraints (CM2, exposure, rank stability); engagement as soft viability. - Implementation: Reward weights in zoosim/core/config.py and aggregation in zoosim/dynamics/reward.py. - Labs: Compute reward under different weights; check RPC diagnostic; verify rank-stability penalty wiring. - Acceptance: Reproduce example outputs in ch01; add assertion bound on δ/α in reward code.

2) Chapter 2 — Probability, Measure, and Click Models (PBM/DBN) - Objectives: Model examination/click/purchase; define position bias and abandonment. - Theory: PBM/DBN abstractions; stopping times; unbiased estimators via propensities. - Implementation: Click/exam/buy paths in zoosim/dynamics/behavior.py; position-bias knobs in zoosim/core/config.py. - Labs: Plot CTR vs rank by query type; simulate abandonment vs utility. - Acceptance: CTR curves monotone; abandonment 10–15% on low utility sessions.

3) Chapter 3 — Stochastic Processes & Bellman Foundations - Objectives: Connect single-step bandit to MDP/CMDP; introduce Bellman operators and contractions. - Theory: Value functions; constraints via Lagrangian; preview of OPE as model-based DM. - Implementation: No new code; formalism supports later chapters. - Labs: Verify contraction on a toy MDP. - Acceptance: Proof sketches compile and tie to later code (cross-referenced from labs).

Part II — Simulator and Data¶

4) Chapter 4 — Catalog, Users, Queries (Generative Design) - Objectives: Deterministic world generation with seeds; realistic priors. - Implementation: zoosim/world/{catalog,users,queries}.py, config defaults in zoosim/core/config.py; demo script scripts/ch04/ch04_demo.py. - Labs: Compute catalog price/margin stats; verify deterministic generation given seed. - Acceptance: uv run pytest -q passes tests/test_catalog_stats.py and determinism checks.

5) Chapter 5 — Relevance, Features, Reward - Objectives: Hybrid lexical/semantic relevance; engineered boost features; reward aggregation. - Implementation: zoosim/ranking/{relevance,features}.py, zoosim/dynamics/reward.py; config-driven constants; demo script scripts/ch05/ch05_demo.py. - Labs: Visualize feature interactions (cm2×cat, discount×price_sens); validate reward breakdown. - Acceptance: Feature standardization applied; reward equals weighted sum expected from lab.

6) Chapter 6 — Discrete Template Bandits (LinUCB/TS) - Objectives: Safe, interpretable baseline; honest failure modes; feature-driven improvement. - Implementation: zoosim/policies/{templates,lin_ucb,thompson_sampling}.py; scripts/ch06/template_bandits_demo.py. - Labs: Simple vs rich context features; per-segment diagnostics; template selection analysis. - Acceptance: Simple-features bandits underperform a strong static template; rich features measurably reduce the GMV gap vs static on average; per-segment heterogeneity and theory–practice gap clearly documented.

6A) Chapter 6A — Neural Bandits: From Neural Linear to NL-Bandit (optional bridge chapter) - Objectives: Learn features with neural networks; bridge linear bandits (Ch6) → continuous Q(x,a) (Ch7). - Theory: Neural Linear architecture (f_ψ + linear heads); discrete NL-Bandit (neural Q(x,a) on templates); sample complexity trade-offs. - Implementation: scripts/ch06a/{neural_linear_demo,neural_bandits_demo}.py; reuse policies/{lin_ucb,thompson_sampling,q_ensemble}.py. - Labs: Neural Linear vs rich features vs simple features; NL-Bandit ensemble calibration; overfit diagnosis (small data, large network); transfer learning across segments. - Acceptance: Neural Linear beats rich features when nonlinearity exists + data is abundant (target: +5-10% GMV); degrades gracefully when data is scarce (honest failure modes documented); ensemble uncertainty calibrated (σ correlates with |r − μ(x,a)| on held-out data, ρ ≥ 0.5).

7) Chapter 7 — Continuous Actions via Q(x,a) - Objectives: Regress E[R|x,a]; optimize bounded boosts with uncertainty and trust region. - Implementation: policies/q_ensemble.py, optimizers/cem.py. - Labs: UCB vs greedy selection; Δrank@k trust-region tuning. - Acceptance: +10–15% GMV vs best static template and ≥ Chapter 6 rich-feature bandits, under guardrail compliance and stable uncertainty calibration.

8) Chapter 8 — Policy Gradient Methods (REINFORCE) - Objectives: Derive the Policy Gradient Theorem; implement REINFORCE with Gaussian policies; compare on-policy policy gradients to Chapter 7’s value-based Q-learning and document the theory–practice gap. - Implementation: zoosim/policies/reinforce.py; training and visualization scripts in scripts/ch08/{reinforce_demo.py,neural_reinforce_demo.py,visualize_theory_practice_gap.py}. - Labs: Reproduce REINFORCE vs Q-learning learning curves; ablate entropy regularization and learning rate; analyze failure of deep end-to-end REINFORCE and explore RLHF-style extensions. - Acceptance: REINFORCE reliably improves over random and discrete bandits but underperforms continuous Q-learning (target: ≈2× gap documented and explained); entropy collapse and variance issues are reproduced and diagnosed in labs.

Part III — Evaluation, Robustness, Deployment¶

9) Chapter 9 — Off-Policy Evaluation (OPE) - Objectives: IPS/SNIPS/DR; SWITCH/MAGIC; FQE sanity checks. - Implementation: evaluation/ope.py with logging ε-mix protocol. - Labs: Reproduce OPE vs on-policy consistency on simulator data. - Acceptance: OPE estimates within 95% CI of online estimates on held-out seeds.

10) Chapter 10 — Robustness & Guardrails - Objectives: Drift detection; safe fallback; monitoring (Δrank@k, CTR, CVR, GMV, CM2, latency). - Implementation: monitoring/metrics.py; drift detectors (CUSUM/Page–Hinkley). - Labs: Inject synthetic drifts; verify alarms and rollback. - Acceptance: Automatic recovery to baseline within N episodes after drift.

11) Chapter 11 — Multi-Episode Inter-Session MDP (maps ch01 #EQ-1.2-prime) - Objectives: Introduce retention/satisfaction state; show engagement implicit via long-term value. - Theory: Hazard/survival modeling; episodic transitions s_{t+1}=f(s_t, clicks_t, buys_t, seasonality). - Implementation: zoosim/multi_episode/session_env.py, zoosim/multi_episode/retention.py (hazard + satisfaction dynamics). - Labs: Compare single-step proxy (δ·CLICKS) to multi-episode value (discounted GMV); retention curves by segment. - Acceptance: Policy ordering agrees ≥80% short horizon; retention monotone in engagement; value estimates stable across seeds.

Part IV — Frontier Methods¶

12) Chapter 12 — Slate RL & Differentiable Ranking - Objectives: Optimize top-k lists; add diversity/novelty; enable gradient-based ranking under constraints. - Implementation: policies/slate_q.py, policies/diff_ranker.py (SoftSort/NeuralSort/Gumbel–Sinkhorn wrappers), evaluation/ope_slate.py. - Labs: Diversity@k vs GMV trade-off; top-k OPE correction; compare gradient vs discrete slate methods. - Acceptance: +2–5% diversity@k at ≤1% GMV loss; OPE↔online agreement within 95% CI.

13) Chapter 13 — Offline RL for Boosts (Pessimistic) - Objectives: Train from logs safely; select conservative policies. - Implementation: policies/offline/{cql.py,iql.py,td3bc.py} for bounded continuous actions. - Labs: Train on logged mixtures; evaluate with OPE; ablate pessimism strength. - Acceptance: Offline policy ≥ NL-Bandit; no constraint regressions; stable under ε changes.

14) Chapter 14 — Multi-Objective RL & Fairness at Scale - Objectives: CMDP + Pareto policies via \(\varepsilon\)-constraint method; exposure/utility parity across segments and provider groups (private-label vs national-brand, category, strategic). - Implementation: zoosim/policies/mo_cmdp.py with primal-dual; fairness metrics in zoosim/evaluation/fairness.py. - Labs: Plot Pareto fronts; fairness gap sweeps (scripts/ch14/lab_01_pareto_fronts.py, scripts/ch14/lab_02_fairness_gap_sweeps.py). - Acceptance: <1% constraint violations; fair exposure within target bands with minimal GMV loss (price of fairness quantified).

15) Chapter 15 — Non‑Stationarity & Meta‑Adaptation - Objectives: Drift-aware bandits/RL; warm-start and adaptation playbooks. - Implementation: policies/adaptive_bandit.py; change-point aware exploration; schedule updates. - Labs: Regret under synthetic shifts; recovery speed. - Acceptance: Bounded regret; quick recovery to baseline; no guardrail breaches.

Part V — Appendices¶

Appendix A — Bayesian Preference Models (Hierarchical, Optional) - Objectives: Infer user-level price and PL preferences (θ_price, θ_pl) from logged interactions via hierarchical Bayes; connect these estimates to the rich context features used in Chapter 6. - Theory: Hierarchical priors over segments and users; partial pooling and shrinkage; simple hierarchical logistic/normal models; how posterior means and variances relate to the Bellman / value-function framing from Chapter 3. - Implementation: - Text: docs/book/appendix_a_bayesian_preference_models.md (complete Bayesian appendix with hierarchical preference models and Thompson-style bandit integration). - Future code hook: bayes/price_sensitivity_model.py (hierarchical model over segments and users, fit on simulator logs); exports posterior means as φ_rich_est-style features for Chapter 6 bandits. - Labs: Fit the model on simulated logs; compare posterior means to true simulator θ_price, θ_pl; plug these estimates into the template bandit (in place of oracle latents) and compare performance across φ_simple, φ_rich (oracle), and φ_rich_est (estimated). - Acceptance: Posterior estimates shrink sensibly toward segment means for low-data users; bandit performance with Bayesian θ̂ sits between φ_simple and oracle φ_rich, illustrating realistic gains from better preference modeling.

Appendix B — Control Theory Connections - Objectives: Bridge RL to classical control; show LQR as special case of value iteration; connect Bellman to HJB. - Theory: LQR/LQG; Riccati equations; Lyapunov stability; HJB equation as continuous-time Bellman. - Implementation: Text only (docs/book/appendix_b_control_theory.md); optional numerical demos. - Referenced by: Ch3 (Bellman operators), Ch7 (continuous actions), Ch11 (multi-episode dynamics).

Appendix C — Convex Optimization for RL - Objectives: Provide background on Lagrangian duality, KKT conditions, and primal-dual methods for constrained RL. - Theory: Convex sets and functions; Lagrangian and dual problems; KKT conditions; primal-dual algorithms. - Implementation: Text only (docs/book/appendix_c_convex_optimization.md). - Referenced by: Ch1 §1.9 (Lagrangian view), Ch10 (constraints), Ch14 (multi-objective CMDP).

Appendix D — Information-Theoretic Lower Bounds - Objectives: Prove fundamental limits on bandit learning; establish optimality of UCB/TS algorithms. - Theory: KL divergence and mutual information; data processing inequality; Fano's inequality (THM-D.1); bandit lower bound via hypothesis testing (THM-D.2: \(\Omega(\sqrt{KT})\)); extensions to contextual and continuous-action settings. - Implementation: Text only (docs/book/appendix_d_information_theory.md). - Referenced by: Ch1 §1.7.6 (regret preview), Ch6 THM-6.0 (formal lower bound statement). - Acceptance: Complete proofs for Fano's inequality and the \(\Omega(\sqrt{KT})\) lower bound; forward references from Ch1 and Ch6 resolve correctly.

Appendix E — Vector-Reward Multi-Objective RL - Objectives: Explain true vector-reward MORL (Pareto Q-learning, coverage sets) and contrast with CMDP/\(\varepsilon\)-constraint; justify why CMDP suffices for production ranking. - Theory: Vector-reward MDPs; Pareto dominance and Pareto fronts; Pareto Q-learning and Q-sets; coverage sets; supported vs unsupported Pareto points; hypervolume indicator. - Implementation: Text only (docs/book/appendix_e_vector_morl.md). - Referenced by: Ch14 bridge subsection (forward reference for "true MORL"); Ch14 update plan (CRITICAL PREREQUISITE requirement #3). - Acceptance: Clear taxonomy of single-policy vs multi-policy MORL; explanation of when CMDP is sufficient vs when true MORL is needed; supported/unsupported distinction sharpens \(\varepsilon\)-constraint vs scalarization comparison.

Milestones and Estimates¶

Part I (Ch 1–3): 1.5–2 weeks
Part II (Ch 4–8): 2.5–3 weeks
Part III (Ch 9–11): 2–2.5 weeks
Part IV (Ch 12–15): 3–4 weeks
Total (MVP + Frontier): ~9–11.5 weeks (parallelizable in Parts III–IV)

Cross‑References and “Code ↔ X” Conventions¶

Use equation anchors {#EQ-X.Y} immediately after tags and cite in prose (e.g., “By #EQ-1.2…”).
Place “Code ↔ X” admonitions in chapters to link theory to implementation with file references (line numbers when stable). For planned modules above, reference the file path until lines exist.

Quick Validation Commands¶

Environment

python -m venv .venv && source .venv/bin/activate
pip install uv && uv pip install -r pyproject.toml && uv pip install -e .

Smoke tests

uv run pytest -q

Output (representative):

2 passed in 3.2s

Viz example: reuse the listing embedded in Chapter 5 (seed 42) to reproduce the scatter/histogram plots referenced in the text.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search