Book Outline
This is our map. Each chapter builds on the previous one, but you can jump ahead if you're comfortable with the prerequisites.
For detailed objectives, labs, and acceptance criteria for each chapter, see syllabus.md.
Part I — Foundations
The mathematical machinery you need before touching a single line of RL code.
| Ch | Title | Topics | Code |
|---|---|---|---|
| 0 | A Tiny Search Engine That Learns | Motivation and first experiment | scripts/ch00/toy_problem_solution.py |
| 1 | Search Ranking as Optimization | Reward design, contextual bandits, constraints | zoosim/dynamics/reward.py |
| 2 | Probability, Measure, and Click Models | PBM/DBN, position bias, stopping times | zoosim/dynamics/behavior.py |
| 3 | Stochastic Processes and Bellman Foundations | Stochastic processes, MDPs, Bellman operators, contractions | — |
Part II — The Simulator
Where the math meets synthetic data. You'll build a complete search environment.
| Ch | Title | Topics | Code |
|---|---|---|---|
| 4 | Catalog, Users, Queries | Generative design with seeds | zoosim/world/ |
| 5 | Relevance, Features, Reward | Hybrid scoring, feature engineering | zoosim/ranking/ |
Part III — Policies
From discrete bandits to continuous optimization to policy gradients.
| Ch | Title | Topics | Code |
|---|---|---|---|
| 6 | Discrete Template Bandits | LinUCB, Thompson Sampling | zoosim/policies/ |
| 7 | Continuous Actions via Q(x,a) | Value regression, uncertainty, trust regions | zoosim/optimizers/ |
| 8 | Policy Gradient Methods | REINFORCE, variance reduction | — |
Part IV — Evaluation and Deployment
How to know if your policy works without breaking production.
| Ch | Title | Topics | Code |
|---|---|---|---|
| 9 | Off-Policy Evaluation | IPS, SNIPS, DR, FQE | zoosim/evaluation/ |
| 10 | Robustness and Guardrails | (in progress) | zoosim/monitoring/ |
| 11 | Multi-Episode Retention | (planned) | zoosim/multi_episode/ |
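To give a flavor of the estimators listed for Chapter 9, here is a minimal sketch of IPS and SNIPS in plain Python. It is not the code in zoosim/evaluation/; the function names and the logged data are hypothetical and chosen only for illustration. It assumes each logged interaction comes with the logging policy's propensity for the action taken and the target policy's probability of that same action.

```python
from typing import Sequence


def importance_weights(target_probs: Sequence[float],
                       logging_probs: Sequence[float]) -> list[float]:
    """Per-sample ratios pi(a|x) / mu(a|x) for the logged actions."""
    if len(target_probs) != len(logging_probs):
        raise ValueError("probability sequences must have equal length")
    if any(q <= 0.0 for q in logging_probs):
        raise ValueError("logging propensities must be strictly positive")
    return [p / q for p, q in zip(target_probs, logging_probs)]


def ips(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Inverse propensity scoring: the mean of weight * reward."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Hypothetical logged data: four interactions with made-up numbers.
rewards = [1.0, 0.0, 1.0, 1.0]
logging_probs = [0.5, 0.25, 0.5, 0.25]  # propensities under the logging policy
target_probs = [0.5, 0.5, 0.5, 0.5]     # same actions under the target policy

weights = importance_weights(target_probs, logging_probs)
print("IPS:  ", ips(rewards, weights))
print("SNIPS:", snips(rewards, weights))
```

The two estimates differ whenever the importance weights do not sum to the sample size. SNIPS trades a small bias for lower variance, which is one reason both appear side by side in the chapter list.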
Part V — Frontier Methods (Planned)
Where the field is heading. These chapters are on the roadmap.
- Ch 12: Slate RL & Differentiable Ranking
- Ch 13: Offline RL (CQL, IQL, TD3+BC)
- Ch 14: Multi-Objective RL & Fairness
- Ch 15: Non-Stationarity & Meta-Adaptation
Appendices
Foundational mathematics supporting multiple chapters. Read as needed based on background.
| App | Title | Topics | Dependencies |
|---|---|---|---|
| A | Bayesian Preference Models | Hierarchical priors, shrinkage, bandit integration | Ch06 |
| B | Control-Theoretic Background | LQR, HJB, deep RL timeline | Ch01, Ch03 |
| C | Convex Optimization for Constrained MDPs | Lagrangian duality, Slater's condition | Ch01 §1.9; Ch10 (guardrails context); Ch14 (primal-dual constrained RL) |
| D | Information-Theoretic Lower Bounds | KL divergence, Fano's inequality, bandit lower bounds | Ch01 §1.7.6; Ch06 (THM-6.0) |
| E | Vector-Reward Multi-Objective RL | Pareto Q-learning, coverage sets, supported vs unsupported points | Ch14 (multi-objective context) |
When to read:
- Appendix A: When implementing Thompson Sampling or LinUCB with rich features (Chapter 6), or when modeling user preferences in the simulator (Chapter 4)
- Appendix B: If you have a control theory background (LQR, HJB) and want to see the connections to RL, or when control-theoretic tools appear in Chapters 8, 10, and 11
- Appendix C: For the Lagrangian-duality foundations used in Chapter 1 (§1.9), for the conceptual background behind constraints and guardrails in Chapter 10, and before implementing primal-dual constrained RL in Chapter 14
- Appendix D: When you want to understand why \(\Omega(\sqrt{KT})\) regret is unavoidable for bandits, or when Chapter 6 references the minimax lower bound
- Appendix E: When Chapter 14's "multi-objective" framing raises questions about true vector-reward MORL or Pareto Q-learning, or when the CMDP/\(\varepsilon\)-constraint formulation is insufficient
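For orientation, the minimax lower bound mentioned under Appendix D is usually stated in roughly the following form. This is a sketch of the standard statement, not a quotation from the appendix; the symbols \(\pi\) (policy), \(\nu\) (bandit instance), \(R_T\) (cumulative regret), and the constant \(c\) are notation assumed here.

\[
\inf_{\pi} \, \sup_{\nu} \; \mathbb{E}\big[ R_T(\pi, \nu) \big] \;\ge\; c \sqrt{KT}
\qquad \text{for } T \ge K,
\]

where \(K\) is the number of arms, \(T\) is the horizon, the infimum ranges over all policies, the supremum ranges over bandit instances with bounded rewards, and \(c > 0\) is a universal constant. In words: whatever algorithm you pick, some instance forces regret that grows at least like \(\sqrt{KT}\).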
Quick Reference
Each chapter folder contains:
- Main content (ch0X_*.md)
- Exercises (exercises_labs.md)
- Lab solutions (ch0X_lab_solutions.md)
- Archive of earlier drafts (archive/)
The simulator lives in zoosim/:
- core/ — Configuration
- world/ — Catalog, users, queries
- ranking/ — Relevance, features
- dynamics/ — Click models, rewards
- envs/ — Gymnasium interface
- policies/ — Agents (bandits, Q-learning)
- evaluation/ — OPE estimators
- monitoring/ — Drift detection, guardrails
To run tests:
pytest -q
To serve the book locally:
mkdocs serve