Book Outline

This is our map. Each chapter builds on the previous one, but you can jump ahead if you're comfortable with the prerequisites.

For detailed objectives, labs, and acceptance criteria for each chapter, see syllabus.md.

Part I — Foundations

The mathematical machinery you need before touching a single line of RL code.

Ch	Title	Topics	Code
0	A Tiny Search Engine That Learns	Motivation and first experiment	`scripts/ch00/toy_problem_solution.py`
1	Search Ranking as Optimization	Reward design, contextual bandits, constraints	`zoosim/dynamics/reward.py`
2	Probability, Measure, and Click Models	PBM/DBN, position bias, stopping times	`zoosim/dynamics/behavior.py`
3	Stochastic Processes and Bellman Foundations	Stochastic processes, MDPs, Bellman operators, contractions	—

Part II — The Simulator

Where the math meets synthetic data. You'll build a complete search environment.

Ch	Title	Topics	Code
4	Catalog, Users, Queries	Generative design with seeds	`zoosim/world/`
5	Relevance, Features, Reward	Hybrid scoring, feature engineering	`zoosim/ranking/`

Part III — Policies

From discrete bandits to continuous optimization to policy gradients.

Ch	Title	Topics	Code
6	Discrete Template Bandits	LinUCB, Thompson Sampling	`zoosim/policies/`
7	Continuous Actions via Q(x,a)	Value regression, uncertainty, trust regions	`zoosim/optimizers/`
8	Policy Gradient Methods	REINFORCE, variance reduction	—

Part IV — Evaluation and Deployment

How to know if your policy works without breaking production.

Ch	Title	Topics	Code
9	Off-Policy Evaluation	IPS, SNIPS, DR, FQE	`zoosim/evaluation/`
10	Robustness and Guardrails	(in progress)	`zoosim/monitoring/`
11	Multi-Episode Retention	(planned)	`zoosim/multi_episode/`

Part V — Frontier Methods (Planned)

Where the field is heading. These chapters are on the roadmap.

Ch 12: Slate RL & Differentiable Ranking
Ch 13: Offline RL (CQL, IQL, TD3+BC)
Ch 14: Multi-Objective RL & Fairness
Ch 15: Non-Stationarity & Meta-Adaptation

Appendices

Foundational mathematics supporting multiple chapters. Read as needed based on background.

App	Title	Topics	Dependencies
A	Bayesian Preference Models	Hierarchical priors, shrinkage, bandit integration	Ch06
B	Control-Theoretic Background	LQR, HJB, deep RL timeline	Ch01, Ch03
C	Convex Optimization for Constrained MDPs	Lagrangian duality, Slater's condition	Ch01 §1.9; Ch10 (guardrails context); Ch14 (primal-dual constrained RL)
D	Information-Theoretic Lower Bounds	KL divergence, Fano's inequality, bandit lower bounds	Ch01 §1.7.6; Ch06 (THM-6.0)
E	Vector-Reward Multi-Objective RL	Pareto Q-learning, coverage sets, supported vs unsupported points	Ch14 (multi-objective context)

When to read:

- Appendix A: When implementing Thompson Sampling or LinUCB with rich features (Chapter 6), or when modeling user preferences in the simulator (Chapter 4)
- Appendix B: If you have a control theory background (LQR, HJB) and want to see the connections to RL, or when control-theoretic tools appear in Chapters 8, 10, and 11
- Appendix C: For the Lagrangian-duality foundations used in Chapter 1 (§1.9), for the conceptual background behind constraints and guardrails in Chapter 10, and before implementing primal-dual constrained RL in Chapter 14
- Appendix D: When you want to understand why \(\Omega(\sqrt{KT})\) regret is unavoidable for bandits, or when Chapter 6 references the minimax lower bound
- Appendix E: When Chapter 14's "multi-objective" framing raises questions about true vector-reward MORL or Pareto Q-learning, or when a CMDP/\(\varepsilon\)-constraint formulation is insufficient
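
For orientation, the lower bound mentioned under Appendix D is usually stated in the following form. This is the standard textbook statement for stochastic multi-armed bandits, not a quotation of the book's theorem. It assumes \(K\) is the number of arms and \(T\) the number of rounds, which this outline does not define; the version in Chapter 6 and Appendix D may use different notation or constants.

\[
\inf_{\pi} \sup_{\nu} \; \mathbb{E}\big[R_T(\pi, \nu)\big] \;\ge\; c \sqrt{KT}, \qquad T \ge K \ge 2,
\]

Here \(\pi\) ranges over policies, \(\nu\) over bandit instances, \(R_T(\pi, \nu)\) is the cumulative regret of \(\pi\) on \(\nu\) after \(T\) rounds, and \(c > 0\) is a universal constant. In words: for every policy there is some instance on which its expected regret grows at least like \(\sqrt{KT}\).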

Quick Reference

Each chapter folder contains:

- Main content (ch0X_*.md)
- Exercises (exercises_labs.md)
- Lab solutions (ch0X_lab_solutions.md)
- Archive of earlier drafts (archive/)

The simulator lives in zoosim/:

- core/: Configuration
- world/: Catalog, users, queries
- ranking/: Relevance, features
- dynamics/: Click models, rewards
- envs/: Gymnasium interface
- policies/: Agents (bandits, Q-learning)
- evaluation/: OPE estimators
- monitoring/: Drift detection, guardrails
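
As a quick sanity check of a local checkout, the subpackage list above can be compared against what is on disk. The sketch below is only an illustration: the helper name `missing_subpackages` is hypothetical and not part of zoosim, and it assumes the script is run from the repository root with `zoosim/` directly beneath it.

```python
from pathlib import Path

# Subpackage names copied from the list above. The repository layout beyond
# these names is an assumption, not something this outline specifies.
EXPECTED_SUBPACKAGES = (
    "core",
    "world",
    "ranking",
    "dynamics",
    "envs",
    "policies",
    "evaluation",
    "monitoring",
)


def missing_subpackages(repo_root=".", package="zoosim"):
    """Return the expected subpackage names that have no directory on disk.

    Hypothetical helper for illustration; it only checks that directories
    exist and does not import or validate any zoosim code.
    """
    package_dir = Path(repo_root) / package
    return [
        name
        for name in EXPECTED_SUBPACKAGES
        if not (package_dir / name).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_subpackages()
    print("missing subpackages:", ", ".join(missing) if missing else "none")
```

The chapter tables also reference `zoosim/optimizers/` and `zoosim/multi_episode/`, which the list above does not include; add them to `EXPECTED_SUBPACKAGES` if your checkout has them.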

To run tests:

pytest -q

To serve the book locally:

mkdocs serve
