Chapter 3 --- Exercises & Labs

We use these labs to keep the operator-theoretic proofs in sync with runnable Bellman code. Each snippet is self-contained so we can execute it directly (e.g., .venv/bin/python - <<'PY' ... PY) while cross-referencing Sections 3.7--3.9.

Lab 3.1 --- Contraction Ratio Tracker

Objective: log \(\| \mathcal{T}V_1 - \mathcal{T}V_2 \|_\infty / \|V_1 - V_2\|_\infty\) and compare it to \(\gamma\).

import numpy as np

gamma = 0.9
P = np.array(
    [
        [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0]],
        [[0.0, 0.6, 0.4], [0.0, 0.3, 0.7]],
        [[0.2, 0.0, 0.8], [0.1, 0.0, 0.9]],
    ]
)
R = np.array(
    [
        [1.0, 0.5],
        [0.8, 1.2],
        [0.0, 0.4],
    ]
)

def bellman_operator(V, P, R, gamma):
    Q = R + gamma * np.einsum("ijk,k->ij", P, V)
    return Q.max(axis=1)

rng = np.random.default_rng(0)
V1 = rng.normal(size=3)
V2 = rng.normal(size=3)
ratio = np.linalg.norm(
    bellman_operator(V1, P, R, gamma) - bellman_operator(V2, P, R, gamma),
    ord=np.inf,
) / np.linalg.norm(V1 - V2, ord=np.inf)
print(f"Contraction ratio: {ratio:.3f} (theory bound = {gamma:.3f})")

Output:

Contraction ratio: 0.872 (theory bound = 0.900)

Tasks 1. Explain the slack between the bound and the observation (hint: the max operator is 1-Lipschitz, so the true ratio is often strictly less than \(\gamma\)). 2. Log the ratio across multiple seeds and include the extrema in Chapter 3 to make [EQ-3.16] concrete.

Lab 3.2 --- Value Iteration Wall-Clock Profiling

Objective: verify the \(O\!\left(\frac{1}{1-\gamma}\right)\) convergence rate numerically by reusing the same toy kernel.

import numpy as np

P = np.array(
    [
        [[0.7, 0.3, 0.0], [0.4, 0.6, 0.0]],
        [[0.0, 0.6, 0.4], [0.0, 0.3, 0.7]],
        [[0.2, 0.0, 0.8], [0.1, 0.0, 0.9]],
    ]
)
R = np.array(
    [
        [1.0, 0.5],
        [0.8, 1.2],
        [0.0, 0.4],
    ]
)

def value_iteration(P, R, gamma, tol=1e-6, max_iters=500):
    V = np.zeros(P.shape[0])
    for k in range(max_iters):
        Q = R + gamma * np.einsum("ijk,k->ij", P, V)
        V_new = Q.max(axis=1)
        if np.linalg.norm(V_new - V, ord=np.inf) < tol:
            return V_new, k + 1
        V = V_new
    raise RuntimeError("Value iteration did not converge")

stats = {}
for gamma in [0.5, 0.7, 0.9]:
    _, iters = value_iteration(P, R, gamma=gamma)
    stats[gamma] = iters
print(stats)

Output:

{0.5: 21, 0.7: 39, 0.9: 128}

Tasks 1. Plot the iteration counts against \(\frac{1}{1-\gamma}\) and reference the figure when explaining [COR-3.7.3] (value iteration convergence rate). 2. Re-run the sweep after perturbing R with zero-mean noise to visualize the reward-perturbation sensitivity bound [PROP-3.7.4].