Appendix E --- Vector-Reward Multi-Objective RL

Vlad Prytula


Motivation

Chapter 14 uses CMDP + \(\varepsilon\)-constraint sweeps to trace Pareto frontiers for search ranking. This appendix covers the alternative: vector-reward multi-objective RL (MORL), where the agent maintains a vector of Q-values and policies are compared by Pareto dominance rather than scalar reward.

We explain:

  1. When this generalization is necessary
  2. Why CMDP suffices for production ranking
  3. The computational trade-offs between approaches

Prerequisites. This appendix assumes familiarity with MDP fundamentals (Chapter 3), Lagrangian duality (Appendix C), and the CMDP formulation (Chapter 14 \(\S\)14.1--14.3).

Why this matters. The Chapter 14 title "Multi-Objective RL" is accurate but requires context. True vector-reward MORL is a distinct and more general framework. This appendix provides that context, ensuring readers understand both what we deliver and what we omit.


E.1 Vector-Reward MDPs

E.1.1 Definition

Definition E.1.1 (Vector-Reward MDP) {#DEF-E.1.1}

A vector-reward MDP is a tuple \((\mathcal{S}, \mathcal{A}, P, \mathbf{r}, \gamma)\) where:

  • \(\mathcal{S}\) is the state space
  • \(\mathcal{A}\) is the action space
  • \(P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})\) is the transition kernel
  • \(\mathbf{r}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^m\) is a vector-valued reward function
  • \(\gamma \in [0,1)\) is the discount factor

The return under policy \(\pi\) is also a vector:

$$ \mathbf{G}^\pi = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t \mathbf{r}(s_t, a_t)\right] \in \mathbb{R}^m \tag{E.1} $$

Example E.1.1 (Search ranking as vector MDP). In our setting, we could define: $$ \mathbf{r}(s,a) = \begin{pmatrix} \text{GMV}(s,a) \\ \text{CM2}(s,a) \\ -\Delta\text{rank}(s,a) \\ -\text{exposure\_gap}(s,a) \end{pmatrix} \in \mathbb{R}^4 $$

Each component is a distinct objective. The sign convention ensures all components should be maximized (costs are negated).
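To make this concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the chapter's codebase) that accumulates the discounted vector return of Eq. (E.1) over one trajectory:

```python
import numpy as np

def vector_return(rewards, gamma=0.99):
    """Discounted vector return G = sum_t gamma^t r_t for one trajectory.

    rewards: array of shape (T, m), one m-dimensional reward per step.
    Returns a vector in R^m, one discounted sum per objective.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))        # (T,)
    return (discounts[:, None] * rewards).sum(axis=0)   # (m,)

# Example: 3 steps, m = 4 objectives (GMV, CM2, -Δrank, -exposure_gap)
traj = [[10.0, 2.0, -1.0, -0.1],
        [ 8.0, 1.5,  0.0, -0.2],
        [12.0, 2.5, -2.0,  0.0]]
print(vector_return(traj, gamma=0.9))
```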

E.1.2 Pareto Dominance

Definition E.1.2 (Pareto Dominance) {#DEF-E.1.2}

For vectors \(\mathbf{v}, \mathbf{v}' \in \mathbb{R}^m\), we say \(\mathbf{v}\) Pareto dominates \(\mathbf{v}'\) (written \(\mathbf{v} \succ \mathbf{v}'\)) if:

$$ \forall i \in \{1, \ldots, m\}: v_i \geq v'_i \quad \text{and} \quad \exists j: v_j > v'_j \tag{E.2} $$

That is, \(\mathbf{v}\) is at least as good in all objectives and strictly better in at least one.
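The definition translates directly into code; a minimal sketch:

```python
import numpy as np

def pareto_dominates(v, w):
    """True iff v Pareto dominates w (Eq. E.2): v is at least as good
    everywhere and strictly better in at least one objective."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    return bool(np.all(v >= w) and np.any(v > w))

assert pareto_dominates([1.0, 2.0], [1.0, 1.0])      # strictly better in one
assert not pareto_dominates([2.0, 0.0], [1.0, 1.0])  # trade-off: incomparable
```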

Definition E.1.3 (Pareto Optimal Policy) {#DEF-E.1.3}

A policy \(\pi^*\) is Pareto optimal if there exists no policy \(\pi\) such that \(\mathbf{G}^\pi \succ \mathbf{G}^{\pi^*}\).

Definition E.1.4 (Pareto Front) {#DEF-E.1.4}

The Pareto front (or Pareto frontier) is the set of all return vectors achievable by Pareto optimal policies:

$$ \mathcal{P} = \{\mathbf{G}^\pi : \pi \text{ is Pareto optimal}\} \tag{E.3} $$

Remark E.1.1 (Non-uniqueness). Unlike scalar MDPs, vector MDPs typically have infinitely many Pareto optimal policies. The goal of MORL is either to (a) find a single policy given user preferences, or (b) approximate the entire Pareto front.


E.2 Pareto Q-Learning

E.2.1 The Q-Set Approach

In scalar Q-learning, we maintain \(Q(s,a) \in \mathbb{R}\). In Pareto Q-learning, we maintain a set of non-dominated Q-vectors for each state-action pair.

Definition E.2.1 (Pareto Q-Set) {#DEF-E.2.1}

For each \((s,a)\), the Pareto Q-set \(\mathcal{Q}(s,a) \subseteq \mathbb{R}^m\) contains all non-dominated vectors representing possible returns from taking action \(a\) in state \(s\) and following some Pareto optimal policy thereafter.

Algorithm E.2.1 (Pareto Q-Learning Update) {#ALG-E.2.1}

Given transition \((s, a, \mathbf{r}, s')\):

  1. Compute candidate vectors: \(\mathcal{C} = \{\mathbf{r} + \gamma \mathbf{v} : \mathbf{v} \in \mathcal{Q}(s', a') \text{ for some } a'\}\)
  2. Apply the Pareto filter: \(\mathcal{Q}(s,a) \leftarrow \text{ND}(\mathcal{Q}(s,a) \cup \mathcal{C})\)

where \(\text{ND}(\cdot)\) returns the non-dominated subset.
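A minimal tabular sketch of this update, assuming Q-sets are stored as lists of numpy vectors keyed by \((s, a)\) and reusing `pareto_dominates` from the sketch in \(\S\)E.1.2:

```python
import numpy as np

def non_dominated(vectors):
    """ND(.): keep the Pareto-non-dominated subset of a list of vectors.
    Relies on pareto_dominates from the sketch in E.1.2."""
    return [v for i, v in enumerate(vectors)
            if not any(pareto_dominates(u, v)
                       for j, u in enumerate(vectors) if j != i)]

def pareto_q_update(Q, s, a, r, s_next, actions, gamma=0.99):
    """One Pareto Q-learning update (Algorithm E.2.1) for (s, a, r, s_next).

    Q: dict mapping (state, action) -> list of m-dimensional numpy vectors.
    """
    r = np.asarray(r, dtype=float)
    # Step 1: candidate vectors r + gamma * v over all successor Q-sets;
    # an empty successor set bootstraps from the zero vector.
    candidates = [r + gamma * v
                  for a_next in actions
                  for v in Q.get((s_next, a_next), [np.zeros_like(r)])]
    # Step 2: merge with the current Q-set and keep the non-dominated subset.
    Q[(s, a)] = non_dominated(Q.get((s, a), []) + candidates)
```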

Remark E.2.1 (Computational complexity). The Q-sets can grow exponentially in the number of objectives \(m\) and the number of states \(|\mathcal{S}|\). In the worst case, \(|\mathcal{Q}(s,a)|\) can be \(O(|\mathcal{S}|^{m-1})\). This makes Pareto Q-learning impractical for large state spaces or many objectives. See [@van_moffaert:morl:2014] for analysis.

E.2.2 Practical Approximations

Several approaches mitigate the complexity:

  1. Pruning: Keep only the \(k\) most diverse non-dominated points (see the sketch after this list)
  2. Clustering: Group similar Q-vectors and keep representatives
  3. Function approximation: Learn a parameterized mapping \(Q_\theta(s,a) \to \mathbb{R}^m\) (single vector per \((s,a)\), losing multi-policy coverage)
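As referenced in item 1, a minimal sketch of diversity pruning via a greedy farthest-point heuristic (one reasonable choice among many, not a prescribed method):

```python
import numpy as np

def prune_diverse(vectors, k):
    """Keep at most k points from a non-dominated set, chosen greedily
    to spread out the retained points (farthest-point heuristic)."""
    pts = [np.asarray(v, dtype=float) for v in vectors]
    if len(pts) <= k:
        return pts
    # Seed with the largest-norm point, then repeatedly add the point
    # farthest (in min-distance) from everything kept so far.
    kept = [max(pts, key=lambda v: np.linalg.norm(v))]
    while len(kept) < k:
        kept.append(max(pts, key=lambda v: min(np.linalg.norm(v - u)
                                               for u in kept)))
    return kept
```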

E.3 Coverage Sets and Pareto Front Approximation

E.3.1 Coverage Sets

Definition E.3.1 (Coverage Set) {#DEF-E.3.1}

A coverage set (CS) for a vector-reward MDP is a finite set of policies \(\Pi_{\text{CS}} = \{\pi_1, \ldots, \pi_k\}\) such that for any user preference vector \(\mathbf{w} \in \mathbb{R}^m_+\), there exists \(\pi_i \in \Pi_{\text{CS}}\) that is optimal or near-optimal under the scalarization \(\mathbf{w}^\top \mathbf{G}^\pi\).

Coverage sets are the MORL analog of covering all user preferences with a finite policy library.

E.3.2 Supported vs. Unsupported Pareto Points

The following distinction is crucial for understanding the relationship between CMDP/\(\varepsilon\)-constraint methods and true MORL:

Definition E.3.2 (Supported Pareto Point) {#DEF-E.3.2}

A Pareto optimal return \(\mathbf{v}^* \in \mathcal{P}\) is supported if there exists a weight vector \(\mathbf{w} \in \mathbb{R}^m_{++}\) (all strictly positive) such that:

$$ \mathbf{v}^* \in \arg\max_{\mathbf{v} \in \mathcal{P}} \mathbf{w}^\top \mathbf{v} \tag{E.4} $$

Definition E.3.3 (Unsupported Pareto Point) {#DEF-E.3.3}

A Pareto optimal return \(\mathbf{v}^* \in \mathcal{P}\) is unsupported if no such weight vector exists. Geometrically, unsupported points lie in "concave" regions of the Pareto front.

Remark E.3.1 (Scalarization limitation). Linear scalarization (weighted sums) can only find supported Pareto points. The \(\varepsilon\)-constraint method (Chapter 14) can find both supported and unsupported points, which is why it is preferred for tracing the full Pareto front. See [@miettinen:multi_objective_optimization:1999, Chapter 4] for the geometric intuition.

Example E.3.1 (Unsupported point in ranking). Consider a trade-off between GMV and fairness with three Pareto optimal policies:

  • Policy A: (GMV=100, fairness=0.5)
  • Policy B: (GMV=90, fairness=0.6)
  • Policy C: (GMV=80, fairness=0.95)

B lies below the line segment joining A and C (at GMV=90, the segment has fairness 0.725 > 0.6), so the front is concave there and B is unsupported: no linear combination \(w_1 \cdot \text{GMV} + w_2 \cdot \text{fairness}\) with positive weights yields B as optimal. The \(\varepsilon\)-constraint method finds B by setting \(\text{fairness} \geq 0.6\) and maximizing GMV.
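A quick numeric check of this geometry, using the illustrative values above: sweeping linear weights never selects B, while the \(\varepsilon\)-constraint does:

```python
import numpy as np

points = {"A": np.array([100.0, 0.50]),   # (GMV, fairness)
          "B": np.array([ 90.0, 0.60]),
          "C": np.array([ 80.0, 0.95])}

# Linear scalarization: for every weight, some other policy beats B.
for w1 in np.linspace(0.0, 1.0, 101):
    w = np.array([w1, 1.0 - w1])
    best = max(points, key=lambda p: w @ points[p])
    assert best != "B"  # B is never the argmax: it is unsupported

# epsilon-constraint: require fairness >= 0.6, maximize GMV among feasible.
feasible = {p: v for p, v in points.items() if v[1] >= 0.6}
print(max(feasible, key=lambda p: feasible[p][0]))  # -> "B"
```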

E.3.3 Empirical Pareto Front Evaluation

Definition E.3.4 (Hypervolume Indicator) {#DEF-E.3.4}

Given a reference point \(\mathbf{r}_{\text{ref}} \in \mathbb{R}^m\) (typically a lower bound on all objectives), the hypervolume indicator of a Pareto front approximation \(\mathcal{P}'\) is the \(m\)-dimensional volume dominated by \(\mathcal{P}'\) and bounded by \(\mathbf{r}_{\text{ref}}\):

$$ \text{HV}(\mathcal{P}', \mathbf{r}_{\text{ref}}) = \text{Vol}\left(\bigcup_{\mathbf{v} \in \mathcal{P}'} [\mathbf{r}_{\text{ref}}, \mathbf{v}]\right) \tag{E.5} $$

Hypervolume is a standard metric for comparing Pareto front approximations ([@vamplew:morl:2011]).
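For \(m = 2\), Eq. (E.5) reduces to a sum of rectangle areas swept across the sorted front. A minimal sketch for a maximization front:

```python
def hypervolume_2d(front, ref):
    """Hypervolume indicator (Eq. E.5) for a 2-objective maximization front.

    front: iterable of (f1, f2) points, each dominating the reference
    point ref. Sweeps points in decreasing f1 order, adding the rectangle
    each point contributes above the previous f2 level; points that fail
    the f2 check are dominated and contribute nothing.
    """
    pts = sorted(front, key=lambda p: p[0], reverse=True)
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > prev_f2:
            hv += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return hv

print(hypervolume_2d([(3, 1), (2, 2), (1, 3)], ref=(0, 0)))  # -> 6.0
```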


E.4 Multi-Policy vs. Single-Policy MORL

E.4.1 Taxonomy

MORL algorithms fall into two categories:

| Approach | Goal | Output | Example |
|---|---|---|---|
| Multi-policy | Approximate entire Pareto front | Set of policies \(\Pi_{\text{CS}}\) | Pareto Q-learning |
| Single-policy | Find one policy given preferences | Single policy \(\pi^*\) | Scalarized RL, CMDP |

E.4.2 When to Use Each

Multi-policy MORL is appropriate when:

  1. User preferences are unknown at training time
  2. Different users have different (and potentially incomparable) preferences
  3. The system must serve diverse user populations with a policy library

Single-policy MORL (including CMDP) is appropriate when:

  1. Business constraints are fixed by domain experts
  2. One policy serves all users
  3. Objectives can be scalarized or constrained with known thresholds


E.5 Why CMDP Suffices for Production Ranking

Chapter 14 uses CMDP with primal-dual optimization rather than true vector-reward MORL. Here's why this is the right choice for e-commerce search ranking:

E.5.1 Objectives Are Comparable

In our setting:

  • GMV and CM2 are both measured in currency (dollars, euros)
  • Stability (\(\Delta\)-rank) has a clear business threshold (e.g., \(\leq 5\) position changes)
  • Fairness gaps have regulatory or policy-defined bands

These objectives have natural units and business-defined acceptable ranges. There is no "incomparability" requiring user-specific preference elicitation.

E.5.2 Thresholds Set by Business, Not Per-Query

A key observation: constraint thresholds \(\tau_{\text{cm2}}\), \(\tau_{\text{stability}}\), \(\tau_{\text{fairness}}\) are set by business stakeholders, not by individual users at query time. This means:

  • One policy per threshold setting is sufficient
  • The \(\varepsilon\)-constraint sweep (varying thresholds, training one policy each) traces the Pareto front adequately, as sketched after this list
  • No need for a coverage set or multi-policy library
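To make the sweep concrete, here is a toy sketch over a finite candidate set (illustrative numbers); in production, each step would instead train one CMDP policy at threshold \(\tau\) as in Chapter 14:

```python
# Toy epsilon-constraint sweep: for each threshold tau, keep the feasible
# candidate with maximum GMV. The collected points trace the Pareto front.
candidates = [(100, 0.10), (95, 0.13), (88, 0.16), (78, 0.19)]  # (GMV, CM2)

frontier = []
for tau in [0.10, 0.12, 0.14, 0.16, 0.18]:
    feasible = [p for p in candidates if p[1] >= tau]
    if feasible:
        frontier.append((tau, max(feasible, key=lambda p: p[0])))

for tau, (gmv, cm2) in frontier:
    print(f"tau={tau:.2f} -> GMV={gmv}, CM2={cm2}")
```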

E.5.3 Computational Tractability

| Approach | Complexity | Practicality |
|---|---|---|
| Pareto Q-learning | Q-sets up to \(O(|\mathcal{S}|^{m-1})\) in the worst case | Impractical for large state spaces or many objectives |
| CMDP + primal-dual | Same as single-objective RL | Production-ready |

CMDP adds only \(m\) dual variables (Lagrange multipliers) to a standard policy gradient algorithm. The overhead is negligible.
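The claim is easy to verify in code. A minimal sketch of the dual step that primal-dual CMDP adds to a standard policy-gradient loop, assuming constraints of the form \(\text{value} \geq \tau_i\) (flip signs for upper-bounded costs; names and the learning rate are illustrative):

```python
import numpy as np

def dual_update(lmbda, constraint_values, thresholds, lr=0.01):
    """Projected gradient ascent on the m Lagrange multipliers.

    lmbda:             current multipliers, shape (m,)
    constraint_values: measured constraint returns under the current policy
    thresholds:        business-set thresholds tau_i
    A multiplier grows while its constraint is violated, shrinks once it is
    satisfied, and is projected back to lambda >= 0.
    """
    violation = np.asarray(thresholds) - np.asarray(constraint_values)
    return np.maximum(0.0, np.asarray(lmbda) + lr * violation)

# The policy-gradient step then maximizes the scalarized reward
#   r_lagrangian = r_main + sum_i lmbda_i * (c_i - tau_i)
lmbda = dual_update(np.zeros(3), constraint_values=[0.11, 4.0, 0.02],
                    thresholds=[0.14, 5.0, 0.05])
print(lmbda)  # all three constraints violated -> all multipliers increase
```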

E.5.4 Lagrange Multipliers Encode Trade-offs

The optimal multipliers \(\lambda^*\) from CMDP have an economic interpretation: \(\lambda_i^*\) is the marginal rate of substitution between the reward and constraint \(i\). This provides the same trade-off information as the Pareto front slope, but in a form directly usable for business decisions.


E.6 When You Actually Need True MORL

Despite the above, there are scenarios where CMDP is insufficient:

  1. Incomparable objectives: When objectives have no natural common scale (e.g., "user satisfaction" vs. "environmental impact") and no stakeholder can set thresholds.

  2. User-specific preferences at query time: If different users want different GMV-fairness trade-offs and these preferences are revealed at serving time, a coverage set is needed.

  3. Fairness as Pareto constraint: Some fairness definitions require that no policy should dominate another with respect to group-specific outcomes. This is a Pareto-based fairness criterion that cannot be expressed as a scalar constraint.

  4. Exploratory analysis: When the Pareto front shape is unknown and business stakeholders want to see all achievable trade-offs before setting thresholds.

For these cases, the references in \(\S\)E.7 provide starting points.


Summary

This appendix established the relationship between CMDP (Chapter 14) and true vector-reward MORL:

| Aspect | CMDP + \(\varepsilon\)-constraint | Vector-reward MORL |
|---|---|---|
| Reward | Scalar + constraints | Vector-valued |
| Policy comparison | Feasibility + scalar optimality | Pareto dominance |
| Output | One policy per threshold | Coverage set |
| Pareto coverage | Supported + unsupported | Full front |
| Complexity | Standard RL + \(m\) multipliers | Exponential in \(m\) |
| Use case | Business-defined thresholds | User-specific preferences |

Key takeaway: For production search ranking with business-defined constraints, CMDP is the practical choice. It traces the Pareto frontier via threshold sweeps without the computational burden of maintaining Q-sets. True vector-reward MORL is reserved for scenarios with incomparable objectives or user-specific preferences revealed at serving time.


E.7 References and Further Reading

Foundational surveys:

  • [@roijers:survey_morl:2013] --- Comprehensive survey on multi-objective sequential decision-making; defines taxonomy and coverage sets
  • [@hayes:morl_survey:2022] --- Modern practical guide with algorithm implementations

Pareto Q-learning:

  • [@van_moffaert:morl:2014] --- Pareto Q-learning with non-dominated policy sets; complexity analysis

Multi-objective optimization (classical):

  • [@miettinen:multi_objective_optimization:1999] --- Foundational text on \(\varepsilon\)-constraint and scalarization methods
  • [@ehrgott:multicriteria_optimization:2005] --- Comprehensive treatment of Pareto optimality; supported vs. unsupported points

Empirical evaluation:

  • [@vamplew:morl:2011] --- Coverage sets, hypervolume indicator, and empirical Pareto front evaluation methods

Constrained MDPs:

  • [@altman:constrained_mdps:1999] --- The foundational monograph on constrained Markov decision processes