Convergent Properties of Planning and RL Algorithms
Jongeun Choi
School of Mechanical Engineering, Yonsei University
March 2025
Planning Algorithms: Value Iteration (VI) and Policy Iteration (PI)
Assumptions:
- Finite MDP with known transition dynamics and rewards
- Discount factor $\gamma \in [0, 1)$
- No need for exploration (model is fully known)
Convergence:
- Value Iteration: converges to $V^*$ using Bellman optimality operator (contraction mapping)
- Policy Iteration: converges to $\pi^*$ in finite steps (strict improvement per iteration)
Key Operator Property:
$$
T^* V(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) V(s') \right]
$$
Both algorithms produce the optimal policy and value for a given MDP.
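A minimal NumPy sketch of value iteration using the Bellman optimality operator above; the toy transition tensor `P`, reward array `R`, discount, and tolerance are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal value-iteration sketch on a toy finite MDP (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a).
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality operator: (T* V)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        Q = R + gamma * P @ V              # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # contraction => geometric convergence
            break
        V = V_new
    policy = Q.argmax(axis=1)              # greedy policy w.r.t. the converged values
    return V, policy

# Tiny 2-state, 2-action example (made-up numbers) just to exercise the routine.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```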
Reinforcement Learning Algorithms: SARSA vs Q-learning
Assumptions:
- Environment is unknown (model-free)
- Agent learns via sampling transitions
- Requires exploration to ensure convergence
Update Rules:
- SARSA (on-policy): solves Bellman expectation equation
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
$$
- Q-learning (off-policy): solves Bellman optimality equation
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
$$
Exploration:
- Required: typically with $\varepsilon$-greedy policies
- Decaying $\varepsilon \to 0$ ensures convergence to $Q^*$
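The two tabular update rules side by side, as a sketch; `Q` is assumed to be an array indexed by `[state, action]`, and `alpha`/`gamma` are placeholder hyperparameters.

```python
import numpy as np

# Sketch of the two one-step updates (Q indexed as Q[state, action]; alpha, gamma assumed).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD target: uses the action a_next actually taken by the behavior policy
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target: uses the greedy (max) action regardless of what was executed
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```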
Convergence Conditions Summary Table
| Algorithm | Needs Model | Exploration | Converges To | Guarantee |
|---|---|---|---|---|
| Value Iteration | Yes | No | $V^*$, $\pi^*$ | Yes (via contraction) |
| Policy Iteration | Yes | No | $V^*$, $\pi^*$ | Yes (finite steps) |
| Q-learning | No | Yes | $Q^*$ | Yes (off-policy) |
| SARSA | No | Yes | $Q^{\pi}$ | Yes (on-policy) |
Regularity conditions: for RL, convergence requires that every state-action pair is visited infinitely often, that rewards are bounded, that step sizes satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), and, for convergence to $Q^*$ with SARSA, that $\varepsilon$ decays to zero.
Target vs Behavior Policy
Behavior Policy $\mu$
- The policy used to interact with the environment
- Generates experience (actions taken)
- Often includes exploration (e.g., $\varepsilon$-greedy)
Target Policy $\pi$
- The policy being learned or improved
- Represents desired behavior (e.g., greedy w.r.t. $Q$)
- In off-policy RL: $\pi \ne \mu$; in on-policy RL: $\pi = \mu$
Comparison Table:
| Policy | Role |
|---|---|
| Behavior $\mu$ | Generates actions for interaction |
| Target $\pi$ | Evaluated or optimized policy |
Target vs Behavior Policy: Diagram
Behavior policy $\mu$ generates experience; target policy $\pi$ is improved using that experience.
On-policy vs Off-policy Learning
On-policy:
- The behavior policy is the same as the target policy; the agent learns the value of the policy it is currently executing
- Examples: SARSA, Actor-Critic
- Sensitive to behavior policy (e.g., $\varepsilon$-greedy must decay)
Off-policy:
- Learns value of a different (target) policy than behavior
- Examples: Q-learning, DQN
- Can learn optimal policy while behaving randomly
Key Difference:
- On-policy follows its current policy while learning
- Off-policy separates exploration from exploitation
Advantages of Off-policy Learning
- Off-policy methods can search for an optimal policy using experience samples generated by any other policy.
- The behavior policy can be selected to be exploratory. This is a key advantage in practice.
- For example, to estimate the action-values of all state-action pairs, we can use an exploratory policy to ensure that all pairs are visited sufficiently often.
This makes off-policy methods like Q-learning highly flexible and robust in terms of exploration and data reuse.
SARSA: Optimal Policy Learning (Generalized policy iteration)
Initialization:
- Constant step size $\alpha_t(s,a) = \alpha > 0$ for all $(s,a)$ and all $t$; exploration rate $\varepsilon \in (0,1)$
- Initial $q_0(s,a)$ for all $(s,a)$; initial $\varepsilon$-greedy policy $\pi_0$ derived from $q_0$
Goal: Learn an optimal policy that leads to the target state from initial state $s_0$
For each episode, do:
- Generate $a_0$ at $s_0$ following $\pi_0(s_0)$
- While $s_t$ is not terminal:
- Collect $(r_{t+1}, s_{t+1}, a_{t+1})$ from interaction (i.e., environment + $\pi_t$)
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}))]
$$
- Update policy:
$$
\pi_{t+1}(a \mid s_t) = \begin{cases}
1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
\frac{\varepsilon}{|\mathcal{A}(s_t)|} & \text{otherwise}
\end{cases}
$$
- Update $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
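A sketch of this generalized-policy-iteration loop in Python. It assumes a minimal, Gym-style environment interface (`env.reset() -> s0`, `env.step(a) -> (s_next, r, done, info)`); the hyperparameters and the helper `epsilon_greedy` are illustrative, not part of the slides.

```python
import numpy as np

# SARSA with epsilon-greedy policy improvement (sketch under the assumptions above).
def epsilon_greedy(Q, s, epsilon, rng):
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform over all actions
    return int(np.argmax(Q[s]))               # exploit: greedy action w.r.t. current Q

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # q_{t+1}(s,a) = q_t(s,a) - alpha [ q_t(s,a) - (r + gamma q_t(s',a')) ]
            Q[s, a] -= alpha * (Q[s, a] - (r + gamma * Q[s_next, a_next]))
            s, a = s_next, a_next              # the epsilon-greedy policy is implicit in Q
    return Q
```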
Why Does $|\mathcal{A}| - 1$ Appear in $\varepsilon$-Greedy?
$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = a^* \\
\dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$
Let $a^*$ be the greedy action and $|\mathcal{A}|$ be the total number of actions.
$$
\sum_{a \in \mathcal{A}} \pi(a \mid s)
= \pi(a^* \mid s) + \sum_{a \ne a^*} \pi(a \mid s)
= \left(1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|}\right) + (|\mathcal{A}| - 1) \cdot \dfrac{\varepsilon}{|\mathcal{A}|}
= 1
$$
Conclusion: $|\mathcal{A}| - 1$ accounts for the non-greedy actions, ensuring that the total probability sums to 1.
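A quick numeric check of the identity above; $|\mathcal{A}|$ and $\varepsilon$ are chosen arbitrarily.

```python
# Numeric check that the epsilon-greedy probabilities sum to 1 (arbitrary |A| and epsilon).
n_actions, epsilon = 4, 0.25
p_greedy = 1 - epsilon + epsilon / n_actions   # probability of the greedy action a*
p_other = epsilon / n_actions                  # probability of each non-greedy action
total = p_greedy + (n_actions - 1) * p_other   # (|A| - 1) non-greedy actions take the rest
print(total)                                   # 1.0
```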
Q-learning: Off-policy Implementation
In the off-policy version of Q-learning, the behavior policy $\pi_b$ generates episodes, while a separate target policy $\pi_T$ is learned.
Pseudocode: Q-learning (off-policy version)
- For each episode $\{s_0, a_0, r_1, s_1, a_1, r_2, \ldots\}$ generated by $\pi_b$:
- For each step $t = 0, 1, 2, \ldots$ do:
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - ( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) )]
$$
- Update target policy:
$$
\pi_{T,t+1}(a \mid s_t) = \begin{cases}
1 & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
0 & \text{otherwise}
\end{cases}
$$
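A sketch of this off-policy scheme in which a uniformly random behavior policy $\pi_b$ generates the experience while the greedy target policy $\pi_T$ is read off from $Q$. The Gym-style environment interface and hyperparameters are assumptions for illustration.

```python
import numpy as np

# Off-policy Q-learning sketch: pi_b is uniform random, pi_T is greedy w.r.t. Q.
# Assumes env.reset() -> s0 and env.step(a) -> (s_next, r, done, info).
def q_learning_off_policy(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions))   # behavior policy pi_b: explore freely
            s_next, r, done, _ = env.step(a)
            # q_{t+1}(s,a) = q_t(s,a) - alpha [ q_t(s,a) - (r + gamma max_a' q_t(s',a')) ]
            Q[s, a] -= alpha * (Q[s, a] - (r + gamma * np.max(Q[s_next])))
            s = s_next
    pi_target = np.argmax(Q, axis=1)           # deterministic greedy target policy pi_T
    return Q, pi_target
```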
Q-learning: On-policy Implementation
Although Q-learning is off-policy by nature, it can be implemented using an on-policy framework.
Pseudocode: Q-learning (on-policy version)
- For each episode, do:
- If current state $s_t$ is not terminal:
- Collect transition $(s_t, a_t, r_{t+1}, s_{t+1})$ by following current policy $\pi_t(s_t)$
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - ( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) )]
$$
- Update Policy:
$$
\pi_{t+1}(a \mid s_t) = \begin{cases}
1 - \frac{\varepsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1) & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
\frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$
Robbins-Monro Algorithm[1]: Root Finding with Noisy Observations
The RM algorithm aims to solve a root-finding problem using noisy samples:
$$
g(w) = 0
$$
This can be interpreted as solving a fixed-point problem: $g(w)=w-T(w)=0$
$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k), \quad \text{for } k = 1, 2, 3, \ldots
$$
- $w_k$: current estimate of the root
- $\tilde{g}(w_k, \eta_k)$: noisy estimate of $g(w_k)$
- $\alpha_k > 0$: learning rate (step size)
This method uses only input-output pairs — no full knowledge of $g(\cdot)$ is needed.
[1] Kushner, H.J. and Clark, D.S., 2012. Stochastic approximation methods for constrained and unconstrained systems (Vol. 26). Springer Science & Business Media.
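A minimal numerical sketch of the RM iteration; the particular $g$, the noise level, and the step-size schedule are illustrative assumptions chosen to satisfy the conditions of the theorem below.

```python
import numpy as np

# Robbins-Monro sketch: find the root of g(w) = 0 from noisy evaluations only.
rng = np.random.default_rng(0)
g = lambda w: w - 2.0                    # simple monotone g with root w* = 2 (slope c1 = c2 = 1)
w = 0.0
for k in range(1, 20001):
    alpha_k = 1.0 / k                    # sum alpha_k = inf, sum alpha_k^2 < inf
    g_noisy = g(w) + 0.1 * rng.standard_normal()   # zero-mean noise with bounded variance
    w = w - alpha_k * g_noisy            # w_{k+1} = w_k - alpha_k * g~(w_k, eta_k)
print(w)                                 # approaches 2.0 as k grows (a.s. in the limit)
```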
Theorem: Robbins-Monro Convergence Conditions
Let $w_k$ be updated by the Robbins-Monro process:
$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)
$$
Assume:
- $(a)$ $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$ (positive definite, strong monotonicity)
- $(b)$ $\alpha_k \in [0, 1]$, $\sum_{k=1}^\infty \alpha_k = \infty$ and $\sum_{k=1}^\infty \alpha_k^2 < \infty$
- $(c)$ $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$ (zero-mean noise with bounded conditional variance)
where $\mathcal{H}_k = \{ w_k, w_{k-1}, \ldots \}$ denotes the history of the estimates.
Under these conditions, the RM iterates asymptotically track the mean ODE
$$
\frac{d w(\tau)}{d \tau} = -g(w(\tau))
$$
whose equilibrium point satisfies $\dot w(\tau) = 0 \Rightarrow g(w(\tau)) = 0$. Consequently,
$$
w_k \xrightarrow[]{a.s.} w^* \text{ such that } g(w^*) = 0,
$$
i.e., with probability 1,
$$
\mathbb{P} \left( \lim_{k \to \infty} w_k = w^* \right) = 1.
$$
Applying RM to TD Learning
Bellman (expectation) equation:
$$
v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]
$$
We define:
$$
g(v(s_t)) := v(s_t) - \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s_t]
$$
Our goal is to solve Bellman equation, i.e., $g(v_\pi(s_t)) = 0$ using Robbins-Monro.
The noisy observation becomes:
$$
\begin{align*}
\tilde{g}(v_\pi(s_t)) &= v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})] \\
&= \underbrace{(v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t])}_{g(v_\pi(s_t))} \\
&\quad + \underbrace{(\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] - [r_{t+1} + \gamma v_\pi(s_{t+1})])}_{\eta}
\end{align*}
$$
Then, RM update is:
$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \tilde{g}(v_t(s_t))
$$
By the RM convergence result, under the corresponding regularity conditions, we have
$$
v_t \xrightarrow[]{a.s.} v_\pi \text{ such that } g(v_\pi) = 0\quad \text{(almost surely, i.e., with probability 1)}
$$
TD(0): Error Decomposition and Interpretation
We can express the TD(0) update as:
$$
\begin{align*}
&\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}}
- \alpha_t(s_t) \left[ \underbrace{v_t(s_t) - \underbrace{(r_{t+1} + \gamma v_t(s_{t+1}))}_{\text{TD target } \bar{v}_t}}_{\text{TD error } \delta_t} \right]
\end{align*}
$$
where:
$$
\bar{v}_t \triangleq r_{t+1} + \gamma v_t(s_{t+1}) \quad \text{(TD target)}
$$
$$
\delta_t \triangleq v_t(s_t) - \bar{v}_t = v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1})) \quad \text{(TD error)}
$$
The update rule blends the current estimate and the TD error.
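A TD(0) prediction sketch for a fixed policy, following the sign convention above (the update subtracts $\alpha_t \delta_t$). The policy is assumed to be a callable `pi(s) -> action`, and the Gym-style environment interface and hyperparameters are illustrative assumptions.

```python
import numpy as np

# TD(0) prediction sketch for a fixed policy pi (pi(s) -> action assumed).
# Assumes env.reset() -> s0 and env.step(a) -> (s_next, r, done, info).
def td0(env, pi, n_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(pi(s))
            td_target = r + gamma * V[s_next]    # \bar{v}_t
            td_error = V[s] - td_target          # \delta_t (sign convention of the slides)
            V[s] -= alpha * td_error             # v_{t+1}(s_t) = v_t(s_t) - alpha_t * delta_t
            s = s_next
    return V
```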
SARSA: Robbins-Monro Formulation
Goal: Solve the Bellman equation for a fixed policy $\pi$
$$
g_{\text{SARSA}}(Q^\pi)(s,a) = Q^\pi(s,a) - \mathbb{E}_\pi\left[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] = 0
$$
SARSA update:
$$
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) - \alpha_t \tilde{g}_{\text{SARSA}}(Q_t, \eta_t)
$$
Noisy RM update term:
$$
\tilde{g}_{\text{SARSA}}(Q_t, \eta_t) = Q_t(s_t, a_t) - ( r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) )
$$
- $g_{\text{SARSA}}(Q)$ is the Bellman operator error (with $Q^\pi$, $g_{\text{SARSA}}(Q^\pi)=0$)
- $\tilde g_{\text{SARSA}}(Q)$ is the sampled version of $g_{\text{SARSA}}(Q)$
- Noise comes from transition $(s_{t+1}, a_{t+1})$ and reward sample $r_{t+1}$
Under the RM regularity conditions, for the fixed policy $\pi$ we have
$$
Q_t \xrightarrow[]{a.s.} Q^{\pi} \text{ such that } g_{\text{SARSA}}(Q^{\pi}) = 0 \quad \text{(almost surely, i.e., with probability 1)}
$$
Combined with GLIE $\varepsilon$-greedy policy improvement (generalized policy iteration), this drives $Q_t$ toward $Q^*$.
Theorem 1 (General Convergence Theorem)[2]
Let $\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)$.
Then $\Delta_n(x) \to 0$ w.p.1 under the assumptions:
- The state space is finite.
- $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$.
- $\mathbb{E}[\beta_n(x)|P_n] \le \mathbb{E}[\alpha_n(x)|P_n]$ uniformly w.p.1.
- $\| \mathbb{E}[F_n(x)|P_n] \|_W \le \gamma \| \Delta_n \|_W$ with $\gamma \in (0, 1)$.
- $\mathrm{Var}[F_n(x)|P_n] \le C(1 + \| \Delta_n \|_W^2)$ for some constant $C$.
Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$, $\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid. The notation $\| \cdot \|_W$ refers to some weighted maximum norm.
[2] T. Jaakkola, M. I. Jordan and S. P. Singh, "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," in Neural Computation, vol. 6, no. 6, pp. 1185-1201, Nov. 1994, doi: 10.1162/neco.1994.6.6.1185
Interpreting Theorem 1: Role of $F_n$ and $\Delta_n$ (Condition 4)
We consider the iterative process:
$$
\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)
$$
Interpretation:
- $\Delta_n(x)$ represents the error between the current estimate and the fixed point (e.g., $Q_n - Q^*$).
- The update has two terms:
- $(1 - \alpha_n)\Delta_n$: a contraction term that shrinks the error.
- $\beta_n F_n$: a perturbation term that injects noise or correction.
Condition 4: Contraction of Expected Drift
$$
\| \mathbb{E}[F_n(x) | P_n] \|_W \leq \gamma \| \Delta_n \|_W, \quad \text{where } \gamma \in (0,1)
$$
- Ensures that on average, the direction $F_n(x)$ points toward reducing $\Delta_n(x)$.
- $\gamma < 1$ implies a contraction mapping, essential for stability.
- It restricts $F_n$ to behave like a controlled deviation, not an arbitrary perturbation.
In Q-learning:
- $\Delta_n(s,a) = Q_n(s,a) - Q^*(s,a)$
- $F_n(s,a) = r + \gamma \max_{a'} Q_n(s', a') - Q^*(s,a)$
- Bellman optimality ensures: $\mathbb{E}[F_n | P_n]$ contracts with rate $\gamma$ in $\ell_\infty$
Theorem 2 (Q-Learning Convergence)
$$
Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t)) Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_{s_t}(a_t) + \gamma V_t(s_{t+1}) \right], \quad V_t(s) \triangleq \max_a Q_t(s, a),
$$
converges to the optimal $Q^*(s,a)$ almost surely if:
- The state and action spaces are finite.
- $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ uniformly w.p.1.
- $\mathrm{Var}[r_s(a)]$ is bounded.
Sketch of Proof: Q-learning Matches Theorem 1
- Define the error and noise processes:
$$
\Delta_t(s, a) = Q_t(s, a) - Q^*(s, a); \quad
F_t(s, a) = r_s(a) + \gamma V_t(s') - Q^*(s, a)
$$
- Subtracting $Q^*(s,a)$ from both sides of the Q-learning update gives:
$$
Q_{t+1}(s,a)-Q^* = (1 - \alpha_t) Q_t(s,a)-(1-\alpha_t)Q^* + \alpha_t [r + \gamma V_t(s')]-\alpha_t Q^*
$$
- We then have:
$$
\Delta_{t+1}(s,a) = (1 - \alpha_t) \Delta_t(s,a) + \alpha_t F_t(s,a)
$$
- This is exactly the form of Theorem 1 with:
$$
\beta_t = \alpha_t
$$
Conclusion: The stochastic approximation process in Q-learning satisfies the structure required for convergence under Theorem 1.
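A small numeric sanity check (with arbitrary, hypothetical values) that subtracting $Q^*(s,a)$ from the Theorem 2 form of the update reproduces the recursion $\Delta_{t+1} = (1 - \alpha_t)\Delta_t + \alpha_t F_t$ exactly.

```python
import numpy as np

# Check the algebra above with arbitrary numbers: Delta_{t+1} = (1 - alpha) Delta_t + alpha F_t.
rng = np.random.default_rng(1)
alpha, gamma = 0.3, 0.9
q_sa, q_star_sa = rng.normal(), rng.normal()   # Q_t(s,a) and Q*(s,a) (hypothetical values)
r, v_next = rng.normal(), rng.normal()         # sampled reward and V_t(s') = max_a' Q_t(s',a')

q_next_sa = (1 - alpha) * q_sa + alpha * (r + gamma * v_next)   # Theorem 2 update form
delta_next = q_next_sa - q_star_sa                              # Delta_{t+1}(s,a)
delta_t = q_sa - q_star_sa                                      # Delta_t(s,a)
f_t = r + gamma * v_next - q_star_sa                            # F_t(s,a)
print(np.isclose(delta_next, (1 - alpha) * delta_t + alpha * f_t))  # True: Theorem 1 form, beta_t = alpha_t
```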
Sketch of Proof (continued): Verifying the Contraction Condition
- Define $\Delta_t(s,a) = Q_t(s,a) - Q^*(s,a)$
- Define $F_t(s,a) = r_s(a) + \gamma V_t(s') - Q^*(s,a)$
- Then Q-learning update matches Theorem 1 with $\beta_t = \alpha_t$.
- To check the contraction condition on $F_t$, switch to the notation of [2] ($i, j$ for states, $u, v$ for actions) and show:
$$
\begin{align*}
&\max_u \left| \mathbb{E}[F_t(i, u)] \right| \\
&= \gamma \max_u \left| \sum_j p_{ij}(u)[V_t(j) - V^*(j)] \right| \\
&\leq \gamma \max_u \sum_j p_{ij}(u) \max_v \left| Q_t(j, v) - Q^*(j, v) \right| \\
&= \gamma \max_u \sum_j p_{ij}(u) V^\Delta(j) \\
&= T(V^\Delta)(i)
\end{align*}
$$
where $V^\Delta(j) = \max_v |Q_t(j,v) - Q^*(j,v)|$ and $T$ is the DP value-iteration operator for the case where the costs associated with each state are zero. If $\gamma < 1$, the contraction property of $\mathbb{E}[F_t(i, u)]$ follows by bounding $\sum_j p_{ij}(u) V^\Delta(j)$ by $\max_j V^\Delta(j)$ and retaining the $\gamma$ factor.