Convergent Properties of Planning and RL Algorithms

Jongeun Choi
School of Mechanical Engineering, Yonsei University
March 2025

Planning Algorithms: Value Iteration (VI) and Policy Iteration (PI)

Assumptions: the MDP model (transition probabilities and rewards) is known, the state and action spaces are finite, and the discount factor satisfies $\gamma \in [0,1)$.
Convergence: value iteration converges because the Bellman optimality operator is a $\gamma$-contraction; policy iteration terminates in a finite number of steps.
Key Operator Property:
$$ T^* V(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) V(s') \right] $$
Both algorithms produce the optimal policy and value for a given MDP.
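
As an illustration of the operator above, here is a minimal value-iteration sketch; the transition tensor `P`, reward array `R`, and tolerance `tol` are assumed placeholders, not part of the original slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality operator T*.

    P[a, s, s'] : assumed transition probabilities p(s' | s, a)
    R[s, a]     : assumed expected rewards r(s, a)
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # (T* V)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)  # optimal value and greedy (optimal) policy
```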

Reinforcement Learning Algorithms: SARSA vs Q-learning

Assumptions: no model of the MDP is required; both methods learn from sampled transitions $(s, a, r, s')$.
Update Rules: SARSA uses $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s', a') - Q(s,a) \right]$ with $a'$ drawn from the current policy; Q-learning uses $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]$.
Exploration: both typically act with an $\varepsilon$-greedy policy so that all state-action pairs keep being visited.

Convergence Conditions Summary Table

| Algorithm        | Needs Model | Exploration | Converges To    | Guarantee             |
|------------------|-------------|-------------|-----------------|-----------------------|
| Value Iteration  | Yes         | No          | $V^*$, $\pi^*$  | Yes (via contraction) |
| Policy Iteration | Yes         | No          | $V^*$, $\pi^*$  | Yes (finite steps)    |
| Q-learning       | No          | Yes         | $Q^*$           | Yes (off-policy)      |
| SARSA            | No          | Yes         | $Q^{\pi}$       | Yes (on-policy)       |
Regularity conditions: For RL, convergence assumes every state-action pair is visited infinitely often, bounded rewards, Robbins-Monro step sizes ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), and, for SARSA to reach optimality, a decaying $\varepsilon$.

Target vs Behavior Policy

Behavior Policy $\mu$: the policy that selects actions during interaction with the environment. Target Policy $\pi$: the policy whose value is being evaluated or improved.
Comparison Table:

| Policy         | Role                              |
|----------------|-----------------------------------|
| Behavior $\mu$ | Generates actions for interaction |
| Target $\pi$   | Evaluated or optimized policy     |

Target vs Behavior Policy: Diagram

Behavior policy $\mu$ generates experience; target policy $\pi$ is improved using that experience.

On-policy vs Off-policy Learning

On-policy: the policy being learned is the same policy that generates the data (e.g., SARSA, where $\pi = \mu$).
Off-policy: the target policy differs from the behavior policy that generates the data (e.g., Q-learning, which learns the greedy policy while behaving exploratorily).
Key Difference: whether the policy being evaluated or improved is identical to the policy used to collect experience.

Advantages of Off-policy Learning

Off-policy methods can learn about the target policy from experience generated by a different behavior policy, so they can explore freely, reuse past or externally collected data, and learn the optimal policy while following an exploratory one. This makes off-policy methods like Q-learning highly flexible and robust in terms of exploration and data reuse.

SARSA: Optimal Policy Learning (Generalized policy iteration)

Initialization: $Q(s,a)$ for all state-action pairs (e.g., to zero) and an $\varepsilon$-greedy policy derived from $Q$.
Goal: Learn an optimal policy that leads to the target state from initial state $s_0$.

For each episode, do:
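
A minimal tabular SARSA sketch consistent with the loop above; the `env` interface with `reset()`/`step()` methods, the episode count, and the constant step size are assumed placeholders (the convergence theory uses decaying step sizes).

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng):
    # With probability eps pick a random action, otherwise the greedy one.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                                   # start from s_0
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)                 # interact with the environment
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # On-policy TD target: r + gamma * Q(s', a') with a' from the same policy
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```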

Why Does $|\mathcal{A}| - 1$ Appear in $\varepsilon$-Greedy?

$$ \pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = a^* \\ \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases} $$

Let $a^*$ be the greedy action and $|\mathcal{A}|$ be the total number of actions.

$$ \sum_{a \in \mathcal{A}} \pi(a \mid s) = \pi(a^* \mid s) + \sum_{a \ne a^*} \pi(a \mid s) = \left(1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|}\right) + (|\mathcal{A}| - 1) \cdot \dfrac{\varepsilon}{|\mathcal{A}|} = 1 $$
Conclusion: $|\mathcal{A}| - 1$ accounts for the non-greedy actions, ensuring that the total probability sums to 1.
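For example, with $|\mathcal{A}| = 4$ and $\varepsilon = 0.2$, the greedy action receives $1 - 0.2 + 0.2/4 = 0.85$ and each of the other $|\mathcal{A}| - 1 = 3$ actions receives $0.05$, so the probabilities sum to $0.85 + 3 \times 0.05 = 1$.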

Q-learning: Off-policy Implementation

In the off-policy version of Q-learning, the behavior policy $\pi_b$ generates episodes, while a separate target policy $\pi_T$ is learned.

Pseudocode: Q-learning (off-policy version)
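
A minimal sketch of this off-policy version, assuming for illustration that the behavior policy $\pi_b$ is uniform random (any sufficiently exploratory policy works), while the greedy target policy $\pi_T$ is read off the learned $Q$; the `env` interface and hyperparameters are placeholders.

```python
import numpy as np

def q_learning_off_policy(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions))              # behavior policy pi_b: uniform random
            s_next, r, done = env.step(a)
            # Off-policy target uses max over a', i.e., the greedy target policy pi_T
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    greedy_policy = np.argmax(Q, axis=1)                  # learned target policy pi_T
    return Q, greedy_policy
```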

Q-learning: On-policy Implementation

Although Q-learning is off-policy by nature, it can be implemented using an on-policy framework.

Pseudocode: Q-learning (on-policy version)
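
A sketch of the on-policy-style implementation, reusing `numpy` and the `epsilon_greedy` helper from the SARSA sketch above. Actions are chosen $\varepsilon$-greedily from the same $Q$ being learned; the max-based target is unchanged, which is what keeps the update off-policy at its core.

```python
def q_learning_on_policy(env, n_states, n_actions, episodes=500, alpha=0.1,
                         gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, n_actions, eps, rng)  # behave with the current Q
            s_next, r, done = env.step(a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```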

Robbins-Monro Algorithm[1]: Root Finding with Noisy Observations

The RM algorithm aims to solve a root-finding problem using noisy samples:

$$ g(w) = 0 $$

This can be interpreted as solving a fixed-point problem: $g(w)=w-T(w)=0$

$$ w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k), \quad \text{for } k = 1, 2, 3, \ldots $$

This method uses only input-output pairs — no full knowledge of $g(\cdot)$ is needed.

[1] Kushner, H.J. and Clark, D.S., 2012. Stochastic approximation methods for constrained and unconstrained systems (Vol. 26). Springer Science & Business Media.
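
A minimal numerical sketch of the RM iteration, assuming for illustration that the unknown root is the mean of a noisy signal: with $g(w) = w - \mathbb{E}[X]$, each noisy observation is $\tilde{g}(w_k, \eta_k) = w_k - x_k$, and the iterate converges to $\mathbb{E}[X]$ without ever evaluating $g$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.0                      # initial guess w_1
true_mean = 3.0              # unknown root of g(w) = w - E[X]

for k in range(1, 10_001):
    x_k = true_mean + rng.normal()        # noisy sample of X
    g_tilde = w - x_k                     # noisy observation of g(w_k)
    alpha_k = 1.0 / k                     # satisfies sum alpha_k = inf, sum alpha_k^2 < inf
    w = w - alpha_k * g_tilde             # RM update

print(w)   # close to 3.0
```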

Theorem: Robbins-Monro Convergence Conditions

Let $w_k$ be updated by the Robbins-Monro process:

$$ w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) $$
Assume:

  1. $0 < c_1 \le \nabla_w g(w) \le c_2$ for all $w$ (so $g$ has a unique root $w^*$);
  2. $\sum_{k} \alpha_k = \infty$ and $\sum_{k} \alpha_k^2 < \infty$;
  3. $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{ w_k, w_{k-1}, \ldots \}$.

Under these conditions, the RM iterates asymptotically track the mean ODE

$$ \frac{d w(\tau)}{d \tau} = -g(w(\tau)), $$

which has an equilibrium point where $\dot w(\tau) = 0$, i.e., $g(w(\tau)) = 0$. Consequently,

$$ w_k \xrightarrow[]{a.s.} w^* \text{ such that } g(w^*) = 0, \quad \text{i.e., } \mathbb{P} \left( \lim_{k \to \infty} w_k = w^* \right) = 1. $$

Applying RM to TD Learning

Bellman (expectation) equation:

$$ v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] $$

We define:

$$ g(v(s_t)) := v(s_t) - \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s_t] $$

Our goal is to solve the Bellman equation, i.e., $g(v_\pi(s_t)) = 0$, using Robbins-Monro.

The noisy observation becomes:

$$ \begin{align*} \tilde{g}(v_\pi(s_t)) &= v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})] \\ &= \underbrace{(v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t])}_{g(v_\pi(s_t))} \\ &\quad + \underbrace{(\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] - [r_{t+1} + \gamma v_\pi(s_{t+1})])}_{\eta} \end{align*} $$
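Note that the noise term has zero conditional mean, $\mathbb{E}[\eta \mid S_t = s_t] = 0$, and bounded variance for bounded rewards, so the RM conditions on the observation noise are satisfied.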

Then, RM update is:

$$ v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \tilde{g}(v_t(s_t)) $$

By the RM convergence theorem, under the corresponding regularity conditions we have

$$ v_t \xrightarrow[]{a.s.} v_\pi \text{ such that } g(v_\pi) = 0\quad \text{(almost surely, i.e., with probability 1)} $$

TD(0): Error Decomposition and Interpretation

We can express the TD(0) update as:

$$ \begin{align*} &\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \left[ \underbrace{v_t(s_t) - \underbrace{(r_{t+1} + \gamma v_t(s_{t+1}))}_{\text{TD target } \bar{v}_t}}_{\text{TD error } \delta_t} \right] \end{align*} $$

where:

$$ \bar{v}_t \triangleq r_{t+1} + \gamma v_t(s_{t+1}) \quad \text{(TD target)} $$ $$ \delta_t \triangleq v_t(s_t) - \bar{v}_t = v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1})) \quad \text{(TD error)} $$

The update rule moves the current estimate toward the TD target $\bar{v}_t$ by a step of size $\alpha_t(s_t)$ times the TD error $\delta_t$.
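
A minimal TD(0) prediction sketch matching the update above, assuming a `sample_episode(policy)` generator of $(s_t, r_{t+1}, s_{t+1}, \text{done})$ transitions (a placeholder, not part of the original slides); a constant step size is used for simplicity, whereas the convergence theory assumes decaying $\alpha_t(s_t)$.

```python
import numpy as np

def td0_prediction(sample_episode, policy, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    v = np.zeros(n_states)                                   # current estimate v_t
    for _ in range(episodes):
        for s, r, s_next, done in sample_episode(policy):
            td_target = r + gamma * v[s_next] * (not done)   # TD target v_bar_t
            td_error = v[s] - td_target                      # TD error delta_t
            v[s] = v[s] - alpha * td_error                   # v_{t+1}(s_t)
    return v
```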

SARSA: Robbins-Monro Formulation

Goal: Solve the Bellman equation for a fixed policy $\pi$
$$ g_{\text{SARSA}}(Q^\pi)(s,a) = Q^\pi(s,a) - \mathbb{E}_\pi\left[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] = 0 $$
SARSA update:
$$ Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) - \alpha_t \tilde{g}_{\text{SARSA}}(Q_t, \eta_t) $$
Noisy RM update term:
$$ \tilde{g}_{\text{SARSA}}(Q_t, \eta_t) = Q_t(s_t, a_t) - ( r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) ) $$

Under some regularity conditions, we have

$$ Q_t \xrightarrow[]{a.s.} Q^\pi \text{ such that } g_{\text{SARSA}}(Q^\pi) = 0\quad \text{(almost surely, i.e., with probability 1)} $$

Combined with policy improvement (e.g., $\varepsilon$-greedy with decaying $\varepsilon$), this evaluation step drives generalized policy iteration toward an optimal policy.

Theorem 1 (General Convergence Theorem)[2]

Let $\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)$.

Then $\Delta_n(x) \to 0$ w.p.1 under the assumptions:

  1. The state space is finite.
  2. $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$.
  3. $\mathbb{E}[\beta_n(x)|P_n] \le \mathbb{E}[\alpha_n(x)|P_n]$ uniformly w.p.1.
  4. $\| \mathbb{E}[F_n(x)|P_n] \|_W \le \gamma \| \Delta_n \|_W$ with $\gamma \in (0, 1)$.
  5. $\mathrm{Var}[F_n(x)|P_n] \le C(1 + \| \Delta_n \|_W^2)$ for some constant $C$.

Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$, $\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid. The notation $\| \cdot \|_W$ refers to some weighted maximum norm.

[2] T. Jaakkola, M. I. Jordan and S. P. Singh, "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," in Neural Computation, vol. 6, no. 6, pp. 1185-1201, Nov. 1994, doi: 10.1162/neco.1994.6.6.1185

Interpreting Theorem 1: Role of $F_n$ and $\Delta_n$ (Condition 4)

We consider the iterative process:

$$ \Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x) $$
Interpretation: $\Delta_n(x)$ is the error between the current estimate and the fixed point being sought, and $F_n(x)$ is the (noisy) update direction that drives that error.
Condition 4: Contraction of Expected Drift
$$ \| \mathbb{E}[F_n(x) | P_n] \|_W \leq \gamma \| \Delta_n \|_W, \quad \text{where } \gamma \in (0,1) $$
In expectation, each update shrinks the error by at least the factor $\gamma$, which is what forces $\Delta_n \to 0$.
In Q-learning: $\Delta_t(s_t, a_t) = Q_t(s_t, a_t) - Q^*(s_t, a_t)$ and $F_t(s_t, a_t) = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q^*(s_t, a_t)$; the contraction follows from the $\gamma$-contraction of the Bellman optimality operator (see the sketch of proof below).

Theorem 2 (Q-Learning Convergence)

$$ Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t)) Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_{s_t}(a_t) + \gamma V_t(s_{t+1}) \right], \quad V_t(s) \triangleq \max_a Q_t(s, a), $$

converges to the optimal $Q^*(s,a)$ almost surely if:

  1. The state and action spaces are finite.
  2. $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ uniformly w.p.1.
  3. $\mathrm{Var}[r_s(a)]$ is bounded.
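
Condition 2 is satisfied, for example, by a per-pair step size $\alpha_t(s,a) = 1 / N_t(s,a)$, where $N_t(s,a)$ counts visits to $(s,a)$; a brief sketch (the names and sizes are illustrative, not from the original slides):

```python
import numpy as np

n_states, n_actions = 10, 4                      # illustrative sizes
N = np.zeros((n_states, n_actions), dtype=int)   # visit counts N_t(s, a)

def step_size(s, a):
    """alpha_t(s,a) = 1 / N_t(s,a): harmonic in the visit count,
    so sum_t alpha_t = inf while sum_t alpha_t^2 < inf (Condition 2)."""
    N[s, a] += 1
    return 1.0 / N[s, a]
```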

Sketch of Proof: Q-learning Matches Theorem 1

Subtracting $Q^*(s_t, a_t)$ from both sides of the update and setting $\Delta_t(s,a) = Q_t(s,a) - Q^*(s,a)$, $\beta_t = \alpha_t$, and $F_t(s_t, a_t) = r_{s_t}(a_t) + \gamma V_t(s_{t+1}) - Q^*(s_t, a_t)$ puts Q-learning in the form of Theorem 1. Conditions 1-3 follow from the finite state and action spaces and the learning-rate assumptions, Condition 4 from the contraction property of the Bellman optimality operator (shown below), and Condition 5 from the bounded reward variance.

Conclusion: The stochastic approximation process in Q-learning satisfies the structure required for convergence under Theorem 1.

Sketch of Proof

Taking the conditional expectation of $F_t$ and using $Q^*(i,a) = \sum_j p_{ij}(a) \left[ r_i(a) + \gamma V^*(j) \right]$ gives

$$ \mathbb{E}[F_t(i, a) \mid P_t] = \gamma \sum_j p_{ij}(a) \left( V_t(j) - V^*(j) \right), \qquad \big| \mathbb{E}[F_t(i, a) \mid P_t] \big| \le \gamma \sum_j p_{ij}(a) V^\Delta(j) \le (T V^\Delta)(i), $$

where $V^\Delta(j) = \max_a |Q_t(j,a) - Q^*(j,a)|$ and $T$ is the DP value iteration operator for the case where the costs associated with each state are zero. If $\gamma < 1$, the contraction property of $\mathbb{E}[F_t(i, a)]$ is obtained by bounding $\sum_j p_{ij}(a) V^\Delta(j)$ by $\max_j V^\Delta(j)$ and then including the $\gamma$ factor.