Convergent Properties of Planning and RL Algorithms
Jongeun Choi
School of Mechanical Engineering, Yonsei University
March 2025
Planning Algorithms: Value Iteration (VI) and Policy Iteration (PI)
Assumptions:
- Finite MDP with known transition dynamics and rewards
- Discount factor $\gamma \in [0, 1)$
- No need for exploration (model is fully known)
Convergence:
- Value Iteration: converges to $V^*$ using Bellman optimality operator (contraction mapping)
- Policy Iteration: converges to $\pi^*$ in finite steps (strict improvement per iteration)
Key Operator Property:
$$
T^* V(s) = \max_a \left[ r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) V(s') \right]
$$
Both algorithms produce the optimal policy and value for a given MDP.
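A minimal NumPy sketch of value iteration using the Bellman optimality operator above; the toy transition tensor `P`, reward array `R`, discount, and tolerance are illustrative assumptions, not part of the slides.

```python
import numpy as np

# Minimal value-iteration sketch on a toy finite MDP (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a).
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality operator: (T* V)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]
        Q = R + gamma * P @ V              # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # contraction => geometric convergence
            break
        V = V_new
    policy = Q.argmax(axis=1)              # greedy policy w.r.t. the converged values
    return V, policy

# Tiny 2-state, 2-action example (made-up numbers) just to exercise the routine.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
print(V_star, pi_star)
```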
Reinforcement Learning Algorithms: SARSA vs Q-learning
Assumptions:
- Environment is unknown (model-free)
- Agent learns via sampling transitions
- Requires exploration to ensure convergence
Update Rules:
- SARSA (on-policy): solves Bellman expectation equation
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
$$
- Q-learning (off-policy): solves Bellman optimality equation
$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t [r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
$$
Exploration:
- Required: typically with $\varepsilon$-greedy policies
- Decaying $\varepsilon \to 0$ ensures convergence to $Q^*$
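The two tabular update rules side by side, as a sketch; `Q` is assumed to be an array indexed by `[state, action]`, and `alpha`/`gamma` are placeholder hyperparameters.

```python
import numpy as np

# Sketch of the two one-step updates (Q indexed as Q[state, action]; alpha, gamma assumed).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD target: uses the action a_next actually taken by the behavior policy
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target: uses the greedy (max) action regardless of what was executed
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```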
Convergence Conditions Summary Table
| Algorithm | Needs Model | Exploration | Converges To | Guarantee |
|---|---|---|---|---|
| Value Iteration | Yes | No | $V^*$, $\pi^*$ | Yes (via contraction) |
| Policy Iteration | Yes | No | $V^*$, $\pi^*$ | Yes (finite steps) |
| Q-learning | No | Yes | $Q^*$ | Yes (off-policy) |
| SARSA | No | Yes | $Q^{\pi}$ | Yes (on-policy) |
Regularity conditions: for RL, convergence requires that every state-action pair is visited infinitely often, that rewards are bounded, that step sizes satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), and, for convergence to $Q^*$ with SARSA, that $\varepsilon$ decays to zero.
Target vs Behavior Policy
Behavior Policy $\mu$
- The policy used to interact with the environment
- Generates experience (actions taken)
- Often includes exploration (e.g., $\varepsilon$-greedy)
Target Policy $\pi$
- The policy being learned or improved
- Represents desired behavior (e.g., greedy w.r.t. $Q$)
- In off-policy RL: $\pi \ne \mu$; in on-policy RL: $\pi = \mu$
Comparison Table:
| Policy | Role |
|---|---|
| Behavior $\mu$ | Generates actions for interaction |
| Target $\pi$ | Evaluated or optimized policy |
Target vs Behavior Policy: Diagram
Behavior policy $\mu$ generates experience; target policy $\pi$ is improved using that experience.
On-policy vs Off-policy Learning
On-policy:
- The behavior policy is the same as the target policy; the agent learns the value of the policy it is currently executing
- Examples: SARSA, Actor-Critic
- Sensitive to behavior policy (e.g., $\varepsilon$-greedy must decay)
Off-policy:
- Learns value of a different (target) policy than behavior
- Examples: Q-learning, DQN
- Can learn optimal policy while behaving randomly
Key Difference:
- On-policy follows its current policy while learning
- Off-policy separates exploration from exploitation
Advantages of Off-policy Learning
- Off-policy methods can search for an optimal policy using experience samples generated by any other policy.
- The behavior policy can be selected to be exploratory. This is a key advantage in practice.
- For example, to estimate the action-values of all state-action pairs, we can use an exploratory policy to ensure that all pairs are visited sufficiently often.
This makes off-policy methods like Q-learning highly flexible and robust in terms of exploration and data reuse.
SARSA: Optimal Policy Learning (Generalized policy iteration)
Initialization:
- Constant step size $\alpha_t(s,a) = \alpha > 0$ for all $(s,a)$ and all $t$; exploration rate $\varepsilon \in (0,1)$
- Initial $q_0(s,a)$ for all $(s,a)$; initial $\varepsilon$-greedy policy $\pi_0$ derived from $q_0$
Goal: Learn an optimal policy that leads to the target state from initial state $s_0$
For each episode, do:
- Generate $a_0$ at $s_0$ following $\pi_0(s_0)$
- While $s_t$ is not terminal:
- Collect $(r_{t+1}, s_{t+1}, a_{t+1})$ from interaction (i.e., environment + $\pi_t$)
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1}))]
$$
- Update policy:
$$
\pi_{t+1}(a \mid s_t) = \begin{cases}
1 - \frac{\varepsilon}{|\mathcal{A}(s_t)|} (|\mathcal{A}(s_t)| - 1) & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
\frac{\varepsilon}{|\mathcal{A}(s_t)|} & \text{otherwise}
\end{cases}
$$
- Update $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
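A sketch of this generalized-policy-iteration loop in Python. It assumes a minimal, Gym-style environment interface (`env.reset() -> s0`, `env.step(a) -> (s_next, r, done, info)`); the hyperparameters and the helper `epsilon_greedy` are illustrative, not part of the slides.

```python
import numpy as np

# SARSA with epsilon-greedy policy improvement (sketch under the assumptions above).
def epsilon_greedy(Q, s, epsilon, rng):
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform over all actions
    return int(np.argmax(Q[s]))               # exploit: greedy action w.r.t. current Q

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # q_{t+1}(s,a) = q_t(s,a) - alpha [ q_t(s,a) - (r + gamma q_t(s',a')) ]
            Q[s, a] -= alpha * (Q[s, a] - (r + gamma * Q[s_next, a_next]))
            s, a = s_next, a_next              # the epsilon-greedy policy is implicit in Q
    return Q
```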
Why Does $|\mathcal{A}| - 1$ Appear in $\varepsilon$-Greedy?
$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = a^* \\
\dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$
Let $a^*$ be the greedy action and $|\mathcal{A}|$ be the total number of actions.
$$
\sum_{a \in \mathcal{A}} \pi(a \mid s)
= \pi(a^* \mid s) + \sum_{a \ne a^*} \pi(a \mid s)
= \left(1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|}\right) + (|\mathcal{A}| - 1) \cdot \dfrac{\varepsilon}{|\mathcal{A}|}
= 1
$$
Conclusion: $|\mathcal{A}| - 1$ accounts for the non-greedy actions, ensuring that the total probability sums to 1.
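A quick numeric check of the identity above; $|\mathcal{A}|$ and $\varepsilon$ are chosen arbitrarily.

```python
# Numeric check that the epsilon-greedy probabilities sum to 1 (arbitrary |A| and epsilon).
n_actions, epsilon = 4, 0.25
p_greedy = 1 - epsilon + epsilon / n_actions   # probability of the greedy action a*
p_other = epsilon / n_actions                  # probability of each non-greedy action
total = p_greedy + (n_actions - 1) * p_other   # (|A| - 1) non-greedy actions take the rest
print(total)                                   # 1.0
```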
Q-learning: Off-policy Implementation
In the off-policy version of Q-learning, the behavior policy $\pi_b$ generates episodes, while a separate target policy $\pi_T$ is learned.
Pseudocode: Q-learning (off-policy version)
- For each episode $\{s_0, a_0, r_1, s_1, a_1, r_2, \ldots\}$ generated by $\pi_b$:
- For each step $t = 0, 1, 2, \ldots$ do:
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - ( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) )]
$$
- Update target policy:
$$
\pi_{T,t+1}(a \mid s_t) = \begin{cases}
1 & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
0 & \text{otherwise}
\end{cases}
$$
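A sketch of this off-policy scheme in which a uniformly random behavior policy $\pi_b$ generates the experience while the greedy target policy $\pi_T$ is read off from $Q$. The Gym-style environment interface and hyperparameters are assumptions for illustration.

```python
import numpy as np

# Off-policy Q-learning sketch: pi_b is uniform random, pi_T is greedy w.r.t. Q.
# Assumes env.reset() -> s0 and env.step(a) -> (s_next, r, done, info).
def q_learning_off_policy(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions))   # behavior policy pi_b: explore freely
            s_next, r, done, _ = env.step(a)
            # q_{t+1}(s,a) = q_t(s,a) - alpha [ q_t(s,a) - (r + gamma max_a' q_t(s',a')) ]
            Q[s, a] -= alpha * (Q[s, a] - (r + gamma * np.max(Q[s_next])))
            s = s_next
    pi_target = np.argmax(Q, axis=1)           # deterministic greedy target policy pi_T
    return Q, pi_target
```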
Q-learning: On-policy Implementation
Although Q-learning is off-policy by nature, it can be implemented using an on-policy framework.
Pseudocode: Q-learning (on-policy version)
- For each episode, do:
- If current state $s_t$ is not terminal:
- Collect transition $(s_t, a_t, r_{t+1}, s_{t+1})$ by following current policy $\pi_t(s_t)$
- Update Q-value:
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) [q_t(s_t, a_t) - ( r_{t+1} + \gamma \max_a q_t(s_{t+1}, a) )]
$$
- Update Policy:
$$
\pi_{t+1}(a \mid s_t) = \begin{cases}
1 - \frac{\varepsilon}{|\mathcal{A}|} (|\mathcal{A}| - 1) & \text{if } a = \arg\max_a q_{t+1}(s_t, a) \\
\frac{\varepsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$
Robbins-Monro Algorithm[1]: Root Finding with Noisy Observations
The RM algorithm aims to solve a root-finding problem using noisy samples:
$$
g(w) = 0
$$
This can be interpreted as solving a fixed-point problem: $g(w)=w-T(w)=0$
$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k), \quad \text{for } k = 1, 2, 3, \ldots
$$
- $w_k$: current estimate of the root
- $\tilde{g}(w_k, \eta_k)$: noisy estimate of $g(w_k)$
- $\alpha_k > 0$: learning rate (step size)
This method uses only input-output pairs — no full knowledge of $g(\cdot)$ is needed.
[1] Kushner, H.J. and Clark, D.S., 2012. Stochastic approximation methods for constrained and unconstrained systems (Vol. 26). Springer Science & Business Media.
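A minimal numerical sketch of the RM iteration; the particular $g$, the noise level, and the step-size schedule are illustrative assumptions chosen to satisfy the conditions of the theorem below.

```python
import numpy as np

# Robbins-Monro sketch: find the root of g(w) = 0 from noisy evaluations only.
rng = np.random.default_rng(0)
g = lambda w: w - 2.0                    # simple monotone g with root w* = 2 (slope c1 = c2 = 1)
w = 0.0
for k in range(1, 20001):
    alpha_k = 1.0 / k                    # sum alpha_k = inf, sum alpha_k^2 < inf
    g_noisy = g(w) + 0.1 * rng.standard_normal()   # zero-mean noise with bounded variance
    w = w - alpha_k * g_noisy            # w_{k+1} = w_k - alpha_k * g~(w_k, eta_k)
print(w)                                 # approaches 2.0 as k grows (a.s. in the limit)
```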
Theorem: Robbins-Monro Convergence Conditions
Let $w_k$ be updated by the Robbins-Monro process:
$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k)
$$
Assume:
- $(a)$ $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$ (positive definite, strong monotonicity)
- $(b)$ $\alpha_k \in [0, 1]$, $\sum_{k=1}^\infty \alpha_k = \infty$ and $\sum_{k=1}^\infty \alpha_k^2 < \infty$
- $(c)$ $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$ (zero-mean noise with bounded conditional variance)
where $\mathcal{H}_k = \{ w_k, w_{k-1}, \ldots \}$ denotes the history of the estimates.
Under these conditions, the RM iterates asymptotically track the mean ODE
$$
\frac{d w(\tau)}{d \tau} = -g(w(\tau))
$$
whose equilibrium point satisfies $\dot w(\tau) = 0 \Rightarrow g(w(\tau)) = 0$. Consequently,
$$
w_k \xrightarrow[]{a.s.} w^* \text{ such that } g(w^*) = 0,
$$
i.e., with probability 1,
$$
\mathbb{P} \left( \lim_{k \to \infty} w_k = w^* \right) = 1.
$$
Applying RM to TD Learning
Bellman (expectation) equation:
$$
v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]
$$
We define:
$$
g(v(s_t)) := v(s_t) - \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s_t]
$$
Our goal is to solve Bellman equation, i.e., $g(v_\pi(s_t)) = 0$ using Robbins-Monro.
The noisy observation becomes:
$$
\begin{align*}
\tilde{g}(v_\pi(s_t)) &= v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})] \\
&= \underbrace{(v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t])}_{g(v_\pi(s_t))} \\
&\quad + \underbrace{(\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] - [r_{t+1} + \gamma v_\pi(s_{t+1})])}_{\eta}
\end{align*}
$$
Then, RM update is:
$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \tilde{g}(v_t(s_t))
$$
By the RM convergence result, under the corresponding regularity conditions, we have
$$
v_t \xrightarrow[]{a.s.} v_\pi \text{ such that } g(v_\pi) = 0\quad \text{(almost surely, i.e., with probability 1)}
$$
TD(0): Error Decomposition and Interpretation
We can express the TD(0) update as:
$$
\begin{align*}
&\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}}
- \alpha_t(s_t) \left[ \underbrace{v_t(s_t) - \underbrace{(r_{t+1} + \gamma v_t(s_{t+1}))}_{\text{TD target } \bar{v}_t}}_{\text{TD error } \delta_t} \right]
\end{align*}
$$
where:
$$
\bar{v}_t \triangleq r_{t+1} + \gamma v_t(s_{t+1}) \quad \text{(TD target)}
$$
$$
\delta_t \triangleq v_t(s_t) - \bar{v}_t = v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1})) \quad \text{(TD error)}
$$
The update rule blends the current estimate and the TD error.
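A TD(0) prediction sketch for a fixed policy, following the sign convention above (the update subtracts $\alpha_t \delta_t$). The policy is assumed to be a callable `pi(s) -> action`, and the Gym-style environment interface and hyperparameters are illustrative assumptions.

```python
import numpy as np

# TD(0) prediction sketch for a fixed policy pi (pi(s) -> action assumed).
# Assumes env.reset() -> s0 and env.step(a) -> (s_next, r, done, info).
def td0(env, pi, n_states, episodes=500, alpha=0.1, gamma=0.99):
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done, _ = env.step(pi(s))
            td_target = r + gamma * V[s_next]    # \bar{v}_t
            td_error = V[s] - td_target          # \delta_t (sign convention of the slides)
            V[s] -= alpha * td_error             # v_{t+1}(s_t) = v_t(s_t) - alpha_t * delta_t
            s = s_next
    return V
```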
SARSA: Robbins-Monro Formulation
Goal: Solve the Bellman equation for a fixed policy $\pi$
$$
g_{\text{SARSA}}(Q^\pi)(s,a) = Q^\pi(s,a) - \mathbb{E}_\pi\left[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] = 0
$$
SARSA update:
$$
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) - \alpha_t \tilde{g}_{\text{SARSA}}(Q_t, \eta_t)
$$
Noisy RM update term:
$$
\tilde{g}_{\text{SARSA}}(Q_t, \eta_t) = Q_t(s_t, a_t) - ( r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) )
$$
- $g_{\text{SARSA}}(Q)$ is the Bellman operator error (with $Q^\pi$, $g_{\text{SARSA}}(Q^\pi)=0$)
- $\tilde g_{\text{SARSA}}(Q)$ is the sampled version of $g_{\text{SARSA}}(Q)$
- Noise comes from transition $(s_{t+1}, a_{t+1})$ and reward sample $r_{t+1}$
Under the RM regularity conditions, for the fixed policy $\pi$ we have
$$
Q_t \xrightarrow[]{a.s.} Q^{\pi} \text{ such that } g_{\text{SARSA}}(Q^{\pi}) = 0 \quad \text{(almost surely, i.e., with probability 1)}
$$
Combined with GLIE $\varepsilon$-greedy policy improvement (generalized policy iteration), this drives $Q_t$ toward $Q^*$.
Theorem 1 (General Convergence Theorem)[2]
Let $\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)$.
Then $\Delta_n(x) \to 0$ w.p.1 under the assumptions:
- The state space is finite.
- $\sum_n \alpha_n(x) = \infty$, $\sum_n \alpha_n^2(x) < \infty$, $\sum_n \beta_n(x) = \infty$, $\sum_n \beta_n^2(x) < \infty$.
- $\mathbb{E}[\beta_n(x)|P_n] \le \mathbb{E}[\alpha_n(x)|P_n]$ uniformly w.p.1.
- $\| \mathbb{E}[F_n(x)|P_n] \|_W \le \gamma \| \Delta_n \|_W$ with $\gamma \in (0, 1)$.
- $\mathrm{Var}[F_n(x)|P_n] \le C(1 + \| \Delta_n \|_W^2)$ for some constant $C$.
Here $P_n = \{\Delta_n, \Delta_{n-1}, \ldots, F_{n-1}, \ldots, \alpha_{n-1}, \ldots, \beta_{n-1}, \ldots\}$ stands for the past at step $n$. $F_n(x)$, $\alpha_n(x)$ and $\beta_n(x)$ are allowed to depend on the past insofar as the above conditions remain valid. The notation $\| \cdot \|_W$ refers to some weighted maximum norm.
[2] T. Jaakkola, M. I. Jordan and S. P. Singh, "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," in Neural Computation, vol. 6, no. 6, pp. 1185-1201, Nov. 1994, doi: 10.1162/neco.1994.6.6.1185
Interpreting Theorem 1: Role of $F_n$ and $\Delta_n$ (Condition 4)
We consider the iterative process:
$$
\Delta_{n+1}(x) = (1 - \alpha_n(x)) \Delta_n(x) + \beta_n(x) F_n(x)
$$
Interpretation:
- $\Delta_n(x)$ represents the error between the current estimate and the fixed point (e.g., $Q_n - Q^*$).
- The update has two terms:
- $(1 - \alpha_n)\Delta_n$: a contraction term that shrinks the error.
- $\beta_n F_n$: a perturbation term that injects noise or correction.
Condition 4: Contraction of Expected Drift
$$
\| \mathbb{E}[F_n(x) | P_n] \|_W \leq \gamma \| \Delta_n \|_W, \quad \text{where } \gamma \in (0,1)
$$
- Ensures that on average, the direction $F_n(x)$ points toward reducing $\Delta_n(x)$.
- $\gamma < 1$ implies a contraction mapping, essential for stability.
- It restricts $F_n$ to behave like a controlled deviation, not an arbitrary perturbation.
In Q-learning:
- $\Delta_n(s,a) = Q_n(s,a) - Q^*(s,a)$
- $F_n(s,a) = r + \gamma \max_{a'} Q_n(s', a') - Q^*(s,a)$
- Bellman optimality ensures: $\mathbb{E}[F_n | P_n]$ contracts with rate $\gamma$ in $\ell_\infty$
Theorem 2 (Q-Learning Convergence)
$$
Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t)) Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_{s_t}(a_t) + \gamma V_t(s_{t+1}) \right], \quad V_t(s) \triangleq \max_a Q_t(s, a),
$$
converges to the optimal $Q^*(s,a)$ almost surely if:
- The state and action spaces are finite.
- $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$ uniformly w.p.1.
- $\mathrm{Var}[r_s(a)]$ is bounded.
Sketch of Proof: Q-learning Matches Theorem 1
- Define the error and noise processes:
$$
\Delta_t(s, a) = Q_t(s, a) - Q^*(s, a); \quad
F_t(s, a) = r_s(a) + \gamma V_t(s') - Q^*(s, a)
$$
- Subtracting $Q^*(s,a)$ from both sides of the Q-learning update gives:
$$
Q_{t+1}(s,a)-Q^* = (1 - \alpha_t) Q_t(s,a)-(1-\alpha_t)Q^* + \alpha_t [r + \gamma V_t(s')]-\alpha_t Q^*
$$
- We then have:
$$
\Delta_{t+1}(s,a) = (1 - \alpha_t) \Delta_t(s,a) + \alpha_t F_t(s,a)
$$
- This is exactly the form of Theorem 1 with:
$$
\beta_t = \alpha_t
$$
Conclusion: The stochastic approximation process in Q-learning satisfies the structure required for convergence under Theorem 1.
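A small numeric sanity check (with arbitrary, hypothetical values) that subtracting $Q^*(s,a)$ from the Theorem 2 form of the update reproduces the recursion $\Delta_{t+1} = (1 - \alpha_t)\Delta_t + \alpha_t F_t$ exactly.

```python
import numpy as np

# Check the algebra above with arbitrary numbers: Delta_{t+1} = (1 - alpha) Delta_t + alpha F_t.
rng = np.random.default_rng(1)
alpha, gamma = 0.3, 0.9
q_sa, q_star_sa = rng.normal(), rng.normal()   # Q_t(s,a) and Q*(s,a) (hypothetical values)
r, v_next = rng.normal(), rng.normal()         # sampled reward and V_t(s') = max_a' Q_t(s',a')

q_next_sa = (1 - alpha) * q_sa + alpha * (r + gamma * v_next)   # Theorem 2 update form
delta_next = q_next_sa - q_star_sa                              # Delta_{t+1}(s,a)
delta_t = q_sa - q_star_sa                                      # Delta_t(s,a)
f_t = r + gamma * v_next - q_star_sa                            # F_t(s,a)
print(np.isclose(delta_next, (1 - alpha) * delta_t + alpha * f_t))  # True: Theorem 1 form, beta_t = alpha_t
```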
Sketch of Proof (continued): Verifying the Contraction Condition
- Define $\Delta_t(s,a) = Q_t(s,a) - Q^*(s,a)$
- Define $F_t(s,a) = r_s(a) + \gamma V_t(s') - Q^*(s,a)$
- Then Q-learning update matches Theorem 1 with $\beta_t = \alpha_t$.
- To check the contraction condition on $F_t$, switch to the notation of [2] ($i, j$ for states, $u, v$ for actions) and show:
$$
\begin{align*}
&\max_u \left| \mathbb{E}[F_t(i, u)] \right| \\
&= \gamma \max_u \left| \sum_j p_{ij}(u)[V_t(j) - V^*(j)] \right| \\
&\leq \gamma \max_u \sum_j p_{ij}(u) \max_v \left| Q_t(j, v) - Q^*(j, v) \right| \\
&= \gamma \max_u \sum_j p_{ij}(u) V^\Delta(j) \\
&= T(V^\Delta)(i)
\end{align*}
$$
where $V^\Delta(j) = \max_v |Q_t(j,v) - Q^*(j,v)|$ and $T$ is the DP value-iteration operator for the case where the costs associated with each state are zero. If $\gamma < 1$, the contraction property of $\mathbb{E}[F_t(i, u)]$ follows by bounding $\sum_j p_{ij}(u) V^\Delta(j)$ by $\max_j V^\Delta(j)$ and retaining the $\gamma$ factor.