Policy-induced Markov Chain

Jongeun Choi
School of Mechanical Engineering, Yonsei University
2025, March

Closed-Loop MDP Becomes a Markov Chain

Claim: Under a fixed policy $\pi_\theta$, an MDP becomes a Markov chain over $(s_t, a_t)$, with transition:
$$ p((s_{t+1}, a_{t+1}) \mid (s_t, a_t)) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1}) $$
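Proof Sketch: By the chain rule,
$$ p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot p(a_{t+1} \mid s_{t+1}, s_t, a_t) $$
Under the policy, $a_{t+1}$ is drawn from $\pi_\theta(\cdot \mid s_{t+1})$ and depends on nothing else, so the second factor equals $\pi_\theta(a_{t+1} \mid s_{t+1})$. The next pair therefore depends on the history only through $(s_t, a_t)$.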

Trajectory Factorization Diagram

Diagram omitted; see LaTeX for graphical details.

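Interpretation: Each step of a trajectory contributes a policy factor $\pi_\theta(a_t \mid s_t)$ and a dynamics factor $p(s_{t+1} \mid s_t, a_t)$. Pairing the dynamics factor with the next policy factor gives the transition of the chain over $(s_t, a_t)$: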

$$ p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1}) $$
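The factorization above can be checked numerically for a small tabular MDP. The following is a minimal sketch, assuming finite state and action spaces; the array names `P` and `pi`, the helper `induced_kernel`, and the sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2  # hypothetical sizes

# P[s, a, s'] = p(s' | s, a); each P[s, a, :] sums to 1.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)

# pi[s, a] = pi_theta(a | s); each row sums to 1.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

def induced_kernel(P, pi):
    """Return K with K[(s, a), (s', a')] = p(s' | s, a) * pi(a' | s').

    The pair (s, a) is flattened to the single index s * n_actions + a.
    """
    n_s, n_a, _ = P.shape
    K = P[:, :, :, None] * pi[None, None, :, :]  # shape (s, a, s', a')
    return K.reshape(n_s * n_a, n_s * n_a)

K = induced_kernel(P, pi)
print(K.shape)                          # one row and one column per (s, a) pair
print(np.allclose(K.sum(axis=1), 1.0))  # every row is a probability distribution
```

Each row of `K` summing to one is what makes it a valid Markov chain transition matrix over the pairs $(s_t, a_t)$.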

Trajectory Probability Derivation

1. Initial State Sampling
$$ p(s_1) \quad \text{(Sampled from initial state distribution)} $$
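2. From Time Step 1 to T: Markov Transitions with Policy
$$ \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t), \quad t = 1, \dots, T $$
3. By the Chain Rule of Probability:
$$ p_\theta(\tau) = p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2, a_2 \mid s_1, a_1) \cdot p(s_3, a_3 \mid s_2, a_2) \cdots p(s_T, a_T \mid s_{T-1}, a_{T-1}) \cdot p(s_{T+1} \mid s_T, a_T) $$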
Unfolding this using the conditional factorization:
$$ p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t) $$
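The product formula can be evaluated directly for a tabular MDP. The sketch below assumes finite state and action spaces; the function `trajectory_prob` and the two-state, two-action numbers are hypothetical, made up only to illustrate the formula.

```python
import itertools
import numpy as np

def trajectory_prob(p1, P, pi, states, actions):
    """p(s_1) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t).

    states holds s_1..s_{T+1} (length T + 1); actions holds a_1..a_T (length T).
    """
    prob = p1[states[0]]
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        prob *= pi[s, a] * P[s, a, s_next]
    return prob

# Hypothetical MDP: p1[s], P[s, a, s'], pi[s, a]
p1 = np.array([0.5, 0.5])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# tau = (s_1=0, a_1=1, s_2=1, a_2=0, s_3=0), so T = 2
prob = trajectory_prob(p1, P, pi, states=[0, 1, 0], actions=[1, 0])
print(prob)  # 0.5 * (0.3 * 0.8) * (0.4 * 0.5), about 0.024

# Summing over every trajectory of the same length should give 1.
total = sum(
    trajectory_prob(p1, P, pi, [s1, s2, s3], [a1, a2])
    for s1, a1, s2, a2, s3 in itertools.product(range(2), repeat=5)
)
print(total)
```

For long horizons the product underflows quickly, so in practice one would usually sum log-probabilities instead of multiplying probabilities.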

Trajectory Probability Derivation (cont.)

$$ p_\theta(\tau) = p(s_1, a_1, \dots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t) $$
\begin{align*} p_\theta(\tau) &= p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2 \mid s_1, a_1) \cdot \pi_\theta(a_2 \mid s_2) \cdot p(s_3 \mid s_2, a_2) \cdots \\ &= p(s_1)\cdot \pi_\theta(a_1 \mid s_1)\underbrace{ p(s_2 \mid s_1, a_1) \cdot \pi_\theta(a_2 \mid s_2) }_{= p(s_2, a_2 \mid s_1, a_1)} \underbrace{ p(s_3 \mid s_2, a_2) \cdot \pi_\theta (a_3|s_3)}_{= p(s_3, a_3 \mid s_2, a_2)} \cdots \end{align*}

Each underbraced group is one transition of the chain over $(s_t, a_t)$, since $$ p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1}) $$

For example, if $x_t$ is Markovian, i.e., $p(x_{t+1} \mid x_1, \dots, x_t) = p(x_{t+1} \mid x_t)$, the chain rule gives

\begin{align*} p(x_1, x_2, x_3)&= p(x_1)p( x_2, x_3| x_1)\\ &=p(x_1)p(x_2|x_1) p(x_3|x_1, x_2)\\ &=p(x_1)p(x_2|x_1)p(x_3|x_2) \end{align*}
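The same three-variable factorization can be applied with $x_t = (s_t, a_t)$ to confirm that the chain form and the unfolded policy/dynamics product agree. This is a brute-force sketch over a small random tabular MDP; all names (`p1`, `P`, `pi`, `mu`, `K`, `normalize`) and sizes are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a = 2, 2  # hypothetical sizes

def normalize(x, axis):
    """Scale x so that it sums to 1 along the given axis."""
    return x / x.sum(axis=axis, keepdims=True)

p1 = normalize(rng.random(n_s), axis=0)             # p(s_1)
P = normalize(rng.random((n_s, n_a, n_s)), axis=2)  # P[s, a, s'] = p(s' | s, a)
pi = normalize(rng.random((n_s, n_a)), axis=1)      # pi[s, a] = pi(a | s)

# Pair chain x_t = (s_t, a_t): initial distribution and transition kernel
mu = p1[:, None] * pi                        # mu[s, a] = p(s) * pi(a | s)
K = P[:, :, :, None] * pi[None, None, :, :]  # K[s, a, s', a']

pairs = list(itertools.product(range(n_s), range(n_a)))
max_gap, total = 0.0, 0.0
for (s1, a1), (s2, a2), (s3, a3) in itertools.product(pairs, repeat=3):
    # p(x_1) p(x_2 | x_1) p(x_3 | x_2) with x_t = (s_t, a_t)
    chain_form = mu[s1, a1] * K[s1, a1, s2, a2] * K[s2, a2, s3, a3]
    # The same probability written as alternating policy and dynamics factors
    unfolded = (p1[s1] * pi[s1, a1] * P[s1, a1, s2] * pi[s2, a2]
                * P[s2, a2, s3] * pi[s3, a3])
    max_gap = max(max_gap, abs(chain_form - unfolded))
    total += chain_form

print(max_gap)  # largest disagreement between the two forms
print(total)    # sum over all (x_1, x_2, x_3)
```

The gap should be zero up to floating-point error and the total should be one, which is what the factorization predicts.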