Closed-Loop MDP Becomes a Markov Chain
Claim: Under a fixed policy $\pi_\theta$, an MDP induces a Markov chain over state-action pairs $(s_t, a_t)$, with transition probability:
$$
p((s_{t+1}, a_{t+1}) \mid (s_t, a_t)) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
Proof Sketch:
- Start with the joint conditional probability:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t)
$$
- Apply the chain rule of probability (i.e., the definition of conditional probability):
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot p(a_{t+1} \mid s_{t+1}, s_t, a_t)
$$
- Since, under the policy $\pi_\theta(a_{t+1} \mid s_{t+1})$, the selection of $a_{t+1}$ depends only on $s_{t+1}$:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
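To make the claim concrete, here is a minimal numerical sketch using a hypothetical 2-state, 2-action tabular MDP (the transition probabilities and the fixed policy below are made-up illustrative numbers, not from the notes): it assembles the induced transition matrix over $(s, a)$ pairs from $p(s' \mid s, a)$ and $\pi_\theta(a' \mid s')$ and checks that every row sums to one.

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1
])
# pi[s, a] = pi_theta(a | s), an assumed fixed policy
pi = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
])

n_states, n_actions = pi.shape

# Induced Markov chain over (s, a) pairs:
# T[(s, a), (s', a')] = p(s' | s, a) * pi_theta(a' | s')
T = np.zeros((n_states * n_actions, n_states * n_actions))
for s in range(n_states):
    for a in range(n_actions):
        for s_next in range(n_states):
            for a_next in range(n_actions):
                T[s * n_actions + a, s_next * n_actions + a_next] = (
                    P[s, a, s_next] * pi[s_next, a_next]
                )

# Each row is a valid conditional distribution over the next (s', a') pair.
assert np.allclose(T.sum(axis=1), 1.0)
print(T)
```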
Trajectory Factorization Diagram
Diagram omitted; see LaTeX for graphical details.
Interpretation: Each step factors as $\pi_\theta(a_t | s_t)$ and $p(s_{t+1} | s_t, a_t)$.
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
Trajectory Probability Derivation
1. Initial State Sampling
$$
p(s_1) \quad \text{(Sampled from initial state distribution)}
$$
2. From Time Step 1 to T: Markov Transitions with Policy
- For each time $t = 1, \ldots, T$, we alternate (see the sampling sketch after this derivation):
- Action selection: $a_t \sim \pi_\theta(a_t \mid s_t)$
- State transition: $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
- So the joint conditional distribution for each step is:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = \pi_\theta(a_{t+1} \mid s_{t+1}) \cdot p(s_{t+1} \mid s_t, a_t)
$$
3. By the Chain Rule of Probability:
$$
p_\theta(\tau) = p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2, a_2 \mid s_1, a_1) \cdot p(s_3, a_3 \mid s_2, a_2) \cdots p(s_T, a_T \mid s_{T-1}, a_{T-1})
$$
Unfolding this using the conditional factorization:
$$
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)
$$
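As promised above, here is a minimal sampling sketch of this generative process, assuming a hypothetical 2-state, 2-action tabular MDP (all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi[s, a] = pi_theta(a | s)
p_s1 = np.array([0.5, 0.5])                # initial state distribution p(s_1)

def rollout(T):
    """Sample a trajectory (s_1, a_1, ..., s_T, a_T) under the fixed policy."""
    traj = []
    s = rng.choice(2, p=p_s1)              # s_1 ~ p(s_1)
    for _ in range(T):
        a = rng.choice(2, p=pi[s])         # a_t ~ pi_theta(a_t | s_t)
        traj.append((int(s), int(a)))
        s = rng.choice(2, p=P[s, a])       # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return traj

print(rollout(T=5))
```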
Trajectory Probability Derivation (cont.)
$$
p_\theta(\tau) = p(s_1, a_1, \dots, s_T, a_T)
= p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)
$$
\begin{align*}
p_\theta(\tau)
&= p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2 \mid s_1, a_1) \cdot \pi_\theta(a_2 \mid s_2) \cdot p(s_3 \mid s_2, a_2) \cdots \\
&= p(s_1)\cdot \pi_\theta(a_1 \mid s_1)\underbrace{ p(s_2 \mid s_1, a_1)
\cdot \pi_\theta(a_2 \mid s_2) }_{= p(s_2, a_2 \mid s_1, a_1)}
\underbrace{ p(s_3 \mid s_2, a_2) \cdot \pi_\theta(a_3 \mid s_3)}_{= p(s_3, a_3 \mid s_2, a_2)} \cdots
\end{align*}
Here each underbraced grouping is exactly the one-step factorization derived above:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
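A minimal sketch of this product in code, assuming the same kind of hypothetical tabular MDP as above (the final factor $p(s_{T+1} \mid s_T, a_T)$ is omitted because the trajectory, as written, ends at $(s_T, a_T)$):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi[s, a] = pi_theta(a | s)
p_s1 = np.array([0.5, 0.5])                # initial state distribution p(s_1)

def trajectory_prob(states, actions):
    """p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    prob = p_s1[states[0]]
    for t in range(len(actions)):
        prob *= pi[states[t], actions[t]]                    # pi_theta(a_t | s_t)
        if t + 1 < len(states):
            prob *= P[states[t], actions[t], states[t + 1]]  # p(s_{t+1} | s_t, a_t)
    return prob

# Example: tau = (s_1=0, a_1=0, s_2=0, a_2=1, s_3=1, a_3=1)
print(trajectory_prob(states=[0, 0, 1], actions=[0, 1, 1]))
```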
As a simpler analogy, consider a process $x_1, x_2, x_3, \ldots$ that is Markovian, i.e., $p(x_{t+1} \mid x_1, \ldots, x_t) = p(x_{t+1} \mid x_t)$. Then
\begin{align*}
p(x_1, x_2, x_3)&= p(x_1)p( x_2, x_3| x_1)\\
&=p(x_1)p(x_2|x_1) p(x_3|x_1, x_2)\\
&=p(x_1)p(x_2|x_1)p(x_3|x_2)
\end{align*}
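The same collapse can be checked numerically. The sketch below assumes a hypothetical 2-state Markov chain, builds the joint $p(x_1, x_2, x_3)$ from the Markov factorization, and verifies that the conditional $p(x_3 \mid x_1, x_2)$ recovered from that joint does not depend on $x_1$:

```python
import numpy as np

# Hypothetical 2-state Markov chain (illustrative numbers only).
p_x1 = np.array([0.6, 0.4])              # p(x_1)
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])               # A[i, j] = p(x_{t+1}=j | x_t=i)

# Joint built from the Markov factorization p(x_1) p(x_2|x_1) p(x_3|x_2).
joint = np.einsum('i,ij,jk->ijk', p_x1, A, A)

# The conditional p(x_3 | x_1, x_2) recovered from the joint equals
# p(x_3 | x_2) for every value of x_1.
cond = joint / joint.sum(axis=2, keepdims=True)   # p(x_3 | x_1, x_2)
assert np.allclose(cond[0], A) and np.allclose(cond[1], A)
print(cond)
```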