Closed-Loop MDP Becomes a Markov Chain
Claim: Under a fixed policy $\pi_\theta$, an MDP induces a Markov chain over state-action pairs $(s_t, a_t)$, with transition probability:
$$
p((s_{t+1}, a_{t+1}) \mid (s_t, a_t)) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
Proof Sketch:
- Start with the joint conditional probability:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t)
$$
- Apply the chain rule of probability (i.e., the definition of conditional probability):
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot p(a_{t+1} \mid s_{t+1}, s_t, a_t)
$$
- Since, under the policy $\pi_\theta(a_{t+1} \mid s_{t+1})$, the selection of $a_{t+1}$ depends only on $s_{t+1}$:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
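To make the claim concrete, here is a minimal numerical sketch using a hypothetical 2-state, 2-action tabular MDP (the transition probabilities and the fixed policy below are made-up illustrative numbers, not from the notes): it assembles the induced transition matrix over $(s, a)$ pairs from $p(s' \mid s, a)$ and $\pi_\theta(a' \mid s')$ and checks that every row sums to one.

```python
import numpy as np

# Hypothetical tabular MDP: 2 states, 2 actions (illustrative numbers only).
# P[s, a, s'] = p(s' | s, a)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1
])
# pi[s, a] = pi_theta(a | s), an assumed fixed policy
pi = np.array([
    [0.7, 0.3],
    [0.4, 0.6],
])

n_states, n_actions = pi.shape

# Induced Markov chain over (s, a) pairs:
# T[(s, a), (s', a')] = p(s' | s, a) * pi_theta(a' | s')
T = np.zeros((n_states * n_actions, n_states * n_actions))
for s in range(n_states):
    for a in range(n_actions):
        for s_next in range(n_states):
            for a_next in range(n_actions):
                T[s * n_actions + a, s_next * n_actions + a_next] = (
                    P[s, a, s_next] * pi[s_next, a_next]
                )

# Each row is a valid conditional distribution over the next (s', a') pair.
assert np.allclose(T.sum(axis=1), 1.0)
print(T)
```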
Trajectory Factorization Diagram
Diagram omitted; see LaTeX for graphical details.
Interpretation: Each step factors as $\pi_\theta(a_t | s_t)$ and $p(s_{t+1} | s_t, a_t)$.
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
Trajectory Probability Derivation
1. Initial State Sampling
$$
p(s_1) \quad \text{(Sampled from initial state distribution)}
$$
2. From Time Step 1 to T: Markov Transitions with Policy
- For each time $t = 1, \ldots, T$, we alternate (see the sampling sketch after this derivation):
- Action selection: $a_t \sim \pi_\theta(a_t \mid s_t)$
- State transition: $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$
- So the joint conditional distribution for each step is:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = \pi_\theta(a_{t+1} \mid s_{t+1}) \cdot p(s_{t+1} \mid s_t, a_t)
$$
3. By the Chain Rule of Probability:
$$
p_\theta(\tau) = p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2, a_2 \mid s_1, a_1) \cdot p(s_3, a_3 \mid s_2, a_2) \cdots p(s_T, a_T \mid s_{T-1}, a_{T-1})
$$
Unfolding this using the conditional factorization:
$$
p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)
$$
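As promised above, here is a minimal sampling sketch of this generative process, assuming a hypothetical 2-state, 2-action tabular MDP (all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi[s, a] = pi_theta(a | s)
p_s1 = np.array([0.5, 0.5])                # initial state distribution p(s_1)

def rollout(T):
    """Sample a trajectory (s_1, a_1, ..., s_T, a_T) under the fixed policy."""
    traj = []
    s = rng.choice(2, p=p_s1)              # s_1 ~ p(s_1)
    for _ in range(T):
        a = rng.choice(2, p=pi[s])         # a_t ~ pi_theta(a_t | s_t)
        traj.append((int(s), int(a)))
        s = rng.choice(2, p=P[s, a])       # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return traj

print(rollout(T=5))
```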
Trajectory Probability Derivation (cont.)
$$
p_\theta(\tau) = p(s_1, a_1, \dots, s_T, a_T)
= p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t) \cdot p(s_{t+1} \mid s_t, a_t)
$$
\begin{align*}
p_\theta(\tau)
&= p(s_1) \cdot \pi_\theta(a_1 \mid s_1) \cdot p(s_2 \mid s_1, a_1) \cdot \pi_\theta(a_2 \mid s_2) \cdot p(s_3 \mid s_2, a_2) \cdots \\
&= p(s_1)\cdot \pi_\theta(a_1 \mid s_1)\underbrace{ p(s_2 \mid s_1, a_1)
\cdot \pi_\theta(a_2 \mid s_2) }_{= p(s_2, a_2 \mid s_1, a_1)}
\underbrace{ p(s_3 \mid s_2, a_2) \cdot \pi_\theta(a_3 \mid s_3)}_{= p(s_3, a_3 \mid s_2, a_2)} \cdots
\end{align*}
Here each underbraced grouping is exactly the one-step factorization derived above:
$$
p(s_{t+1}, a_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid s_t, a_t) \cdot \pi_\theta(a_{t+1} \mid s_{t+1})
$$
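A minimal sketch of this product in code, assuming the same kind of hypothetical tabular MDP as above (the final factor $p(s_{T+1} \mid s_T, a_T)$ is omitted because the trajectory, as written, ends at $(s_T, a_T)$):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi[s, a] = pi_theta(a | s)
p_s1 = np.array([0.5, 0.5])                # initial state distribution p(s_1)

def trajectory_prob(states, actions):
    """p_theta(tau) = p(s_1) * prod_t pi_theta(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    prob = p_s1[states[0]]
    for t in range(len(actions)):
        prob *= pi[states[t], actions[t]]                    # pi_theta(a_t | s_t)
        if t + 1 < len(states):
            prob *= P[states[t], actions[t], states[t + 1]]  # p(s_{t+1} | s_t, a_t)
    return prob

# Example: tau = (s_1=0, a_1=0, s_2=0, a_2=1, s_3=1, a_3=1)
print(trajectory_prob(states=[0, 0, 1], actions=[0, 1, 1]))
```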
As a simpler analogy, consider a process $x_1, x_2, x_3, \ldots$ that is Markovian, i.e., $p(x_{t+1} \mid x_1, \ldots, x_t) = p(x_{t+1} \mid x_t)$. Then
\begin{align*}
p(x_1, x_2, x_3)&= p(x_1)p( x_2, x_3| x_1)\\
&=p(x_1)p(x_2|x_1) p(x_3|x_1, x_2)\\
&=p(x_1)p(x_2|x_1)p(x_3|x_2)
\end{align*}
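The same collapse can be checked numerically. The sketch below assumes a hypothetical 2-state Markov chain, builds the joint $p(x_1, x_2, x_3)$ from the Markov factorization, and verifies that the conditional $p(x_3 \mid x_1, x_2)$ recovered from that joint does not depend on $x_1$:

```python
import numpy as np

# Hypothetical 2-state Markov chain (illustrative numbers only).
p_x1 = np.array([0.6, 0.4])              # p(x_1)
A = np.array([[0.9, 0.1],
              [0.3, 0.7]])               # A[i, j] = p(x_{t+1}=j | x_t=i)

# Joint built from the Markov factorization p(x_1) p(x_2|x_1) p(x_3|x_2).
joint = np.einsum('i,ij,jk->ijk', p_x1, A, A)

# The conditional p(x_3 | x_1, x_2) recovered from the joint equals
# p(x_3 | x_2) for every value of x_1.
cond = joint / joint.sum(axis=2, keepdims=True)   # p(x_3 | x_1, x_2)
assert np.allclose(cond[0], A) and np.allclose(cond[1], A)
print(cond)
```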