Why is the optimal policy in a Markov Decision Process (MDP) independent of the initial state?

[screenshot of the relevant paragraph from the CS229 notes]

I was following the reinforcement learning lecture notes for CS229 (which can be referenced for the notation I am using in this question): http://cs229.stanford.edu/notes/cs229-notes12.pdf, and I had a question about the paragraph above.

My specific question is: why does $\pi^*$ have the property that it is the optimal policy for all states? I guess the author didn't prove it in the notes because it's obvious to him, but why is it true? Is there a proof of it somewhere? Or does someone know at least an intuitive argument for it? I am just very surprised, because it seems very counterintuitive, especially because of the way the optimal value function is defined on page 4. It defines the optimal value function:

$$V^*(s) = \max_{\pi} V^{\pi}(s)$$

The way I understand it, this is the best possible expected sum of discounted rewards that can be attained by using any policy. However, it seems to be a function of $s$, and for each $s$ we maximize over $\pi$. So how come we don't end up with a different optimal $\pi$ for each state?

For more notation relevant to my question, read below (or read page 4, or pages 1 through 4 of the notes I linked). Recall what the value function is:

$$V^{\pi}(s) = E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s ; \pi]$$

which is the expected sum of discounted rewards upon starting in state $s$ and taking actions according to the given policy $\pi$ (note that $\pi$ is not a random variable but a "fixed" parameter mapping states to actions).

On page 4 of the CS229 notes, it defines the following quantities. We can re-write Bellman's equations with this "best" value function:

$$V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^*(s')$$

which says that the best value function for state $s$ is the immediate reward plus the reward from the action that maximizes our weighted future pay-off, i.e. plus the reward of doing the best thing now that would make us get the best pay-off in the future. From that we see that we can get the best policy by "extracting" the best action for each state according to the equation above:

$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$$

Then it states that for every state $s$ and every policy $\pi$ we have:

$$V^*(s) = V^{\pi^*}(s) \geq V^{\pi}(s)$$
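For concreteness, here is a small numerical sketch of the property I find surprising. The MDP below (3 states, 2 actions, made-up rewards and transition probabilities, not taken from the notes) is purely illustrative: value iteration produces one $V^*$, a single greedy policy is extracted from it, and that one policy appears to dominate every other deterministic policy at every state simultaneously.

```python
import numpy as np
from itertools import product

# Hypothetical 3-state, 2-action MDP (all numbers are made up for illustration).
n_states, n_actions = 3, 2
gamma = 0.9
R = np.array([0.0, 1.0, 10.0])                 # R(s), state-based reward as in the notes
P = np.array([                                 # P[s, a, s'] = P_{sa}(s')
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]],
    [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
])

# Value iteration: V*(s) = R(s) + max_a gamma * sum_s' P_{sa}(s') V*(s')
V = np.zeros(n_states)
for _ in range(1000):
    V = R + gamma * np.max(P @ V, axis=1)

# Extract one policy from V*: pi*(s) = argmax_a sum_s' P_{sa}(s') V*(s')
pi_star = np.argmax(P @ V, axis=1)

def evaluate(pi):
    """Exact policy evaluation: solve V = R + gamma * P_pi V."""
    P_pi = P[np.arange(n_states), pi]          # transition matrix under policy pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

V_star = evaluate(pi_star)

# The single policy pi* dominates every deterministic policy at every state.
for pi in product(range(n_actions), repeat=n_states):
    assert np.all(V_star >= evaluate(np.array(pi)) - 1e-8)

print("pi* =", pi_star, "is optimal simultaneously for all states")
```

So empirically the same $\pi^*$ maximizes $V^{\pi}(s)$ for every starting state $s$ at once, which is exactly the claim whose proof or intuition I am asking about.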

asked Jan 10, 2015 at 4:11

Charlie Parker