# Reinforcement Learning 3: Finite Markov Decision Processes (part-1)

Learning along the way …..

# Markov Chains and Markov Process

According to Wikipedia,

“A

Markov chainis a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event”.

**Andrey Andreyevich Markov **was a phenomenal Russian mathematician who is best known for his work on stochastic processes. In probability theory , any process with which some randomness is associated is called stochastic process. Markov chain talks about a sequence of events in which the probability of every event depends on its previous event. For example if there is a sequence of events like e1, e2, e3, e4, e5, e6…….. whose corresponding states are s1, s2, s3, s4, s5, s6….. respectively then the probability of occurrence of event say e5 corresponding to state s5 depends only on state s4 of event e4.

Thus mathematically,

P(s5|s4) = P(s5|s4,s3,s2,s1)

Yes, the vertical line in above equation has the same meaning of conditional probability. Thus s4 is the present state and P(s5|s4) is the probability of occurrence of s5. So the probability of state s5 given s4 has occurred is same as probability of state s5 given s4,s3,s2 and s1 has occurred. Thus to predict s5, s4 is enough and the information associated with s3, s2 and s1 is not necessary. Such a sequence of possible events is called as Markov chain. A simple example of Markov chain is a game of snakes and ladders which is played with a dice. Every next move in such a game depends on its current state(position) and not on previous states.

The same concept can be applied to a Markov process.

According to Wikipedia,

A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present values) depends only upon the present state; that is, given the present, the future does not depend on the past. A process with this property is said to be

Markovianor aMarkov process.

Thus a process in which the occurrence of future states depends only on the present state and not on the past states.

# Markov Decision Process

According to Wikipedia,

In mathematics, a

Markov decision process(MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.At each time step, the process is in some state s and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s ′ and giving the decision maker a corresponding reward R( s , s ′ )

There are various things to understand

- MDP is a discrete process. That means the events are going to occur at discrete time steps like t1, t2, t3, t4, t5……. and so on.
- The outcomes are partly random. They are not fully under the control of a decision maker. That means there is some stochasticity associated with them. The decision maker is the one that was referred as Agent in my last article.
- In every state s, the Agent has a set of possible actions available to him. Depending on which action he chooses, he goes into a new state denoted by s’. And depending on the effectiveness of this action, he gets a reward R.

# Finite Markov Decision Process

A Markov Decision Process has three elements associated with it.

**S:**A set of all possible states also called as state space.**A:**A set of all possible actions also called as action space.**R:**Immediate reward that the agent receives after transitioning from state s to s’.

A Markov Decision Process in which there are finite number of elements in set S, A and R, i.e. there are finite number of states, actions and rewards, is called as Finite Markov Decision Process.

If the previous state is s and a is the action taken in previous state, then the probability of next state being s’ with a reward of r is given by

The equal to sign with a dot on it indicates that its not really an equation. The RHS is merely elaborated definition of LHS. So if s and a are the state and action in time step t-1 then the probability of s’ being the next state with reward r is given by above equation/definition. The quantities on right hand side of vertical line | in LHS and RHS are previous state and previous action while those on left right hand side of vertical line are probable present state and reward.

For this process to be a Markov decision process, every state in the process must include the necessary information about all the past interactions between agent and environment that may affect the future decisions. If every state, in process satisfies this condition then it is said to have the Markov property.

The expected rewards for state action pairs can be expressed as

E indicates the expected value of expression inside square bracket. The expression inside the square bracket is reward R at this time step t if the previous state was t-1 and previous action was a.

In general, actions can be any decisions we want to learn how to make and the states can be anything we can know that might be useful in making them.

## More about Agent and Environment

- Agent is part of environment but it is crucial to define the boundary between agent and environment.
- Anything, that an agent cant change arbitrarily is considered to be outside of it and is part of the environment.
- Every thing in the environment is not always assumed to be unknown to the agent.
- The agent often knows how the reward is calculated as a function of its actions and states.
- But the agent do not have any control over how the reward is computed.
- Sometimes the agent knows everything about the environment, but it doesn’t always mean that its easy for the agent to solve the Reinforcement Learning problem, just as we know how the Rubik’s cube works, but its not easy to solve it.
- The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.
- Reinforcement Learning problem is the problem of goal directed learning from interactions of agent with the environment inside the abstract MDP framework.
- Any such problem can be reduced to three signals : choices made by agent(actions), the basis on which the choices are made(states), the agent’s goal (rewards).

# Goals and Rewards

The environment gives a reward to agent for each action taken by an Agent. A reward is nothing but a scalar signal. It can be a positive reward for a good action, negative reward for a bad action and a zero reward for actions that do not matter much. This is just one way in which this can be done. There can be others schemes as well. The agent’s goal is to maximize the rewards. Not just the immediate reward for present action taken from present state, but also all the future rewards. In other words, we can say the goal is to maximize the cumulative reward in the long run.

So the reward hypothesis goes like this:

That all of what we meant by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)

For example, if we want to train a robot to find and collect empty soda cans for recycling, then we can give a positive reward, every time it finds a can. We can give a negative reward, every time it bumps into things, so that it learn to escape things. If we give a small negative reward for every step, that do not result in collection of a can, the robot will learn to find the cans as quickly as possible.

There can be situations in which rewards can be given for intermediate steps. For example in games where the player is running along a track or a path in a territory, then intermediate rewards can be given for collecting gold coins or weapons. But sometimes giving intermediate reward may actually land up the agent in a bad situation. For example while playing chess, we know that in some situations, we may sometimes have to sacrifice our pieces or skip the chance of taking opponents pieces, to increase our chances of winning. So a positive reward can be given to agent, only if the agent wins the game and not for taking the opponent’s pieces. Otherwise the agent instead of trying to win the game, may try to maximize the reward by taking the opponent’s pieces. It may make moves for taking the opponent’s pieces that may result in making its own kings vulnerable or loosing the opportunity of opponent’s checkmate.

The final take away is “ The reward signal is your way of communicating to the agent what you want it to achieve, not how you want it achieved”