Reinforcement Learning 2: Terminology and Bellman Equation

Ashutosh Makone
7 min read · Jul 11, 2021

… and so the adventure begins !!!!


Richard S. Sutton and Andrew G. Barto, in their book "Reinforcement Learning: An Introduction", which is the main inspiration for this series of articles, describe it as…

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics — trial-and-error search and delayed reward — are the two most important distinguishing features of reinforcement learning.

Ohhh wait !!! Is that too much to take in one go? Let's start by understanding the basic terminology with the help of a simple example of a maze…

Terminology

Figure 1 shows a simple maze.

Figure 1 : Maze

Agent : An agent is someone who explores an environment and tries to collect rewards in it. In our example, it's a robot, which currently stands in state 1. In general it can be any smart entity, like a piece of software. The goal of an agent is to maximize the reward.

Environment : The situation which is being explored by the agent is called the environment. The entire grid with all 12 states is the environment in figure 1.

States : Various positions/locations in the environment are called states. There are 12 states, numbered 1 to 12, in this environment. There may also be forbidden states, like state 6 in the figure. So if our environment is a room, then state 6 may be surrounded by walls, so that the robot is unable to get into that state.

Actions : Actions are the moves the agent is allowed to make in the environment. Our robot is allowed to take four actions: move-up, move-down, move-left, move-right. These actions are limited by the current position of the agent. For example, from state 1, it can either move right or move down, so the other actions are not allowed. When in state 11, it can move left, right or up.
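
To make "actions depend on the current position" concrete, here is a tiny Python sketch for the 4 x 3 grid of Figure 1. The row-by-row state numbering (1–4, 5–8, 9–12) and the helper name valid_actions are my own reading of the figure, not code from the article.

```
def valid_actions(state):
    """Allowed moves from a state in the 4 x 3 grid; state 6 is walled off."""
    row, col = divmod(state - 1, 4)   # states 1-4, 5-8, 9-12 laid out row by row
    candidates = {
        "move-up":    (row - 1, col),
        "move-down":  (row + 1, col),
        "move-left":  (row, col - 1),
        "move-right": (row, col + 1),
    }
    allowed = []
    for action, (r, c) in candidates.items():
        if 0 <= r <= 2 and 0 <= c <= 3 and r * 4 + c + 1 != 6:
            allowed.append(action)
    return allowed

print(valid_actions(1))   # ['move-down', 'move-right']
print(valid_actions(11))  # ['move-up', 'move-left', 'move-right']
```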

Reward: When the agent moves in the maze and reaches a desirable position, it gets some positive reward, while there may be positions the agent should not enter if it wants to avoid a loss. In the maze, state 12, which has a trophy symbol in it, is the desirable state, so the agent might get a reward of +10 when it reaches there. But if by mistake it enters state 8, which has a danger symbol in it, it might get a reward of -10. So to maximize the reward, the agent has to reach state 12. In other words, a reward is the result of a certain action that an agent takes to move into a certain state. Remember that initially the agent knows nothing about the environment and it learns to maximize its rewards with experience, through many trials and errors. Apart from states 8 and 12, other states can also be assigned rewards, depending on the problem statement.

For example, apart from states 6, 8 and 12, we assign a reward of -1 for moving into every other state. Thus if the agent moves from state 1 along 2–3–7–11–12, the total reward is (-1 - 1 - 1 - 1 + 10) = 6, while if it moves from state 1 along 2–3–4–3–7–11–12 then the total reward is (-1 - 1 - 1 - 1 - 1 - 1 + 10) = 4. By comparison, the agent learns to avoid going from 3 to 4. This is how we force it to learn the best way to maximize the cumulative reward. Hence the name reinforcement learning.
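
Here is a quick sketch that reproduces the two totals above. The helper names are mine; the reward convention (-1 for every ordinary state entered, +10 for the trophy state 12) is the one from the example.

```
def entry_reward(state):
    # +10 for entering the trophy state 12, -1 for any other (non-forbidden) state
    return 10 if state == 12 else -1

def path_reward(path):
    # path = the sequence of states entered after leaving state 1
    return sum(entry_reward(s) for s in path)

print(path_reward([2, 3, 7, 11, 12]))        # 6
print(path_reward([2, 3, 4, 3, 7, 11, 12]))  # 4
```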

The reward is usually denoted by R, and the general form of expressing it is as follows:

R(s,a) = reward of moving from state s with action a.

For example, in our grid

R(s1,”move-right”) = -1

R(s4,”move-down”)= -10

R(s11,”move-right”)= +10
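
Written as code, this reward function can be a small lookup table keyed by (state, action). The -1 fallback for moves that are not listed follows the "-1 for moving into every other state" convention above; the names are mine.

```
# Reward table R(s, a) for the three examples from the grid
R = {
    (1,  "move-right"): -1,
    (4,  "move-down"):  -10,   # moving into the danger state 8
    (11, "move-right"): +10,   # moving into the trophy state 12
}

def reward(s, a):
    # Any move not listed above gets the default step reward of -1 (my assumption)
    return R.get((s, a), -1)

print(reward(11, "move-right"))   # 10
```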

Interaction between Agent and Environment

The interaction between the agent and the environment can be nicely explained using figure 2 below.

Figure 2: Interaction between Agent and Environment

Suppose the agent is in state s1 and from there it takes an action a1 at time step t1, and due to this interaction with the environment it gets a reward R2 and goes into state s2. Again, at time step t2 it takes an action a2, due to which it gets a reward R3 and goes into state s3. Thus the interaction proceeds as per the following sequence:

s1,a1,R2,s2,a2,R3,s3,a3,R4………… and so on

In our maze example, this process goes on until the agent either reaches s8, where it gets the worst reward and the process terminates, or reaches s12, where it gets the best reward and the process terminates. This entire sequence becomes one experience for the agent, and the process then restarts from s1, a1, R2, s2, a2, R3 … and so on, for the next experience.
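
A minimal sketch of this interaction loop, with the maze transitions encoded from Figure 1. The grid arithmetic, the treatment of disallowed moves (the agent simply stays put), and the random untrained agent are my assumptions, not code from the article.

```
import random

GOAL, DANGER, BLOCKED = 12, 8, 6
MOVES = {"move-up": (-1, 0), "move-down": (1, 0),
         "move-left": (0, -1), "move-right": (0, 1)}

def step(state, action):
    """Apply one action; return (next_state, reward)."""
    row, col = divmod(state - 1, 4)
    dr, dc = MOVES[action]
    r, c = row + dr, col + dc
    nxt = r * 4 + c + 1
    if not (0 <= r <= 2 and 0 <= c <= 3) or nxt == BLOCKED:
        return state, -1              # disallowed move: stay put (my simplification)
    if nxt == GOAL:
        return nxt, +10
    if nxt == DANGER:
        return nxt, -10
    return nxt, -1

state, history = 1, []
while state not in (GOAL, DANGER):        # one episode, i.e. one "experience"
    action = random.choice(list(MOVES))   # untrained agent: pick a move at random
    state, reward = step(state, action)
    history.extend([action, reward, state])
print(history[-3:])                       # last action, reward and terminal state
```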

There are two important things about this interaction:

  1. The environment is completely unknown to the agent. The agent only knows that it can move left, right, up or down. But it doesn't know how many states there are in the environment. It doesn't know that there is a big reward waiting for it at s12, or a big loss waiting for it at s8. It has to learn everything from experience.
  2. The environment is not static. It may change over time.

Bellman Equation

The Bellman equation is used to calculate the value function for all valid states, and for our deterministic maze it can be written as follows:

V(s) = max over actions a of [ R(s,a) + gamma · V(s') ]

First I will explain this equation, then I will give a sample calculation for the Bellman equation to get the value function, and then I will show how to get the optimal path using the value function. So have patience…

The gamma in the Bellman equation is called the discounting factor. Its value is in the range 0 to 1. For now let's consider that gamma = 1 (more about it in a while).

The R(s,a) term in the equation is called the immediate/instant reward. For example, in our maze environment R(s1,"move-right") = -1. Now s' is the next state where the agent goes from state s. Hence V(s') is the value function of state s', i.e. the maximum reward that can be obtained when the agent starts from s'. Consequently, to get V(s') we need R(s',a) and V(s''), and so on.

So if s1 is the starting state, then from there the agent can go to 2 or 5. If it goes to 2, then the next state can be 1 or 3 (yes, the agent can go back; the environment is completely unknown to it). If, from 1, it goes to 5, then the next state can be 1 or 9. From 3 it can go to 2, 4 or 7. And so on… There is an enormous number of possibilities.

Sample Calculations

In the Bellman equation, s indicates the current state and s' indicates the next state. So let's calculate the value function for state 1. As I said, there is an enormous number of possibilities for roaming around the maze if you start from 1. We have to calculate the RHS of the Bellman equation for all those possibilities and choose the maximum as v(1), i.e. the value function of 1. Evidently 1–2–3–7–11–12 looks like the optimal path, hence let's do the calculations for this one.
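
Written out in code, that recursive expansion looks as follows, with gamma = 1, the rewards defined earlier, and the value of the terminal state 12 taken as 0 (that last convention is my assumption).

```
# Recursive expansion of the Bellman equation (gamma = 1) along 1-2-3-7-11-12
v12 = 0             # terminal state: no further reward (my assumption)
v11 = 10 + v12      # R(11, "move-right") + v(12) = 10
v7  = -1 + v11      # 9
v3  = -1 + v7       # 8
v2  = -1 + v3       # 7
v1  = -1 + v2       # 6
print(v1)           # 6
```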

This is just one of the values obtained from one possible path from state 1. You can see that these calculations are recursive. Doing the same calculations for all other possible paths will give more values for v(1), and as the equation suggests, we have to choose the maximum out of them. In our maze, 6 is evidently the maximum value possible. You must have already guessed that the same value can be obtained over the path 1–5–9–10–11–12. But still, 6 is the maximum we have got.

Hence v(1) = 6.

Similarly we can calculate v(2) by considering all the possible paths that start from s2 and reach any other state. Please try these calculations. I am giving all the values in figure 3 below.

Figure 3: Maze with value functions for all states

The value functions for all states are given in red, and these values are circled in figure 3. After getting these values, it is now easier for the agent to navigate for maximum reward: it just has to move in the direction of increasing values of the value function. So if the agent is in state s3, it can go to s2, s4 or s7. But s7 has the higher value function, so it will move to s7, and so on. Our agent has now learned by the reinforcement technique.
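
For the curious, here is one way the whole table of values could be computed, by applying the Bellman equation to every state repeatedly (value iteration) and then following increasing values greedily. The grid encoding and the choice to keep the terminal states 8 and 12 at value 0 are my assumptions, not the article's code.

```
GOAL, DANGER, BLOCKED = 12, 8, 6
GAMMA = 1.0

def neighbours(state):
    # Reachable next states in the 4 x 3 grid; state 6 is walled off
    row, col = divmod(state - 1, 4)
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        r, c = row + dr, col + dc
        nxt = r * 4 + c + 1
        if 0 <= r <= 2 and 0 <= c <= 3 and nxt != BLOCKED:
            yield nxt

def reward(nxt):
    return 10 if nxt == GOAL else -10 if nxt == DANGER else -1

V = {s: 0.0 for s in range(1, 13) if s != BLOCKED}
for _ in range(100):                      # sweep until the values settle
    for s in V:
        if s in (GOAL, DANGER):
            continue                      # terminal states keep V = 0
        V[s] = max(reward(n) + GAMMA * V[n] for n in neighbours(s))

print(V[1])                               # 6.0, the value function of state 1

# Greedy navigation: from each state, move to the neighbour with the highest value
state, path = 1, [1]
while state not in (GOAL, DANGER):
    state = max(neighbours(state), key=lambda n: reward(n) + GAMMA * V[n])
    path.append(state)
print(path)                               # [1, 5, 9, 10, 11, 12], one of the two optimal paths
```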

What about discounting factor (gamma) ?

In the Bellman equation, R(s,a) is the instant reward of the current action and V(s') is the cumulative future reward of all the actions that follow from s' onward. Thus the value of gamma decides how much importance is given to the cumulative future reward as compared to the instant reward R(s,a). If gamma = 1, then equal importance is given to all future rewards. If gamma = 0.9, then 90% importance is given to the reward at the next state s', 81% (0.9 x 0.9 = 0.81) importance is given to the reward at the state after that, and so on. Accordingly, all the calculations will change.
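
A small sketch of what that weighting does to the return along the optimal path; the helper name discounted_return is mine.

```
def discounted_return(rewards, gamma):
    # Weight each reward by gamma raised to the number of steps into the future
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

path_rewards = [-1, -1, -1, -1, 10]           # rewards collected along 1-2-3-7-11-12
print(discounted_return(path_rewards, 1.0))   # 6.0  (all rewards count equally)
print(discounted_return(path_rewards, 0.9))   # ~3.12 (future rewards count less)
```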

