Reinforcement Learning 6: Exploration vs Exploitation

Ashutosh Makone
3 min read · Aug 22, 2021

How greedy is good enough?

Before getting into Q-Learning, let's understand the important concept of Exploration vs Exploitation.

Exploration vs Exploitation

Exploration means trying new possibilities in the hope of finding better rewards. Exploitation means repeatedly choosing the same actions that have given significant rewards in the past. A restaurant example illustrates this. Suppose I have recently moved to a new city. On a lazy weekend, I am fed up with cooking for myself and want to try a restaurant. I start visiting a different restaurant every weekend. Within two or three weekends I discover a restaurant that I find really good, so for the next few weekends I visit the same one. It has now become my number one favorite restaurant. This is called exploitation: repeatedly choosing something that is known to give a good result or reward. But by visiting only this restaurant, I might be missing out on restaurants that are even better, and I will never know what I am missing. So a better strategy is to visit a different restaurant once in a while. Doing this may sometimes lead to disappointment (because the food is not as good as at my number one favorite restaurant), but sometimes I may find a restaurant that is better than my previous favorite. This is called exploration: trying new, unknown actions that may lead to either disappointment or a better reward.

Thus there are two terms: Exploration and Exploitation. Exploitation gives the agent a dependable reward based on what it already knows, but that reward may not be the best one available in the environment, and much of the environment remains unexplored. In Exploration, the agent chooses actions it hasn't tried before, which take it to unknown states; this may lead to either disappointment or a better reward. For the best possible results from reinforcement learning, both Exploration and Exploitation are important.
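To see the trade-off in numbers, here is a minimal sketch of the restaurant story in Python. Everything in it is an assumption made for illustration: the function name `run_weekends`, the parameter `explore_prob` (the chance of trying a random restaurant on a given weekend), the three quality scores, and the simplification that a restaurant always gives exactly the same reward.

```python
import random

def run_weekends(true_quality, explore_prob, weekends=500, seed=0):
    """Simulate picking one restaurant per weekend.

    true_quality: hypothetical reward of each restaurant (hidden from the diner).
    explore_prob: chance of trying a random restaurant instead of the favorite.
    Returns (average reward per weekend, number of visits per restaurant).
    """
    rng = random.Random(seed)
    n = len(true_quality)
    estimates = [0.0] * n   # what the diner currently believes about each place
    visits = [0] * n
    total = 0.0
    for _ in range(weekends):
        if rng.random() < explore_prob:
            choice = rng.randrange(n)                         # exploration
        else:
            choice = max(range(n), key=lambda i: estimates[i])  # exploitation
        reward = true_quality[choice]  # assumption: no noise in the reward
        visits[choice] += 1
        # Running average of the rewards seen at this restaurant.
        estimates[choice] += (reward - estimates[choice]) / visits[choice]
        total += reward
    return total / weekends, visits

quality = [3.0, 5.0, 8.0]  # made-up scores; the best restaurant is the last one
greedy_avg, greedy_visits = run_weekends(quality, explore_prob=0.0)
explore_avg, explore_visits = run_weekends(quality, explore_prob=0.1)
print("never exploring:    ", greedy_avg, greedy_visits)
print("sometimes exploring:", explore_avg, explore_visits)
```

With `explore_prob=0.0` the diner settles on the first restaurant that gives any reward and never learns about the others. With a small exploration chance, the better restaurants are eventually found, at the cost of an occasional weaker meal.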

Finding the right balance between Exploration and Exploitation is an important question, and there are various strategies to deal with it. This dilemma is characteristic of reinforcement learning; it does not arise in the same way in standard supervised or unsupervised learning.

In a planning context, exploration means trying actions that improve the model, whereas exploitation means behaving in the optimal way given the current model.

To make it even clearer, let me give a quick example.

Fig 1: Exploration vs. Exploitation

Suppose the agent is in state 5. It starts moving through the environment randomly. It goes up to state 1 and then to state 2, where the reward is +3, as written in brackets. Now it knows that the path 5–1–2 always gives a reward of +3. If it keeps selecting these actions, that is called Exploitation. But this way the rest of the environment remains unexplored. So if, once in a while, it tries a different path to find out whether a better reward is waiting somewhere else, that is called Exploration. For example, if it moves to state 9 instead of state 1, then within a few episodes it may discover that there is a better reward in state 12.

Epsilon greedy strategy

One way to balance Exploration and Exploitation is the Epsilon greedy strategy. Epsilon is the exploration rate, and its value always lies between 0 and 1: with probability Epsilon the agent picks a random action (exploration), and otherwise it picks the action it currently believes is best (exploitation). When Epsilon = 1, the agent is in full exploration mode, so it only cares about exploring the environment. This is a good way to start. After a while, as the environment becomes more familiar, the agent can gradually reduce the value of Epsilon. If Epsilon drops to 0.9, for example, the agent is still exploring 90% of the time, while 10% of the time it grabs the rewards it discovered during past exploration. As the environment becomes more and more familiar and less of it remains to be explored, Epsilon keeps decreasing and exploiting becomes more important. In other words, the agent becomes greedier over time.
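A rough Python sketch of this idea might look like the following. The function names, the list `q_values` (the agent's current reward estimate for each action), and the decay settings (`start`, `end`, `decay`) are illustrative assumptions rather than part of any particular library, and exponential decay is only one of several possible ways to reduce Epsilon.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index using the epsilon greedy rule.

    q_values: current estimated value of each action (hypothetical input).
    epsilon: exploration rate between 0 and 1.
    """
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon exponentially with the episode number, down to a floor."""
    return max(end, start * decay ** episode)

q_values = [1.0, 5.0, 2.0]  # made-up estimates for three actions
for episode in (0, 10, 100, 1000):
    eps = decayed_epsilon(episode)
    action = epsilon_greedy_action(q_values, eps)
    print(f"episode {episode}: epsilon={eps:.3f}, action={action}")
```

Keeping a small floor such as `end=0.05` means the agent never stops exploring completely, which can help if the environment still holds something it has not seen.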

