Reinforcement Learning 8: Pick and Place Robot in an E-Commerce Store Warehouse i.e. Q-Learning in Action

Ashutosh Makone
6 min read · Aug 31, 2021

Walk the Talk

Generally, when embedded-systems people or robotics hobbyists experiment with a pick and place robot without using reinforcement learning, they feed a full map of the environment into the robot's memory. What we are doing here is completely different. The robot will be trained using Q-Learning. Initially it will know nothing about the environment. It doesn't even know that there is a destination it has to reach. It will figure out everything over a lot of episodes using Q-Learning.

It is very fascinating to see Q-Learning in action. To demonstrate it, I am considering a pick and place robot in an e-commerce store warehouse. We will give the location of an object, which will also be the starting location of the robot, and from there it will find the best path to carry the object to the packaging area. There are obstacles in the path, and there is a penalty (negative reward) if the robot bumps into an obstacle. The robot also has to reach the packaging area in the minimum amount of time, so there is a small negative reward for every step it takes. The environment is shown below, in a not so pretty diagram. Take time to understand the environment to fully appreciate the power of Q-Learning.

Fig 1: Map of warehouse of E-Commerce store.

Understanding The Environment

The total environment is divided into an 8 by 8 grid, so there are 8 rows and 8 columns. The rows and columns are indexed from 0 to 7, as shown at the left and top of figure 1. These indices are also used in the code for traversing the grid with loops. Every cell in the grid is also given a code name, shown in the top right corner of each cell. The names run A1, A2, …, A8 for the first row, and so on. The red cells are the ones that are forbidden for the robot; if it bumps into this red area, there is a reward of -20, as shown at the bottom. From the information given at the bottom, it is also clear that reaching the target at cell E8, which is at index (4,7), has a reward of 100 (that's where the packaging area is), and the robot is allowed to move along all the white cells, which have a reward of -1. This negative reward of -1 motivates the robot to hurry up. The final result of the Q-Learning code is given in terms of code names rather than indices, so that it is easier to trace the output. In fact, that was the very purpose of the code names.
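To make the reward scheme concrete, here is a minimal sketch of how such a reward map could be encoded as a NumPy array. The red-cell coordinates below are placeholders for illustration only; the actual forbidden cells are the ones marked red in figure 1, and the real array is in the GitHub repository.

```python
import numpy as np

# Reward scheme from figure 1: -1 for white cells, -20 for red (forbidden)
# cells, and 100 for the packaging area at cell E8, i.e. index (4, 7).
map_rewards = np.full((8, 8), -1)   # start by treating every cell as white
map_rewards[4, 7] = 100             # packaging area (E8)

# Placeholder red cells -- the real coordinates come from figure 1.
for row, col in [(1, 2), (2, 2), (5, 5)]:
    map_rewards[row, col] = -20
```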

Understanding The Code

The code is available on my GitHub here. Please keep an eye on the code while reading this section. The code is mostly self-explanatory (really!).

First we create a NumPy array that represents the map of rewards in the warehouse: the array element is -1 for white cells, -20 for red cells and 100 for the green cell. Then we define epsilon for the Epsilon Greedy Strategy, followed by the discounting factor, the learning rate and the number of actions. All these terms are explained in my previous articles. A Q-table is defined as a NumPy array and initialized with zeros. It is a three-dimensional array, with the number of rows and columns of the warehouse as the first two dimensions and the number of actions, i.e. 4, as the third dimension.
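As a rough sketch (the exact values and variable names are in the repository, so treat the numbers below as illustrative), this setup could look like the following, reusing the map_rewards array from above:

```python
epsilon = 0.9       # probability of picking the greedy action (Epsilon Greedy Strategy)
gamma = 0.9         # discounting factor
alpha = 0.9         # learning rate
num_actions = 4     # up, right, down, left

# Q-table: one Q value per (row, column, action) triple, initialized to zeros
q_table = np.zeros((8, 8, num_actions))
```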

A dictionary is created to map the indices to the code names, so a tuple containing the indices is the key and a code name is the value. A function called location_next( ) takes the current row, current column and selected action as input and returns the new row and new column that the agent ends up in after performing the action. The action is decided by the action_next( ) function using the Epsilon Greedy Strategy. The start_pos( ) function assigns a random starting position to the agent for every episode of training.
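The exact bodies of these functions are in the repository; the sketch below shows one plausible way they could be written, reusing the map_rewards array, q_table and hyperparameters from above. Note the convention used later in the article: epsilon is the probability of exploiting the Q-table, so epsilon = 1 means pure exploitation.

```python
actions = ['up', 'right', 'down', 'left']

# Map (row, column) indices to the code names of figure 1: row 0 -> 'A', row 1 -> 'B', ...
code_names = {(r, c): f"{chr(ord('A') + r)}{c + 1}" for r in range(8) for c in range(8)}

def action_next(row, col, epsilon):
    """Epsilon Greedy Strategy: with probability epsilon exploit the Q-table,
    otherwise explore with a random action."""
    if np.random.random() < epsilon:
        return int(np.argmax(q_table[row, col]))
    return np.random.randint(num_actions)

def location_next(row, col, action):
    """Return the (row, column) reached by taking `action` from (row, col),
    staying inside the 8 x 8 grid."""
    if actions[action] == 'up' and row > 0:
        row -= 1
    elif actions[action] == 'right' and col < 7:
        col += 1
    elif actions[action] == 'down' and row < 7:
        row += 1
    elif actions[action] == 'left' and col > 0:
        col -= 1
    return row, col

def start_pos():
    """Pick a random white (legal, non-terminal) cell as the episode's starting position."""
    row, col = np.random.randint(8), np.random.randint(8)
    while map_rewards[row, col] != -1:
        row, col = np.random.randint(8), np.random.randint(8)
    return row, col
```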

Then a for loop that runs 1000 times to train the agent first gets a starting position for the agent from the start_pos( ) function. Then a while loop runs until the agent steps into a terminal cell, i.e. an illegal (red) cell or the packaging area. The following steps, which are essential for Q-Learning, are performed inside the while loop.

1. From the current row and column, the next action is selected using the Epsilon Greedy Strategy.

2. The current row and current column are saved as the old row and old column.

3. The location_next( ) function gives the next location based on the selected action.

4. The reward for this new location is read from the map_rewards array.

5. For the action performed in the old row and column, the current Q value is read from the Q-table.

6. Using the reward obtained in step 4 and the Q value of the old location obtained in step 5, the temporal difference is calculated. This calculation also requires the value of gamma and the maximum Q value at the agent's new position.

7. Using the temporal difference, the Q value of the old location and alpha, a new Q value is calculated.

8. This new Q value is written into the Q-table.

To understand these steps fully, refer to my previous article, "Reinforcement Learning 7: Q-Learning". The relevant equation from that article, i.e. equation 3, is given below for reference.
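In standard notation, the Q-Learning update has the form:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

where s and a are the old location and the action taken there, s′ is the new location, r is the reward obtained at s′, α is the learning rate, γ is the discounting factor, and the term in square brackets is the temporal difference.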

This equation, which updates the values in the Q-table, is what trains the agent with reinforcement learning, and it is coded in Python using the steps given above.
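A minimal sketch of that training loop, assuming the map_rewards array and the helper functions sketched earlier, could look like this:

```python
for episode in range(1000):
    row, col = start_pos()                          # random legal starting cell
    while map_rewards[row, col] == -1:              # stop on a red cell or the packaging area
        action = action_next(row, col, epsilon)     # step 1: Epsilon Greedy action selection
        old_row, old_col = row, col                 # step 2: remember the old location
        row, col = location_next(row, col, action)  # step 3: move to the next location
        reward = map_rewards[row, col]              # step 4: reward for the new location
        old_q = q_table[old_row, old_col, action]   # step 5: Q value of the old state-action pair

        # step 6: temporal difference, using gamma and the best Q value at the new location
        temporal_difference = reward + gamma * np.max(q_table[row, col]) - old_q

        # steps 7 and 8: compute the new Q value and write it back into the Q-table
        q_table[old_row, old_col, action] = old_q + alpha * temporal_difference
```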

After training, when the final_path( ) function is called with any starting location we choose for the agent, it returns the best path for the agent (the pick and place robot in our case). Now that training is over and it is time to get the results, the action_next( ) function is called with an epsilon value of 1 so that it only exploits the environment. Suitable actions are chosen from the learned Q-table and the best path is returned as a series of code names. The reverse mapping from location indices to code names is done by the get_codeNames_path( ) function.
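Again as a sketch (the function names come from the repository, but the bodies here are my assumptions built on the helpers above):

```python
def final_path(start_row, start_col):
    """Follow the learned Q-table greedily (epsilon = 1, pure exploitation)
    from the chosen starting cell until a terminal cell is reached."""
    path = [(start_row, start_col)]
    row, col = start_row, start_col
    while map_rewards[row, col] == -1:
        action = action_next(row, col, epsilon=1)   # always the greedy action
        row, col = location_next(row, col, action)
        path.append((row, col))
    return path

def get_codeNames_path(path):
    """Translate a list of (row, column) indices into the code names of figure 1."""
    return [code_names[cell] for cell in path]

# Example: best path for a robot starting at index (3, 3), printed as code names
print(get_codeNames_path(final_path(3, 3)))
```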

Go through this explanation a few times to get a full understanding of the code available on my GitHub.

Results

Finally, when the code is tested with multiple random starting locations for the robot, accurate results are obtained.

1. For example, if the starting location of the robot is given as (3, 3), it chooses the path shown in the image above. I have drawn this path freehand in blue on the warehouse map, as shown below, for demonstration.

2. The second input is given as (5, 0), and the output looks like this:

3. The third input and its corresponding output are shown below.

More sample inputs and their output paths are given on GitHub.

I hope this implementation of Q-Learning is quite clear to you now. Please feel free to give suggestions or feedback, or ask queries.

Happy Coding !!!
