Deep Imitation Learning: Teaching a Robot to Behave Like a Human

by Jonas Hansen

Figure 1. The dual-arm, teleoperated autonomous robot (“Auto”) that I worked on this
summer at the Intelligent Systems and Informatics Laboratory at The University of Tokyo.


Have you ever wondered how modern robots are able to perform complex tasks autonomously, tasks usually reserved for humans? From preparing salads to creatively painting canvases, how can a robot mimic human behavior? Well, I used to wonder the same thing. It turns out that there is a method of machine learning in the artificial intelligence world called deep imitation learning, whereby a robot can be trained to perform a task by mimicking the approach an external entity might use to perform the same task. Often, this external entity takes the form of a human. For example, if a person wanted to teach a robot to make a salad, the person could set up a system where they themselves would prepare the salad over and over again to develop a model for how to perform the salad-preparation task. This model would then be uploaded to the robot such that it could essentially copy the task the same way the person did it.

Deep imitation learning was at the heart of the research I conducted this summer at the University of Tokyo’s Intelligent Systems and Informatics (ISI) Laboratory, led by Professor Yasuo Kuniyoshi. Over the course of two months, I worked with a PhD student, Kim-san, at the lab on a novel dual-arm teleoperated robot system to use deep imitation learning to train the robot to perform certain tasks. The majority of the research involved writing code and developing software for the robot system, although I also worked on hardware during the development process. The following sections describe the specific tasks I worked on, as well as the methodology and the many problems I encountered along the way.


The main space I was working in during my research was the robotics lab, which housed all kinds of gadgets and robot systems, such as a badminton-playing robot, a walking bipedal robot, and a robot arm with a fully-functioning five-fingered hand as its end effector. The robot system I worked on though was a dual-arm teleoperated robot, that had two UR5 robot arms with simple grasping grippers as end effectors and a 3D stereo camera on a dual-axis gimbal system for vision (see Figure 1). We’ll call this robot “Auto.” Next to this robot sat a very similar-looking robot (two arms, grippers, camera system), but this was a custom-built “skeleton” meant to be operated by a human. We’ll call this one “Skeleto.” To operate Auto, a person would sit in a chair in Skeleto, put on a VR headset, and grab Skeleto’s two arms. When the person moved their head, the camera system on the other robot, Auto, would move accordingly, so the person would see what Auto was seeing through the VR headset attached to Skeleto. Similarly, whatever motion the person would do with Skeleto’s arms, the arms of Auto would copy synchronously.

By having this setup, the person could use Skeleto to make Auto perform certain tasks over and over again, and then use this data to train Auto to perform those tasks by itself (without any teleoperation from Skeleto). On top of the deep imitation learning aspect, the other novelty in this system lies in the fact that the VR headset is able to track the gaze of the person, so Auto is able to learn to “see” what objects to focus on based on the person’s gaze point. This of course adds an additional layer of complexity, however it also makes the system extremely versatile for learning to perform all sorts of tasks.

Task 1 – Orientation and Operation

As one might imagine, being able to confidently operate the robot system described above is no easy task. The first few days of the internship were solely dedicated to familiarizing myself with the robot system and related software. As Kim-san showed me how to set up and operate different systems, I took pictures and kept a step-by-step log of the instructions. I also operated the robot for hours on end, performing different tasks such as moving objects, stacking objects, tying knots, threading needles, unpeeling bananas, etc. By the end of these first few days, I put together an operation and debugging manual for the robot outlining these instructions in detail,

as well as what to do if certain common errors would arise. This manual will now be used by interns/researchers hoping to use the same robot system in the future.

Task 2 – Project Brainstorm and Outline

Working with Kim-san, I brainstormed ideas for what my research should focus on. We discussed the current limitations of the robot, and one such limitation is that even though the camera system has the ability to rotate along two axes, the robot’s current framework is only able to use vision with a static camera system. This is extremely problematic because it limits the robot’s workspace to the static image dimensions of the camera. So for example, even if the task was something trivial like moving an apple, if the apple was placed outside the field of view of the camera, the robot would never be able to grab it because it would never be seen by the camera. We decided that this would be an excellent research area for me to explore, and hopefully be able to tackle this shortcoming by the end of the internship. As it turns out, this problem proved to be extremely challenging, and required a very unique approach that had not been done before.

Task 3 – Research and Mathematical Theory

Once the overall project goal was set, the research phase could commence. This involved an extensive deep dive into the online world of academia, reading countless research papers and past projects to see if similar deep imitation learning techniques had been used for other robotic systems. While deep imitation learning has certainly been applied in many other systems, the novelty of this robot system made it extremely difficult to find anything truly similar. So after spending nearly a week trying and failing to find past work to piggyback off of, my worst fear was realized: I would have to actually sit down and do out all the math from scratch. For the next two weeks, I spent late nights in the lab trying to work out the relevant geometry, trigonometry, and state space calculations to enable the robot to perform trained tasks in a translated coordinate system. In other words, I had to figure out how to adapt the current mathematical framework of the system to a mobile camera system—if the camera rotated, the coordinate system would have to rotate as well. The optimal solution I found to this problem was to calculate the global real-world coordinates of the target object (e.g. the apple to be picked up, the banana to be unpeeled, etc.) and to then transform the robot’s coordinate system in relation to these global coordinates. Eventually, after a great deal of testing and debugging, this approach proved to be successful, meaning we could move on to the training and data collection phase.

Task 4 – Robot Training

With the mathematical framework in place and the software adjusted accordingly, we began to test the robot with simple tasks. We started with simply reaching and picking up a kiwi that was placed outside the static field of view of the camera. I would spend several hours performing this

same task over and over again on Skeleto, turning my head to first find the kiwi, adjust my gaze to look directly at the kiwi, then reaching out and grabbing the kiwi. After completing this training and uploading it to Auto, we had Auto try the task on its own. The good news? The camera system successfully rotated and found the kiwi. The bad news? The robot arm would completely miss the kiwi when trying to reach it. After trying to make slight tweaks to the code for the robot arm, I realized the problem: my new global object coordinates code was completely wrong. In testing it, I was using a separate camera not attached to the robot that had different settings and software in place. It didn’t occur to me that the camera hooked up to Auto might have a different software setup. Luckily, this wasn’t too big of an adjustment, so within a day or two, I was able to update the code to work for Auto’s camera system. Testing it again, the arm performed much better, reaching the kiwi most of the time while maintaining the correct coordinate system transformation.

Task 5 – Improvements

One thing I noticed during the training phase (where I would be repeating the same task over and over again on Skeleto) was that Skeleto’s arms and grippers were very awkward to hold and manipulate. So I brainstormed ways of making it more user-friendly and designed a 3D model of a gripper system using a Computer Aided Design (CAD) software called OnShape. Using the lab’s 3D printers, I printed this new gripper system and attached one to each of Skeleto’s two existing grippers to make it much easier to operate. Indeed, this sped up the training process, and made Kim-san very happy.

Reflection and Future Work

Looking back on this research, I am extremely pleased with how things turned out, and had one of the best internship experiences I have ever had. Not only was the research fun and engaging, but also challenging in a way that pushed me to think creatively and apply my past knowledge and experience in a completely new way. It is refreshing to think about how this research truly was unique and original, which I did not expect to accomplish at this early stage in my educational career.

Due to the novelty of this research, Kim-san and I will be putting together a research paper of our work and findings, and publishing it to a few different engineering and computer science journals. I look forward to continuing to engage with this research by writing this paper with Kim-san, not to mention gain valuable experience in how to write and publish professionally. After embarking on such an incredible journey at the University of Tokyo, I most certainly see myself returning there in the future to pursue similarly enriching research and to engage with some of the most brilliant minds in the world of engineering and robotics.