A new system from MIT helps robots understand and respond to voice commands – putting us one step closer to an Alexa with arms.

Though the field of robotics is rapidly advancing, in many ways consumer-ready robotic technologies are still far less sophisticated than AI-powered voice assistants like Siri or Alexa.
Robots can be programmed to execute repetitive tasks or movements, but their usefulness typically stops there. Since most robots (even industrial ones) lack language processing abilities, we’re still far from a future where we regularly interact with robots in ways that resemble natural human communication.
But a team of computer science researchers just pushed us closer to that reality: A new paper out of MIT presents an Alexa-like system that allows robots to understand a wide range of commands that require contextual knowledge about objects and their surroundings.
The researchers’ “ComText” system goes beyond most robot-learning advances to date – enabling robots to develop the “episodic memory” needed to reason, infer meaning, and respond to nuanced commands.
Traditional approaches to robot learning focus mainly on semantic memory, which is based on general facts such as “the sky is blue.”
Training semantic memory can allow robots to execute basic commands: With a literal understanding of distance and direction, for example, a robot can respond to a request to move two feet to the right or left.
Requests involving contextual knowledge demand a cognitive skill set most robots lack, however. Consider a command like “grab my jacket.” Even if a robot knows what a jacket is, the idea of possession – of retrieving your jacket – demands a more nuanced, human level of understanding than the robot has.
“The main contribution is this idea that robots should have different kinds of memory, just like people,” says research scientist Andrei Barbu.
ComText helps close that gap by training robots to develop both semantic and episodic memory. Episodic memory is based on past experiences that can inform future decisions.
Possession is one example: A robot with episodic memory that is told “this jacket is mine” can retain that information and draw on it when responding to later commands, even when those commands are phrased differently from the statement that originally established who owns the jacket.
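To make that idea concrete, here is a minimal, hypothetical sketch – not the authors’ implementation – of how an episodic store of object facts might be recorded and later queried to resolve a possessive command like “grab my jacket.” All class, method, and object names below are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EpisodicMemory:
    """Toy store of facts asserted or observed about objects over time."""

    # Maps an object ID to its known attributes, e.g.
    # {"jacket_1": {"type": "jacket", "owner": "alice", "location": "chair"}}
    facts: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def record(self, obj_id: str, **attributes: str) -> None:
        """Store what was said or seen about an object ("this jacket is mine")."""
        self.facts.setdefault(obj_id, {}).update(attributes)

    def resolve(self, obj_type: str, owner: Optional[str] = None) -> List[str]:
        """Find objects that match a later command such as "grab my jacket"."""
        matches = []
        for obj_id, attrs in self.facts.items():
            if attrs.get("type") != obj_type:
                continue
            if owner is not None and attrs.get("owner") != owner:
                continue
            matches.append(obj_id)
        return matches


# Earlier interaction: the human says "this jacket is mine" while indicating jacket_1.
memory = EpisodicMemory()
memory.record("jacket_1", type="jacket", owner="alice", location="chair")

# Later command: "grab my jacket" -- the possessive is resolved from the stored episode.
print(memory.resolve("jacket", owner="alice"))  # ['jacket_1']
```

The point of the sketch is simply that a fact stated once, in one phrasing, can be reused to interpret later commands phrased another way.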
The new Alexa-like solution makes that kind of cognition possible. Developed by researchers in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), ComText enables robots to understand the world as a collection of objects, people, and abstract concepts (such as position and possession).
Named for “commands in context,” ComText debuted in a research paper developed by MIT researchers Rohan Paul, Andrei Barbu, Sue Felshin, Boris Katz, and Nicholas Roy for the 26th International Joint Conference on Artificial Intelligence.
The work helps robots “ground” (that is, understand and interpret) natural language instructions. It introduces a learning model – referred to as Temporal Grounding Graphs – whereby robots continually acquire “higher-order semantic knowledge” about their environments and the objects within them.
A robot equipped with the ComText system learns as it goes: As commands and responses are exchanged between humans and the robot, it retains more and more experiential knowledge – as well as more visual and linguistic context – from past commands.
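As a rough illustration of that accumulating context – again a hypothetical sketch, not the published Temporal Grounding Graphs model – one can picture a running record of objects the robot has seen or heard about, with the most recent entries used to resolve references like “it” in a follow-up command. All names here are made up for the example.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class InteractionContext:
    """Toy rolling record of objects mentioned or observed during a dialogue."""

    def __init__(self, max_history: int = 20) -> None:
        # The most recent entries are the most salient referents for words like "it".
        self.history: Deque[Tuple[str, str]] = deque(maxlen=max_history)

    def observe(self, obj_id: str, description: str) -> None:
        """Add an object the robot has just seen or heard about."""
        self.history.append((obj_id, description))

    def ground_pronoun(self) -> Optional[str]:
        """Resolve "it" to the most recently salient object, if any."""
        return self.history[-1][0] if self.history else None


ctx = InteractionContext()
ctx.observe("box_7", "the box I put down")   # human puts down a box
ctx.observe("snack_2", "my snack")           # human: "this is my snack"
print(ctx.ground_pronoun())                  # 'snack_2' -> the target of "pick it up"
```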
“For robots to understand what we want them to do, they need a much richer representation of what we do and say,” says Paul.
In a video released by the MIT media team, we see Baxter (a two-armed humanoid robot) understanding various items described as “my snack,” and responding to commands to “pick it up” or “pack up my snack.” With ComText, Baxter was successful in executing the right command about 90% of the time.
That kind of cognition may seem simple, but it gets us one step closer to a world where a robot packs your snacks (or grabs your jacket). As the technology develops further, a robot’s greater understanding of placement and possession could mean it can help you find your keys, since it could remember where you left them.
That kind of understanding is the researchers’ next step. In the future, the MIT team hopes to enable robots to understand more complicated information – such as multi-step commands, or the intentions behind actions.
If they succeed, the ability to interact with robots in a more human-like way could go well beyond a helpful AlexaBot around the house: In industrial settings, for example, robots will prove far more useful to people if they understand everyday voice commands. With ComText, simple instructions like “Go get the last box I put down” could be carried out by a robot (instead of taking up a person’s time).
The project for “Temporal Grounding Graphs for Language Understanding with Accrued Visual-Linguistic Context” was co-led by research scientist Andrei Barbu, alongside research scientist Sue Felshin, senior research scientist Boris Katz, and Professor Nicholas Roy. They presented the paper at last week’s International Joint Conference on Artificial Intelligence (IJCAI) in Australia.