A one-armed robot stood in front of a table. On the table sat three plastic figurines: a lion, a whale and a dinosaur. An engineer gave the robot an instruction: “Pick up the extinct animal.”
The robot whirred, then its arm extended and its claw opened. It grabbed the dinosaur.
Until very recently, this would have been impossible. Robots weren’t able to reliably manipulate objects they hadn’t seen before, and certainly weren’t capable of making the logical leap from “extinct animal” to “plastic dinosaur”.
A quiet revolution is underway in robotics, one that piggybacks on recent advances in so-called large language models. Google has begun plugging state-of-the-art language models into its robots, giving them the equivalent of artificial brains. I got a glimpse of that progress during a private demonstration of Google’s latest robotics model, called RT-2.
“We’ve had to reconsider our entire research programme as a result of this change,” said Vincent Vanhoucke, Google DeepMind’s head of robotics. “A lot of the things that we were working on before have been invalidated.”
Robots still fall short of human-level dexterity and fail at some basic tasks, but Google’s use of AI language models to give robots new skills of reasoning and improvisation represents a promising breakthrough, said Ken Goldberg, a robotics professor at the University of California, Berkeley, US. “What’s very impressive is how it links semantics with robots,” he said.
For years, the way engineers at Google and other companies trained robots to do a mechanical task — flipping a burger, for example — was by programming them with a specific list of instructions. Robots would then practise the task again and again, with engineers tweaking the instructions each time until they got it right.
But training robots this way is slow and labour-intensive. It requires collecting lots of data from real-world tests. And if you wanted to teach a robot to do something new — to flip a pancake instead of a burger, say — you usually had to reprogram it from scratch.
Researchers at Google had an idea. What if, instead of being programmed for specific tasks one by one, robots could use an AI language model — one that had been trained on vast swaths of Internet text — to learn new skills?
“We started playing with these language models around two years ago, and then we realised that they have a lot of knowledge in them,” said Karol Hausman, a Google research scientist.
Google’s first attempt to join language models and physical robots was a research project called PaLM-SayCan. But its usefulness was limited. The robots couldn’t interpret images, a crucial skill if you want them to navigate the world.
Google’s new robotics model, RT-2, can do just that. It’s what the company calls a “vision-language-action” model, or an AI system that has the ability not just to see and analyse the world around it, but to tell a robot how to move.
It does so by translating the robot’s movements into a series of numbers — a process called tokenising — and incorporating those tokens into the same training data as the language model. Eventually, just as ChatGPT or Bard learns to guess what words should come next in a poem or an essay, RT-2 can learn to guess how a robot’s arm should move to pick up a ball or throw an empty soda can into the recycling bin.
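The article doesn’t spell out what those number tokens look like, but a rough way to picture the idea is the sketch below: each continuous motor value is snapped to one of a fixed set of integer bins, so a robot action becomes a short string of “words” that a language model can read and predict like any other text. The dimension names, value ranges and bin count here are illustrative assumptions, not Google’s actual configuration.

```python
# A minimal sketch of action tokenising, assuming each continuous action
# dimension is discretised into a fixed number of integer bins so that robot
# actions can share a vocabulary with ordinary text tokens.
# The action dimensions, ranges and bin count are hypothetical.

import numpy as np

NUM_BINS = 256  # assumed number of discrete values per action dimension

# Hypothetical action space: end-effector position/rotation deltas + gripper.
ACTION_RANGES = {
    "dx": (-0.05, 0.05),
    "dy": (-0.05, 0.05),
    "dz": (-0.05, 0.05),
    "droll": (-0.25, 0.25),
    "dpitch": (-0.25, 0.25),
    "dyaw": (-0.25, 0.25),
    "gripper": (0.0, 1.0),
}

def tokenise_action(action: dict) -> str:
    """Map each continuous action value to an integer bin and join the bins
    into one string, so the action reads like a short sequence of tokens."""
    bins = []
    for name, (lo, hi) in ACTION_RANGES.items():
        value = float(np.clip(action[name], lo, hi))
        bin_id = int(round((value - lo) / (hi - lo) * (NUM_BINS - 1)))
        bins.append(str(bin_id))
    return " ".join(bins)

def detokenise_action(token_string: str) -> dict:
    """Invert the mapping: turn predicted bin tokens back into motor values."""
    bin_ids = [int(tok) for tok in token_string.split()]
    action = {}
    for (name, (lo, hi)), bin_id in zip(ACTION_RANGES.items(), bin_ids):
        action[name] = lo + bin_id / (NUM_BINS - 1) * (hi - lo)
    return action

if __name__ == "__main__":
    move = {"dx": 0.02, "dy": -0.01, "dz": 0.0,
            "droll": 0.0, "dpitch": 0.0, "dyaw": 0.1, "gripper": 1.0}
    tokens = tokenise_action(move)      # a string of seven bin indices
    print(tokens)
    print(detokenise_action(tokens))    # approximately recovers the command
```

Once actions are written this way, predicting “what the arm does next” becomes the same kind of next-token problem the underlying language model already solves for words.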
In an hourlong demonstration, which took place in a Google office kitchen littered with objects from a dollar store, my podcast co-host and I saw RT-2 perform a number of impressive tasks. One was successfully following complex instructions such as “move the Volkswagen to the German flag,” which RT-2 did by finding and snagging a model VW Bus and setting it down on a miniature German flag several feet away.
It could also follow instructions in languages other than English and make abstract connections between related concepts. When I wanted it to pick up a soccer ball, I told it to “pick up Lionel Messi”. RT-2 got it right on the first try.
“This really opens up using robots in environments where people are,” Vanhoucke said. “In office environments, in home environments, in all the places where there are a lot of physical tasks.”