Physical Intelligence, the two-year-old, San Francisco-based robotics startup that has quietly become one of the most closely watched AI companies in the Bay Area, published new research Thursday showing that its latest model can direct robots to perform tasks they were never explicitly trained on, a capability the company’s own researchers say caught them off guard.
The new model, called π0.7, represents what the company describes as an early but significant step toward the long-sought goal of a general-purpose robot brain: one that can be pointed at an unfamiliar task, coached through it in plain language, and actually pull it off. If the findings hold up to scrutiny, they suggest that robotic AI may be approaching an inflection point similar to what the field saw with large language models, where capabilities begin compounding in ways that outpace what the underlying data would seem to predict.
But first, the core claim in the paper: compositional generalization, the ability to combine skills learned in different contexts to solve problems the model has never encountered. Until now, the standard approach to robot training has been essentially rote memorization: collect data on a specific task, train a specialist model on that data, then repeat for each new task. π0.7, Physical Intelligence says, breaks that pattern.
“Once it crosses that threshold where it goes from only doing exactly the stuff that you collect the data for to actually remixing things in new ways,” says Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor focused on AI for robotics, “the capabilities are going up more than linearly with the amount of data. That much more favorable scaling property is something we’ve seen in other domains, like language and vision.”
The paper’s most striking demonstration involves an air fryer the model had essentially never seen in training. When the research team investigated, they found only two related episodes in the entire training dataset: one in which a different robot simply pushed the air fryer closed, and one from an open-source dataset in which another robot placed a plastic bottle inside one on someone’s instructions. The model had somehow synthesized these fragments, plus broader web-based pretraining data, into a functional understanding of how the appliance works.
“It’s very hard to track down where the knowledge is coming from, or where it will succeed or fail,” says Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Still, with zero coaching, the model made a passable attempt at using the appliance to cook a sweet potato. With step-by-step verbal instructions (essentially, a human walking the robot through the task the way you might explain something to a new employee), it performed the task successfully.
That coaching capability matters because it suggests robots could be deployed in new environments and improved in real time without additional data collection or model retraining.
So what does it all mean? The researchers aren’t shy about the model’s limitations and are careful not to get ahead of themselves. In at least one case, they point the finger squarely at their own team.
“Sometimes the failure mode is not on the robot or on the model,” Balakrishna says. “It’s on us. Not being good at prompt engineering.” He describes an early air fryer experiment that produced a 5% success rate. After spending about half an hour refining how the task was explained to the model, the rate jumped to 95%, he says.

The model also isn’t yet capable of executing complex multi-step tasks autonomously from a single high-level command. “You can’t tell it, ‘Hey, go make me some toast’,” Levine says. “But if you walk it through, ‘for the toaster, open this part, push that button, do this,’ then it actually tends to work pretty well.”
The team also acknowledged that standardized benchmarks for robotics don’t really exist, which makes external validation of their claims difficult. Instead, the company measured π0.7 against its own earlier specialist models, purpose-built systems trained on individual tasks, and found that the generalist model matched their performance across a range of complex work, including making coffee, folding laundry, and assembling boxes.
What may be most notable about the research, if you take the researchers at their word, isn’t any single demo but the degree to which the results surprised them, people whose job it is to know exactly what’s in the training data and therefore what the model should and shouldn’t be able to do.
“My experience has always been that when I deeply know what’s in the data, I can kind of just guess what the model will be able to do,” Balakrishna says. “I’m rarely surprised. But the last few months have been the first time where I’m genuinely surprised. I just bought a gear set randomly and asked the robot, ‘Hey, can you rotate this gear?’ And it just worked.”
Levine recalled the moment researchers first encountered GPT-2 producing a story about unicorns in the Andes. “Where the heck did it learn about unicorns in Peru?” he says. “That’s such a weird combination. And I think that seeing that in robotics is really special.”
Naturally, critics will point to an uncomfortable asymmetry here: language models had the entire internet to learn from. Robots don’t, and no amount of clever prompting fully closes that gap. But when asked where he expects the skepticism, Levine points somewhere else entirely.
“The criticism that can always be leveled at any robotic generalization demo is that the tasks are kind of boring,” he says. “The robot is not doing a backflip.” He pushes back on that framing, arguing that the distinction between an impressive robot demo and a robotic system that truly generalizes is precisely the point. Generalization, he suggests, will always look less dramatic than a carefully choreographed stunt, but it is considerably more useful.
The paper itself uses careful hedging language throughout, describing π0.7 as showing “early signs” of generalization and “initial demonstrations” of new capabilities. These are research results, not a deployed product, and Physical Intelligence has been restrained from the start about commercial timelines.
When asked directly when a system based on these findings might be ready for real-world deployment, Levine declines to speculate. “I think there’s good reason to be optimistic, and certainly it’s progressing faster than I expected a couple of years ago,” he says. “But it’s very hard for me to answer that question.”
Physical Intelligence has raised over $1 billion to date and was most recently valued at $5.6 billion. A large part of the investor enthusiasm around the company traces to Lachy Groom, a co-founder who spent years as one of Silicon Valley’s most well-regarded angel investors, backing Figma, Notion, and Ramp, among others, before deciding that Physical Intelligence was the company he’d been looking for. That pedigree has helped the startup attract serious institutional money even as it has refused to offer investors a commercialization timeline.
The company is now said to be in discussions for a new round that would nearly double that valuation to $11 billion. The team declined to comment.
