Dear Sentinels
What a week it has been! As you may recall, I started at the University of Southampton earlier this month, and my first deliverable is looming just four days away (not that I’m counting). Of course, as I mentioned when I broke the news about the new job, I wrote this last week, so if anything goes wrong, blame my past self, argh! I’ve also thrown my hat in the ring for five years of funding, fingers crossed, and I’ll keep you posted if the universe is feeling generous. This week, we’re dipping our toes into the world of Large Language Models for robotics. Exciting stuff, and no robots were harmed in the making of this edition.
Vision–language–action (VLA) models are causing quite the stir in the world of robotics. Gone are the days when robots needed separate systems for seeing, thinking, and moving; VLA models bundle it all together in one neat package. The secret sauce? They train on camera feeds, natural language instructions, and the resulting actions, all at the same time. This means our robot friends can finally connect what they see, what you say, and what they’re supposed to do, with no more endless lines of task-specific code (and no more late-night debugging sessions, thank goodness). The real magic happens when you feed these models mountains of data from the internet, giving them a surprisingly broad grasp of the world. Add a dash of imitation or reinforcement learning, and suddenly they’re not just moving, but actually figuring things out in the real world. The upshot: robots that aren’t doomed to a life of repetitive tasks, but can adapt on the fly to new places, objects, and jobs, even if they’ve never seen them before.
In the investigative article up next, we’ll take a closer look at how language and motion are coming together in robotics, no interpretive dance required. After that, the academic article will dig into what really matters when building vision–language–action models for generalist robots. But before we get too serious, let’s see what oddities the web has thrown up for us this week.
News from around the web!
The Convergence of Language and Motion
Robotics is having a bit of a moment. Gone are the days of robots with separate brains for seeing, thinking, and moving. Enter Vision-Language-Action (VLA) models, which bundle all that into one clever package. Think of it as the Swiss Army knife of robot intelligence. We started with Large Language Models (LLMs) that could juggle words, then moved to Vision-Language Models (VLMs) that could also make sense of pictures. Now, with VLAs, robots can take in text, images, and even their own joint positions, and then figure out exactly how to move. The big idea is to ditch the old, rigid way of programming robots for one job at a time. Instead, we get a flexible system that can handle all sorts of tasks and environments, all with one 'brain' in charge. It’s a bit like giving your Roomba a PhD and a sense of adventure.
Why bother with a unified architecture? Well, it lets robots treat all their different senses as one big pool of information, with a transformer model acting as the brains of the operation. The robot’s camera feeds get squished down into neat little bundles (thanks to a visual backbone), which can then be mixed with language and the robot’s own sense of its limbs. All this gets processed together, so the robot can use its 'world knowledge', gleaned from trawling the internet, not unlike a student before an exam... 🤫 For example, if it’s seen a million pictures of cups, it’ll know what to do with one in a new kitchen, even before it’s moved a muscle.
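To make the “one big pool of information” idea concrete, here is a toy numpy sketch. Nothing in it comes from a real model: the encoders are random projections and every size is invented. The point is the assembly step: camera patches, instruction words, and joint angles are all projected into the same token space, so a single self-attention layer can let pixels, words, and limbs talk to each other directly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width (an arbitrary choice for this sketch)

def project(x, out_dim):
    """Random linear projection standing in for a learned encoder."""
    W = rng.standard_normal((x.shape[-1], out_dim)) / np.sqrt(x.shape[-1])
    return x @ W

def self_attention(tokens):
    """Single-head self-attention: every token mixes with every other token."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Stand-in inputs: 16 camera patches, 5 instruction words, 1 proprioception reading.
image_patches = rng.standard_normal((16, 192))   # flattened camera patches
lang_tokens   = rng.standard_normal((5, 128))    # embedded instruction words
proprio       = rng.standard_normal((1, 7))      # e.g. 7 joint angles

# The key move: every modality becomes tokens in the SAME space...
tokens = np.concatenate([
    project(image_patches, D),
    project(lang_tokens, D),
    project(proprio, D),
])

# ...so one attention layer can relate what it sees, hears, and feels.
fused = self_attention(tokens)
print(fused.shape)   # (22, 64): 16 + 5 + 1 tokens, all mutually attended
```

In a real VLA the projections are learned encoders and there are many stacked layers, but concatenating all modalities into one token sequence is the heart of the unified architecture.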
So, how does a robot go from thinking about a cup to actually picking it up? There are two main tricks. The first is to sneak action commands into the robot’s vocabulary, so it treats 'move arm left' a bit like it would treat a rare word in a novel. The second, fancier method uses a special bit of the model (an 'action head') to turn noisy guesses into smooth, precise movements: think of it as the robot’s equivalent of practising until it gets it right. The clever bit is that all this is guided by the robot’s high-level reasoning, so it doesn’t just flail about. Instead, it moves with purpose, like a caffeinated chess grandmaster.
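The first trick, folding actions into the vocabulary, boils down to discretising each continuous command into a bin id, in the spirit of the binning used by models like RT-2. The bin count and range below are assumptions of this sketch, not anyone’s published settings:

```python
import numpy as np

N_BINS = 256            # vocabulary slots reserved for actions (an assumption)
LOW, HIGH = -1.0, 1.0   # normalised action range (also an assumption)

def action_to_tokens(action):
    """Discretise each action dimension into one of N_BINS 'words'."""
    a = np.clip(np.asarray(action, dtype=float), LOW, HIGH)
    return np.round((a - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(int)

def tokens_to_action(tokens):
    """Invert the binning: map token ids back to continuous commands."""
    return LOW + np.asarray(tokens) / (N_BINS - 1) * (HIGH - LOW)

# A 7-DoF arm command ends up as a handful of token ids the model can 'say'.
cmd = np.array([0.12, -0.5, 0.9, 0.0, -1.0, 0.33, 1.0])
tokens = action_to_tokens(cmd)
decoded = tokens_to_action(tokens)

# Round-tripping loses at most half a bin width of precision.
print(np.abs(decoded - cmd).max() <= (HIGH - LOW) / (N_BINS - 1) / 2 + 1e-9)  # True
```

The second trick replaces this vocabulary hack with a dedicated action head that regresses (or iteratively denoises) continuous values directly, trading the simplicity above for smoother, higher-precision motion.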
If you like a good taxonomy (and who doesn’t?), the way robots learn to act has gone through five main stages. First, there were the old-school systems, where seeing, planning, and moving were all separate, and everything had to be hand-tuned: think of it as the robotics equivalent of doing your own taxes. Then came the end-to-end neural networks, where robots learned directly from data, but only for very specific tasks. It was a bit like teaching a dog to fetch only one particular stick and being surprised when it ignores all the others.
Right now, we’re in the middle of what you might call the fine-tuning revolution. Think of it as the BERT (Bidirectional Encoder Representations from Transformers) moment for robots. By tweaking big, pre-trained models with just a bit of robot-specific data, you can get impressive results without needing a supercomputer or a team of PhDs. Open-source projects like Pi Zero, Gr00t, and SmolVLA are making this possible for the rest of us, not just the big labs. Next up is the dream of universal robot control: one model to rule them all, whether it’s a warehouse bot or a kitchen helper. And finally, the holy grail: plug-and-play robots that can handle new shapes and jobs straight out of the box, just by being told what to do in plain English. It’s the ChatGPT moment for robotics, and yes, it’s as exciting as it sounds. Although we’re not quite there... yet.
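As a caricature of what that fine-tuning looks like, here is a numpy sketch: a frozen stand-in “backbone” provides features, and only a small action head is trained on a handful of stand-in demos. All shapes, data, and the training loop are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen stand-in for a pre-trained VLM backbone: W_frozen is never updated.
W_frozen = rng.standard_normal((32, 64)) / np.sqrt(32)
def backbone(obs):
    return np.tanh(obs @ W_frozen)

# Tiny trainable head mapping features to a 7-DoF action (sizes are assumptions).
W_head = np.zeros((64, 7))

# A 'small robot-specific dataset': stand-in teleop observations and actions.
obs     = rng.standard_normal((128, 32))
actions = rng.standard_normal((128, 7))

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

loss_before = mse(backbone(obs) @ W_head, actions)
for _ in range(300):                 # plain gradient descent on the head only
    feats = backbone(obs)
    grad = feats.T @ (feats @ W_head - actions) / len(obs)
    W_head -= 0.1 * grad
loss_after = mse(backbone(obs) @ W_head, actions)

print(loss_after < loss_before)   # the head adapts; the backbone is untouched
```

Because only the small head moves, this runs on modest hardware while the backbone’s internet-scale 'world knowledge' stays intact, which is exactly the appeal of the fine-tuning recipe.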
Of course, none of these patterns work without good data: a VLA’s intelligence is a direct reflection of its training history. It turns out robots learn more from their mistakes than from getting everything right the first time, just like the rest of us. So, when collecting data by remote control, it’s actually helpful to let the robot mess up and then show it how to recover. This way, it learns to cope with the real world, which, as we know, is rarely perfect. So, don’t worry about collecting flawless demos; a bit of chaos is good for the curriculum. It’s this kind of accessible data collection that lets practitioners graduate from beginner experiments to VLAs that can execute complex, multi-stage tasks.
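A minimal sketch of the idea, with entirely made-up labels: instead of curating fumbles out of the dataset, keep them, together with the operator’s recovery, as training signal.

```python
# A teleop episode as (observation, action, label) steps; labels are invented.
episode = [
    ("approach cup",  "reach",   "nominal"),
    ("missed grasp",  "regrasp", "recovery"),   # the operator fumbles...
    ("cup grasped",   "lift",    "nominal"),    # ...then shows how to fix it
]

def build_training_set(episodes, keep_recoveries=True):
    """Keep imperfect segments instead of curating them away."""
    data = []
    for ep in episodes:
        for obs, act, label in ep:
            if label == "recovery" and not keep_recoveries:
                continue
            data.append((obs, act))
    return data

with_chaos    = build_training_set([episode])
without_chaos = build_training_set([episode], keep_recoveries=False)
print(len(with_chaos), len(without_chaos))   # 3 2
```

The policy trained on `with_chaos` has seen what a failed grasp looks like and what to do next; the one trained on `without_chaos` has only ever seen a world where nothing goes wrong.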
It’s tempting to compare the rise of VLAs to the story of language models, but robotics has its own set of hurdles, like gravity, dodgy sensors, and the occasional hardware tantrum. Things are moving fast and nobody quite knows where we’ll end up, but the promise is huge. If these models keep improving, we might soon have robots that can understand and act in the real world as smoothly as they read a book. Now, if only they could make a decent cup of tea.
Summary
This paper presents RoboVLMs, a unified framework that efficiently transforms pre-trained Vision-Language Models into high-performing Vision-Language-Action models for generalist robot manipulation, guided by systematic architectural and data analyses. Through more than 600 experiments conducted across simulation and real-world benchmarks, the authors identify critical factors, including backbone selection and history integration, that enable state-of-the-art robotic performance and generalisation.
“In this work, we disclose the key factors that significantly influence the performance of VLA on robot manipulation problems…”
Background
Developing generalisable robot policies that can perceive and interact with physical environments remains a significant challenge in robotics. Vision-Language-Action Models (VLAs) have recently emerged as a promising branch of model-free learning, leveraging robust representations from large-scale pre-trained Vision-Language Models. Trained on web-scale multi-modal data, these models facilitate adaptation to diverse open-world scenes, even with limited robot-specific data. However, transferring these pre-trained backbones into high-performing robot policies requires careful architectural decisions. Persistent challenges include selecting suitable vision-language backbones and formulations that optimally leverage multi-modal representations for robotic control. Understanding how these large-scale models support generalist policies is crucial for advancing autonomous manipulation.

The study addresses three essential design choices: selecting the optimal backbone, formulating effective VLA architectures, and determining the appropriate timing for integrating cross-embodiment data. Modern Vision-Language Models differ substantially in visual encoder structures, fusion mechanisms, and data scales, yet their impact on manipulation performance has not been comprehensively studied. While existing robot learning strategies include model-free, model-based, and world-model-based approaches, VLAs provide unique semantic generality. This research aims to serve as a detailed guide for future VLA design through extensive experiments involving eight backbones and four policy architectures. The RoboVLMs framework enables straightforward integration of new models and flexible combinations of design choices. Ultimately, the work aims to establish VLAs as robust generalist policies by identifying the key factors that drive their performance.
“To utilize Foundation Vision Language Models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs…”
Use-case
The RoboVLMs framework is evaluated on a diverse set of robotic manipulation benchmarks, including simulation environments such as CALVIN and SimplerEnv, as well as real-world robot platforms. In simulation, the model is tested on multi-task tabletop manipulation, executing consecutive tasks based on natural-language instructions. Simulated tasks include rotating blocks, moving sliders, opening drawers, and placing objects in containers, all of which require precise motor control and semantic understanding. The framework demonstrates strong generalisation when transferring to novel scenes not encountered during training, significantly outperforming previous state-of-the-art policies. Additionally, it is used to benchmark robot policy success rates in private real-world settings via real-to-sim environments. These applications highlight the framework’s effectiveness in evaluating policy robustness and data efficiency across various task horizons.

In real-world experiments, a 7-DoF Kinova Gen3 robot arm equipped with side and wrist cameras performs over 100 distinct manipulation tasks. The models are evaluated on complex skills such as picking and placing objects, pressing buttons, and opening or closing ovens or drawers. A key capability is the model’s robustness to unseen distractors, novel target objects, and varying backgrounds in physical environments. For instance, the system can “pick up a cucumber from a vegetable basket” or “press a toaster switch” even when these scenarios were not present in the training data. The VLA also demonstrates emergent self-correction, enabling the robot to adjust its trajectory if an initial grasp attempt fails. These features make the RoboVLMs framework highly valuable for deploying robots in dynamic, open-world settings where interpreting and acting upon human instructions is essential.
“RoboVLM outperforms the existing VLAs over all settings, especially for metrics in unseen scenarios, demonstrating the effectiveness and robustness of our model.”
Future Work
The paper concludes that VLAs based on pre-trained VLMs are highly effective for generalist robot policies, with backbones such as KosMos and PaliGemma demonstrating superior performance due to extensive pre-training. The most successful architecture integrates multi-step historical observations and continuous actions via a policy head, thereby enhancing both generalisation and data efficiency. Future research will focus on developing generalist policies capable of handling long-horizon, complex instructions, such as “make breakfast,” through step-by-step reasoning. The authors also plan to investigate advanced action tokenisation techniques and more efficient deployment of large models for real-time control. Open-sourcing the RoboVLMs framework and real-world datasets is expected to accelerate progress in foundational robot models within the research community.
“For future work, we envision several potential directions for advancing generalist robot policies… reasoning through executable actions step by step…”
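The winning recipe, multi-step observation history feeding a continuous-action policy head, can be caricatured in a few lines. This is a generic sketch, not the RoboVLMs implementation: the window size, shapes, and the fixed-weight “head” are placeholders.

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
HISTORY = 4   # number of past observations the policy sees (an assumption)

history = deque(maxlen=HISTORY)   # rolling window of recent observations

def policy_head(stacked_obs):
    """Stand-in continuous head: maps stacked history to a 7-DoF action."""
    W = np.ones((stacked_obs.size, 7)) / stacked_obs.size
    return stacked_obs.reshape(-1) @ W

for t in range(6):                     # a short control loop
    obs = rng.standard_normal(16)      # stand-in per-step observation features
    history.append(obs)
    # Pad with the oldest observation until the window fills up.
    window = list(history)
    while len(window) < HISTORY:
        window.insert(0, window[0])
    action = policy_head(np.stack(window))
print(action.shape)   # (7,)
```

The design choice the paper highlights is visible even here: the head consumes several timesteps at once and emits continuous values directly, rather than predicting one action token from a single frame.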
The report can be found here.


