Dear Sentinels
Well hello there, Sentinels!
Today, we’re taking a leisurely wander through “The Art of Digital Resynthesis”, which, if you ask me, is just a posh way of saying we’re having a natter about diffusion models and robotic AI. First up, we’ve got an investigative piece called “The Geometry of Perception”, followed by a peek at the academic article “M2Diffuser: Diffusion-based Trajectory Optimization for Mobile Manipulation in 3D Scenes”. Diffusion models are now being used to choreograph entire robot dance routines, so our metallic friends can strut their stuff with a bit more flair and, with any luck, fewer collisions with the coffee table. These models are getting rather good at figuring out how robots move, which is quite useful if you’d prefer your robot to glide gracefully instead of tripping over its own wheels.
Some of the highlights here include Multi-Robot Motion Planning (picture a robot conga line where, miraculously, nobody ends up in hospital), Generalisation Capabilities (train your robot in one spot and then watch it tackle new obstacles), and Robotic Manipulation (essentially, teaching robots to pick things up). Naturally, it’s not all tea and biscuits; getting these models to play nicely with a whole crowd of robots takes a mountain of data.
But before we let the robots take over the world, or at the very least, the living room, let’s see what shiny oddities the internet has coughed up for us this week.
News from around the web!
The Geometry of Perception
To get your head around how modern generative AI works, you first need to meet its favourite playground: Image Space. This is where all the digital magic happens, a sort of invisible stage where machines try to make sense of pictures. For us, an image is just a bunch of shapes and colours that hopefully look like something recognisable. For a computer, though, it’s all about numbers and coordinates in a mind-bogglingly high-dimensional space. The trick is that, for the machine, creating an image isn’t about artistic flair, but about figuring out where to plonk a point in this vast space so it looks like something sensible rather than a digital Rorschach test.
In this world, every image lives somewhere inside a one-million-dimensional hypercube. Yes, you read that right: a million. If you’ve ever wondered what a 1000-by-1000-pixel image looks like to a computer, it’s basically a point in this enormous space, with each pixel getting its own axis. Each pixel can be anything from 0 to 255, so the possibilities are, frankly, ridiculous. While we see a cat or a banana, the computer just sees a set of coordinates. Most of this hypercube is just digital gibberish; only tiny clusters actually look like anything. So, while diagrams make it look like you can hop from one good image to another, in reality, it’s more like searching for a needle in a haystack the size of the universe.
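If you fancy seeing what this looks like in practice, here's a minimal Python sketch (using a random stand-in "image" rather than an actual cat, so the numbers are arbitrary):

```python
import numpy as np

# A 1000-by-1000 greyscale image is just a point in a
# 1,000,000-dimensional space: one axis per pixel, each
# coordinate an integer between 0 and 255.
image = np.random.randint(0, 256, size=(1000, 1000), dtype=np.uint8)

point = image.flatten()   # one long vector of coordinates
print(point.shape)        # a single point with 1,000,000 axes
print(point.min(), point.max())  # every coordinate sits in [0, 255]
```

Almost every point you could pick in this cube is static; the "cat cluster" and "banana cluster" are vanishingly small islands within it.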
So, that’s the lay of the land. But the real magic of diffusion models is that they don’t just sit around admiring the scenery, they actually move through this million-dimensional wilderness. The trick is something called the forward diffusion process, which is a fancy way of saying the model watches an image slowly dissolve into noise. Before it can build anything, it has to see how things fall apart. By watching information fade away, the model learns how to put it all back together again, sort of like learning to fix a vase by first smashing it (not recommended for actual vases).
This gradual destruction happens step by step, using something called a Markov chain (which is just a posh way of saying each step only cares about the one before it). At each stage, a bit of random noise is sprinkled onto the image, and the process is carefully recorded. It’s not just chaos for chaos’s sake: each bit of noise is like a breadcrumb, showing exactly how the image wandered off into nonsense. By the end, the original picture is lost in a sea of static, but the model has a full record of how it got there, ready to retrace its steps back to something recognisable. This slippery slope into randomness means that any decent image can be mapped all the way to pure noise, building a handy bridge between breaking things down and putting them back together again.
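For the curious, the step-by-step noising described above can be sketched in a few lines. This is a toy version with a made-up noise schedule and a tiny stand-in "image", not a production pipeline:

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Gradually drown an image in Gaussian noise, one Markov step at a time.

    Each step only looks at the one before it:
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    """
    x = x0.astype(np.float64)
    trajectory = [x]          # the breadcrumb trail of intermediate images
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)           # a tiny stand-in "image"
betas = np.linspace(1e-4, 0.2, 50)     # schedule: gentle at first, heavier later
traj = forward_diffusion(x0, betas, rng)
# By the final step the original signal is essentially gone,
# but every intermediate state has been recorded along the way.
```

The recorded trajectory is exactly the breadcrumb trail the text describes: each entry shows one more step of the image wandering off into static.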
The real breakthrough, though, is in the reverse process. This is where the model tries to find its way back from the chaos of noise to something that actually looks like a picture. If the forward process is tumbling down a hill, the reverse is like climbing back up, except the hill is invisible. The model isn’t uncovering some hidden masterpiece; it’s just following the maths, step by step, to turn static into structure.
At this point, the model works its magic by slowly coaxing an image out of the noise, one step at a time. It’s tempting to imagine the model is just dusting off a hidden cat picture, but in truth, there’s no secret feline lurking in the static. The model simply starts a bit closer to the cat cluster than, say, the banana cluster, and then follows the maths to get there. The journey isn’t a straight line, either; it’s more like a wobbly ramble through a probability jungle, with the model adjusting its route as it goes. The end result is a brand-new image, conjured from scratch, thanks to some clever maths and a fair bit of patience. You can think of it like trying to climb a hill without a map: the model has to feel its way, step by step, towards the peak, which is where the most convincing version of the image lives.
The diffusion model does this by giving instructions to every single pixel at once, a million little nudges, all working together to make the image look better. It’s not a one-and-done job; just like our blind climber, the model has to keep checking its footing and adjusting as it goes. Each step brings the image a bit closer to something you’d actually want to look at. This whole process is possible because the model has been trained to be a sort of satnav for the hypercube, always pointing in the right direction (most of the time, anyway).
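To make the blind hill-climb a bit more concrete, here is a rough sketch of a DDPM-style reverse loop. The `predict_noise` callable is a stand-in for the trained network (purely hypothetical here; in a real system it would be a large neural network, and the update rule below is one common sampling scheme, not the only one):

```python
import numpy as np

def reverse_diffusion(predict_noise, shape, betas, rng):
    """DDPM-style sampling: start from pure noise and walk back, step by step.

    `predict_noise(x, t)` stands in for the trained model; any callable
    with that signature will do for this sketch.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)      # start hopelessly lost in static
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)       # the model's best guess at the noise
        # Nudge every coordinate at once toward the data manifold.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                       # a little fresh noise, except at the very end
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

Note how every step nudges all the coordinates simultaneously, which is exactly the "million little nudges" idea: the model checks its footing, adjusts, and repeats.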
What makes diffusion models so efficient is that they teach themselves, without needing humans to label everything. Traditional image classifiers just spit out a single label (‘cat’, ‘banana’, or ‘mystery blob’), but diffusion models generate millions of pixel-by-pixel instructions. They do this by taking a clean image, adding some noise, and then learning how to undo their own mess. It’s a bit like a child learning to tidy up after making a mess in the living room, except with more maths and fewer biscuit crumbs.
By practising on loads of images and various noisy versions, the model learns how to navigate from any random spot in the hypercube to find something that actually looks like a picture. Instead of just recognising a cat, it figures out how to get from total chaos to a cat (or whatever else you fancy). It’s like having a map that works no matter where you start, even if you’re hopelessly lost. In the end, this whole approach shows that making detailed images from random noise isn’t about copying human creativity; it’s about finding the hidden order in the maths.
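The self-supervised trick (add noise yourself, then learn to predict it) boils down to a few lines. Here's a minimal sketch assuming the standard noise-prediction objective; the closed-form jump to step `t` and the schedule values are toy choices for illustration:

```python
import numpy as np

def training_step(x0, t, alpha_bars, rng):
    """One self-supervised step: noise a clean image, then score how well
    a model undoes it. No human labels needed: the 'label' is the noise
    we added ourselves.
    """
    eps = rng.standard_normal(x0.shape)
    # Jump straight to step t of the forward process (closed form):
    #   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    # A real model would predict eps from (x_t, t); training minimises
    # the mean squared error between the prediction and the true noise.
    def loss(predicted_eps):
        return np.mean((predicted_eps - eps) ** 2)
    return x_t, eps, loss
```

A perfect prediction of the added noise scores a loss of zero; everything the model learns comes from the mess it made itself.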
Summary
M2Diffuser is a scene-conditioned diffusion model developed to generate coordinated whole-body motion trajectories for mobile manipulation tasks, utilising robot-centric 3D scans and learned expert trajectory distributions. The integration of physical constraints and task-oriented energy functions during the de-noising process enables the model to reduce execution errors and maintain high precision in both simulated and real-world robotic environments.
"In this paper, we introduce M2Diffuser, a diffusion-based, scene-conditioned generative model that directly generates coordinated and efficient whole-body motion trajectories..."
Background
Mobile manipulation presents a significant challenge for generative AI because of high-dimensional action spaces and the need for coordinated navigation and manipulation. Traditional imitation and reinforcement learning methods often fail to eliminate physical constraint violations during inference. These approaches typically require costly new data collection or model retraining to accommodate new task requirements. Additionally, earlier robotic planning methods depended on perfect environment knowledge and engineered goal proposals, limiting their scalability in real-world applications.

While generative AI has achieved notable success in text and image domains, it has not yet mastered complex robotic tasks such as mobile manipulation. This limitation is primarily due to the high dimensionality of the solution space and the stringent requirements for physical precision. Existing neural planners typically model motion as an autoregressive process, which is insufficient for capturing complex trajectory distributions. To address these challenges, the authors introduce a diffusion-based model that frames trajectory optimization as an inference problem, enabling the integration of explicit physical constraints as differentiable cost functions during de-noising.
"primarily due to the high-dimensional action space, extended motion trajectories, and interactions with the surrounding environment"
Use-case
The M2Diffuser framework is applied to three primary mobile manipulation tasks: object grasping, object placement, and goal-reaching within 3D scenes. In grasping scenarios, the robot navigates toward and secures 15 distinct types of target objects, including bottles, books, and bowls. Placement tasks require the robot to transport an object to a designated area with physical plausibility and high overlap accuracy. Goal-reaching tasks involve generating whole-body motions that ensure the robot's end effector reaches a specific target pose without error.
Real-world deployments demonstrate the effectiveness of the model in household environments for object rearrangement and handover tasks. A 10-degree-of-freedom mobile manipulator successfully performed tasks such as retrieving a tea box from a cabinet and delivering it to a table. The model also facilitates human-robot interaction by enabling the handover of items, such as chip bags or books, to seated individuals. These applications leverage robot-centric 3D scans, allowing operation in cluttered and previously unseen environments. The system exhibits strong generalizability across diverse geometric shapes and object categories in both simulation and real-world settings.
"benchmarking on three types of mobile manipulation tasks across over 20 scenes, we demonstrate that M2Diffuser outperforms state-of-the-art neural planners"

Conclusion
The authors conclude that M2Diffuser is the first scene-conditioned motion generator to successfully integrate multiple physical constraints for coordinated whole-body mobile manipulation. Future research will address the slow training and inference speeds inherent to the iterative de-noising process of diffusion models. Planned directions include exploring sampling acceleration algorithms and new noise schedules to reduce inference steps without compromising precision. Additionally, the development of smoother objective functions is intended to overcome current challenges in designing cost functions for multi-stage tasks.
"In future work, we will attempt to solve the aforementioned limitations by exploring the latest advancements in diffusion model acceleration"
The report can be found here.