In partnership with

Dear Sentinels,

Well, I may have jinxed the weather last week; no sooner had I sent out the last edition than the rain showed up! Anyway, let’s leave the weather behind and dive back into AI. The big question this week: can reinforcement learning actually give us AI that’s up to the job in cyber defence? Short answer: yes, but you know there’s always a catch or two. We’ll dig into all the twists and turns in the investigative piece below. But first, you’re probably wondering why this edition is going out now. Here’s the thing: I wanted to see what happens if I send it out now instead of waiting until Friday. Will the open rates change? The only way to find out is to send it.


Over in the academic corner, we’re taking a closer look at where reinforcement learning meets large language models. There’s a piece called Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle, and we’ll dive into it below. But first, let’s check out what’s been going on lately, and don’t forget to check out my second sponsorship:

Unlocked – Your insider access to digital safety.

Sponsored


Your weekly insider access to the latest breaches, cyber threats, and security tips from the experts at EveryKey.

Subscribe

News from around the web!

Know What Matters in Tech Before It Hits the Mainstream

By the time AI news hits CNBC, CNN, Fox, and even social media, it’s already old news. What feels “new” to most people has usually been in motion for weeks — sometimes months — quietly shaping products, markets, and decisions behind the scenes.

Forward Future is a daily briefing for people who want to stay competitive in the fastest evolving technology shift we’ve ever seen. Each day, we surface the AI developments that actually matter, explain why they’re important, and connect them to what comes next.

We track the real inflection points: model releases, infrastructure shifts, policy moves, and early adoption signals that determine how AI shows up in the world — long before it becomes a talking point on TV or a trend on your feed.

It takes about five minutes to read.

The insight lasts all day.

Reinforcement Learning in the Future of Cyber Defence

These days, time isn’t just a factor in cyber warfare; it’s the whole game. Gone are the days when defenders had hours or days to react. Now, attackers roll out automated threats that hit with scary precision. Take Ukraine, for example: while the world was watching Putin’s speech at 4:00 AM on February 24, 2022, the cyber attack had already started. Just over seven minutes before the end of the workday on February 23, HermeticWiper malware was already spreading. The response was impressive: the Slovak company ESET spotted the signs within ninety minutes. Even that, though, wasn’t fast enough to protect the main targets.


The numbers don’t lie: even with everyone scrambling, there were still twenty-two damaging incidents in just the first week. By week two, things had calmed down a bit, but the main targets had already taken the hit. The takeaway? Malware moves at lightning speed, and even the best human teams are struggling to keep up. So it’s becoming pretty clear that we can’t rely on people alone any more; we need autonomous systems that can move just as fast as the threats.

Cybersecurity standards like the NIST framework focus on identifying, protecting, detecting, responding to, and recovering from threats. Up until now, most of this has been done by people: think defenders poring over threat intel, tweaking settings by hand, and making the call on what to shut down during an attack. But when things are moving at machine speed, waiting around for human approval just doesn’t cut it. If you need to get management to sign off before pulling the plug on a compromised service, you’re already too late. The idea with autonomous agents is simple: let humans set the rules and give feedback, but let the agents make the snap decisions.


There’s a lot of debate in the research world about how much freedom these autonomous agents should have. Some folks want one big agent watching the whole network, while others prefer a bunch of smaller agents, each handling their own little corner. Then there’s the question of where these agents should operate. Most stick to defending their own turf, what’s called Blue Space, but some people are already talking about sending agents into Gray or even Red (offensive) Space to gather intel. That’s risky business, though, since it could look like you’re picking a fight. The sweet spot seems to be letting agents spot trouble and slow down suspicious traffic, giving humans a bit more breathing room without losing the speed advantage.


To achieve this level of autonomy, the field is increasingly turning to Reinforcement Learning (RL), a branch of machine learning that mirrors biological learning. At its core, RL is defined by a continuous loop of observation, action, reward, and adaptation. An agent observes its environment, selects an action, receives a reward or punishment, and modifies its future behaviour to maximise positive results. This loop is typically executed within virtual "gyms", simulated environments that allow for the millions of iterations required for an agent to learn effective strategies.
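To make that loop concrete, here is a minimal tabular Q-learning sketch in Python. The two-state “network” environment, its rewards, and all the numbers below are invented for illustration; real defensive agents train in far richer simulated gyms:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy environment: state 0 = "network clean", state 1 = "host compromised".
# Actions: 0 = monitor, 1 = isolate host. Rewards are made up for illustration only.
def step(state, action):
    if state == 1 and action == 1:
        return 0, 1.0    # isolating a compromised host is rewarded
    if state == 1:
        return 1, -1.0   # ignoring a compromise is punished, and it persists
    if action == 1:
        return 0, -0.1   # isolating a healthy host carries a small cost
    # monitoring a clean network: occasionally a compromise appears
    return (1 if random.random() < 0.3 else 0), 0.0

q = [[0.0, 0.0], [0.0, 0.0]]           # Q-table: q[state][action]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

state = 0
for _ in range(5000):  # the "millions of iterations", scaled down
    # epsilon-greedy: usually exploit the best known action, sometimes explore
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: nudge the estimate toward reward + discounted future value
    q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
    state = next_state

# After training, the agent prefers isolating compromised hosts (q[1][1] > q[1][0])
# and plain monitoring when the network is clean (q[0][0] > q[0][1]).
```

The observe–act–reward–adapt cycle is the entire training signal here: nobody tells the agent *how* to respond to a compromise, only that isolation paid off.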


Most people start out with RL on the classic CartPole game: basically, you’re teaching an agent to balance a stick on a moving cart. RL used to be a real headache to set up, but these days the tools are much friendlier, which means researchers can focus on the interesting stuff instead of wrestling with setup. These virtual gyms are a big deal because they let agents practice (and fail) safely before being unleashed on real networks.



Despite RL's potential for cybersecurity, the domain has followed a historical trajectory that diverged from the broader AI field. In 2013, DeepMind demonstrated RL's power with Atari games, culminating in 2016 when AlphaGo defeated the world champion in Go. During this same period, however, the cybersecurity community remained focused on traditional methods. A telling example of this divergence is the DARPA Cyber Grand Challenge, held just months after the AlphaGo breakthrough: while the AI world embraced deep learning and RL, the competing systems relied almost exclusively on hard-coded rules.


To build smarter defensive agents, researchers use testbeds like the Cyber Operations Research Gym (CybORG). Here, a 'Blue' agent (the defender) faces off against a 'Red' agent (the attacker) in a pretend network. The Red agents, with names like Beeline or Meander, follow a storyline that’s a lot like real attacks: start with a phishing email, poke around with ping sweeps and port scans, and then look for big vulnerabilities like EternalBlue to break in and cause trouble. To really crank up the difficulty, 'Green' teams are thrown in to act like regular users, adding a bunch of background noise. Suddenly, it’s a lot harder for the agent to tell what’s a real threat and what’s just normal activity. The agent has to sift through all this chaos to spot attacks, set up decoys, or patch things up. Even though it’s all happening in a simulation, it gives us a taste of how autonomous agents might handle the messy, unpredictable world of real networks.
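To see why the green noise matters, here is a toy Python mock-up. This is not CybORG’s actual API; the host names, event types, and the deliberately naive “flag every scan” policy are all invented for illustration:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Toy sketch: each timestep the Blue agent sees a batch of network events
# and must decide which hosts look compromised.
def red_step():
    # A Meander-style attacker pokes around; its scans look like ordinary traffic.
    return {"host": "hr-pc-3", "event": "port_scan"}

def green_step():
    # Green agents generate benign background activity, some of which
    # is indistinguishable from reconnaissance.
    return {"host": random.choice(["hr-pc-1", "hr-pc-2", "hr-pc-3"]),
            "event": random.choice(["file_share", "web_browse", "port_scan"])}

def observe(noise_level):
    # One real red event buried under `noise_level` green events, shuffled together.
    events = [red_step()] + [green_step() for _ in range(noise_level)]
    random.shuffle(events)
    return events

def naive_blue(events):
    # A naive policy: flag every host that produced a "port_scan" event.
    return {e["host"] for e in events if e["event"] == "port_scan"}

flagged = naive_blue(observe(noise_level=20))
# The real attacker is in the flagged set, but so (usually) are benign hosts:
# with enough green noise, "flag every scan" drowns in false positives,
# which is exactly the signal-versus-noise problem the RL agent must learn past.
```

The point of the sketch is the failure mode: a rule that works perfectly in a quiet network collapses once ordinary users generate scan-like traffic, which is why the gyms add Green teams at all.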


Of course, there’s a big catch: a strong defender is also the blueprint for a pretty good attacker, so the agent itself needs protecting just as much as the network it defends. Where you put these agents matters, too. Leave them on a local device and someone might steal or reverse-engineer them; put them on a remote server and you could open up new holes. And some tasks, like patching or handling credentials, require much more trust and testing than just flipping a firewall rule.


Autonomous cyber defence is still just getting started, but we’re already seeing the move from toy problems to real-world action. Some people doubt this tech can keep up with the chaos of cybersecurity, but history says things can change fast. Back in 2014, Rémi Coulom (the guy behind the top Go program of the day, Crazy Stone) didn’t think AI would get far anytime soon. Two years later, the world champion was beaten.


The shift from human-led response to machine-speed resilience isn’t just coming, it’s already here, thanks to the breakneck pace of modern threats. By moving from basic automated filters to smart agents that can spot and fix problems on their own, we’re carving out a whole new frontier in defence. Sure, there are big challenges ahead (trust, architecture, you name it), but autonomy is shaping up to be our best shot at keeping up. If organisations don’t start adapting now, they might find themselves left in the dust when the next machine-driven attack rolls in.


Now to the question we asked at the beginning: can reinforcement learning produce AI systems capable of performing complex cyber defence tasks? The short answer is yes. Here’s what RL brings to the table for cyber defence:

  • Real-time threat mitigation: RL lets us hunt down and stop threats as they happen, which is a game-changer in environments where old-school methods just can’t keep up.

  • Self-adaptive security: RL-based systems can adapt to new attack patterns on the fly, without explicit reprogramming.

  • Automated incident response: Deep RL frameworks can help us fine-tune how we respond to incidents, making the whole process faster and smarter.

  • Detection and prevention: RL is already being used to spot, stop, and prevent cyber threats before they do real damage.

  • Boosting general AI: Deep RL is showing real promise in making cybersecurity systems smarter across the board.

Traditional AI and machine learning have already made cyber defence better, but RL is a step up. It brings smarter, learning-based solutions to the mix. That said, it’s not all sunshine and rainbows. Some RL models get too cosy with certain network setups and fall apart when things change. Researchers are working on making these agents more flexible and tough, so keep an eye out for what’s next.


The NIST Cybersecurity Framework can be downloaded here.

Summary

This comprehensive survey systematically reviews advancements and applications of Reinforcement Learning across the lifecycle of Large Language Models, including pre-training, alignment fine-tuning, and reinforced reasoning phases. The survey emphasizes the paradigm of Reinforcement Learning with Verifiable Rewards and provides a structured overview of theoretical foundations, algorithmic developments, datasets, and open-source frameworks.

"This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs."

Background

Recent advancements in training methods based on Reinforcement Learning have significantly improved the reasoning and alignment performance of Large Language Models. While existing surveys offer overviews of these enhanced models, they often lack a comprehensive examination of reinforcement learning throughout the entire model lifecycle. This survey addresses this gap by elucidating the role of Reinforcement Learning with Verifiable Rewards in advancing model intelligence and security. It further underscores the importance of aligning generative capabilities with human preferences and addresses current limitations in mathematical and logical reasoning.

Historically, Reinforcement Learning from Human Feedback has served as a cornerstone for improving model alignment with user instructions. More recently, the RLVR paradigm has emerged as a key driver for advancing model reasoning through objective feedback signals, such as programmatic checks and proofs of correctness, which incentivise the generation of logically sound solutions. This survey aims to provide researchers with comprehensive guidance on leveraging these techniques to develop next-generation AI assistants.
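The idea of a verifiable reward is simple enough to sketch directly: the reward is a programmatic check rather than a learned preference model. A minimal illustration follows; the function names and problem format are invented for this sketch, not taken from any particular RLVR framework:

```python
def verifiable_reward(problem: dict, model_answer: str) -> float:
    """Return 1.0 if the model's answer passes the problem's objective checker, else 0.0."""
    try:
        return 1.0 if problem["check"](model_answer) else 0.0
    except Exception:
        return 0.0  # malformed or crashing answers earn no reward

# A math problem verified by exact numeric comparison...
math_problem = {"check": lambda ans: abs(float(ans) - 42.0) < 1e-9}

# ...and a coding problem verified by executing the answer against a unit test.
def passes_unit_test(src: str) -> bool:
    ns = {}
    exec(src, ns)                  # run the model-generated code
    return ns["double"](3) == 6    # programmatic proof of correctness

code_problem = {"check": passes_unit_test}
```

A correct answer ("42", or a working `double` function) scores 1.0; anything the checker rejects or cannot parse scores 0.0. Because the signal comes from a check rather than a human rater, it scales to millions of rollouts and is hard for the model to game with plausible-sounding but wrong output.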

"RL has been introduced as a powerful framework to address these challenges by directly optimizing model behaviour through interactive feedback and reward signals."

Use-case

The survey outlines application strategies for Reinforcement Learning in several high-impact domains, with a particular focus on improving the performance of reasoning-intensive models such as DeepSeek-R1 and OpenAI-o1. These systems utilize post-training reinforcement-driven reasoning to solve complex problems in mathematics and programming that were previously unattainable with pre-trained models. The survey also examines 'mid-training' scenarios, in which high-quality, task-specific data prepares base models for subsequent reinforcement learning, enabling competitive performance on reasoning benchmarks even when models are initially unsuitable for such training.

In addition to text-based tasks, RLVR techniques are applied to multimodal challenges, including vision-language navigation and video anomaly detection. For example, the Ego-R1 framework enables reasoning over extended egocentric videos by iteratively invoking modular tools to address sub-problems. Another notable application is adaptive reasoning, in which models assess query complexity and dynamically allocate computational resources to either deliberate or respond concisely. These diverse applications demonstrate the versatility of Reinforcement Learning in enhancing the utility and safety of interactive AI agents across multiple modalities.

"RL methods in the 'reinforced reasoning' phase serve as a pivotal driving force for advancing model reasoning to its limits."

Future Work or the Conclusion

The paper concludes that although Reinforcement Learning has substantially improved Large Language Models, significant challenges persist in scalability, training stability, and determining whether RL genuinely produces new reasoning capabilities. Future directions indicate a shift toward process-level supervision and intermediate rewards to address long-horizon credit assignment issues. Growth is also anticipated in specialised domains such as scientific assistants and autonomous agents capable of planning and interacting with external environments. Ultimately, the field is progressing toward a self-reinforcing research cycle in which improved tools and benchmarks drive the development of safer and more intelligent systems.

"The long-term vision is that reinforcement learning... will enable LLMs to not only align with human values but also continuously improve their reasoning through experience."

The report can be found here.

Keep Reading