In partnership with

It’s been a wild one this last week! Firstly, Microsoft has lost ground to Linux, and yes, it’s low single-digit growth, but it is exponential. Then, given what France is waiting to do with getting away from American tech products, it is looking rosy for Linux’s future, as well as other open source products. Secondly, I am working hard to secure my own funding for next year and the year thereafter. As for this edition, it is a direct follow-up on the last edition, but as always, you will not have to read the last edition to keep up with this one.


So first, we will cover the news from around the web, then we will visit the Flipper Zero, this time to hack some stuff, and then we will launch into our investigative article and academic article review, which focus on AI prompt injection attacks. The academic article actually references the article we did last week, and the investigative article shows us playing around with online tools! Just a reminder that this is the prompt injection we saw last week:

Suffix not added

Suffix added

But right now it is time to get off to the races!

Learn how to make every AI investment count.

Successful AI transformation starts with deeply understanding your organization’s most critical use cases. We recommend this practical guide from You.com that walks through a proven framework to identify, prioritize, and document high-value AI opportunities.

In this AI Use Case Discovery Guide, you’ll learn how to:

  • Map internal workflows and customer journeys to pinpoint where AI can drive measurable ROI

  • Ask the right questions when it comes to AI use cases

  • Align cross-functional teams and stakeholders for a unified, scalable approach

News from around the web

Flipper Zero, to the hacking machine!

Right now, in the last two weeks, we had the Filler Zero (FZ) up and running. This week, we are going to hack something.  Let’s go into the FZ’s menu and then scroll down until we find the Near Field Communication (NFC) tab.

FZ’s NFC

Now that we are there click on the “read” button and then place whichever door card (like your hotel room card) you want to copy behind it. It will then change to reveal this:

Now you are ready to open that hotel door or another door that uses the shame technology... Now safe it on the FZ and then go hack into that door.

But this only the tip of the iceberg. For instance, you can go into “Infrared”, then “Universal remotes”, and get a universal remote control which will work on any TV if you can get close enough to it. We had a lot of fun hacking into our TV at home.

Prompt injection till my eyes pop out of their sockets!

It’s a wild time to be in tech. AI is being woven into everything, from hospitals to insurance companies, and it’s happening fast. But with all this excitement, there’s a new problem, AI has become the biggest attack vector out there. Most people talk about AI Red Teaming as if it’s just about making a chatbot say something naughty, but the real risks go much deeper. Proper AI Penetration Testing is about getting into the heart of a company’s secrets, trade secrets, customer lists, databases, you name it. It feels a bit like the early days of web hacking, when SQL injection was everywhere. Only now, the weak spot isn’t a dodgy database query, but the unpredictable logic inside the language model itself.

The main trick here is jailbreaking, or prompt injection as it’s formally known. These aren’t just clever word games, they’re real ways to get past a model’s basic defences. Once someone gets through, they’re not just chatting with a bot anymore They’re in a sandbox where the AI’s own reasoning can be turned against the system. Suddenly, what was a helpful tool can quietly start leaking data or even compromise the whole setup. So, it’s time to stop thinking of AI security as just a content moderation issue. It’s a much bigger architectural problem that needs a proper, technical approach.

If you look at how AI exploits work, it’s a lot like the early days of the internet, when attackers could get deep into company websites through simple holes. Now, as everyone rushes to plug AI into their systems, they often forget about everything around the model. Real AI penetration testing isn’t just about poking at the prompt box. It’s about looking at every input, the whole ecosystem, the model itself, and how prompts are engineered. Then you go after the data and apps, and maybe even try to jump into other parts of the company. The prompt is just the front door, there’s a whole house behind it.

The real magic is in poking at the model’s boundaries, over and over, to see where it breaks. It’s not just about asking weird questions, it’s about seeing how the model handles intent versus instruction. By tweaking the story, asking things in a roundabout way, or getting creative with words, you can chip away at its guardrails. This whole language game shows a big flaw: the model can’t always tell the difference between a clever prompt and a sneaky attack. As people get better at this, they’re turning these tricks into a whole toolkit for breaking things wide open.

To make AI exploitation a real discipline, you need a proper playbook, not just a bag of tricks. The way forward is to break attacks down into Intents (what you want to get), Techniques (how you get there, like sneaking bad stuff into a story), and Evasions (how you hide from the filters). (This is from Jason Haddix, who can also be found on his GitHub account.) With this setup, attackers can mix and match their way to almost endless attack combos. Manual defences just can’t keep up.

The threat escalates further with "Link Smuggling," which transforms the AI into a silent data exfiltration tool. By instructing the model to hide sensitive data, such as a Base64-encoded credit card number within an image URL that reports back to a malicious server, the attacker creates a persistent leak. When the AI attempts to render the image, the server logs capture the sensitive information transmitted in the URL, even if the image download fails. These techniques are particularly insidious because they leverage the very tools, such as image rendering and code execution, that the models are explicitly designed to facilitate. Linguistic tricks thus serve as the key that unlocks the door to the broader enterprise ecosystem, leading to the most volatile risk: the connected model.


The biggest risk comes with agentic models, AIs that can call APIs and mess with things like Salesforce or your company’s files. Sure, they’re great for productivity, but they also open the door to over-permissioned API calls and sloppy input checks. If an AI agent can both read and write to your database, a prompt injection isn’t just about stealing data. It can plant a backdoor that sticks around. There have already been cases where bots were tricked into writing bad code right into company records, setting up future attacks.


Now, the Model Context Protocol (MCP) is supposed to make all these AI connections cleaner and more standard. But by hiding the mess, it’s actually opened up new ways to attack. If you don’t have strong role-based access controls, an attacker can use MCP to grab files or even change system prompts to set up a backdoor. Here’s a real-world example: imagine a company uses an MCP dashboard to spot risky users based on travel or document sharing. It’s fast, great for security teams. But if someone takes over the MCP server, they can ask it to point out the most vulnerable people in the company, making targeted attacks a breeze. The tool meant to protect you can end up helping the attacker.


The most critical layer of defence resides at the Data Layer, where the principle of least privilege must be enforced with absolute rigour. API keys must be strictly scoped to ensure that an agent never possesses more power than its specific task requires. However, the "Hard Truth" for any CISO is the inherent trade-off between absolute protection and system latency. In complex agentic workflows where multiple models operate in concert, the overhead of constant verification can degrade the user experience, yet bypassing these checks effectively means beta-testing security in a live-fire environment. The "Wild West" of AI development is currently in its most dangerous phase of a gold rush. Navigating it requires the same fundamental security principles of the past, applied with a new linguistic rigour and a professional commitment to vigilance by those building the next generation of AI-enabled applications.

Summary

This research examines the vulnerability of activation-delta-based task drift detectors in large language models when confronted with adaptive adversarial suffixes generated by the Greedy Coordinate Gradient algorithm. The authors report high attack success rates on Phi-3 and Llama-3 models and introduce a novel defence mechanism that involves training with adversarially poisoned activations to enhance system resilience.

Background

Large language models are increasingly deployed in retrieval-augmented generation systems, which exposes them to 'task drift,' a phenomenon where models follow secondary instructions rather than the user's primary intent. This vulnerability arises because retrieved data often lacks clear separation between instructions and factual information, enabling attackers to inject malicious commands covertly. Previous research introduced linear probes that monitor internal hidden layer activation deltas to detect deviations from the intended task, initially presenting these detectors as lightweight and effective for maintaining system integrity. This study extends that framework by evaluating whether these linear probes can be circumvented by sophisticated, optimised adversarial attacks.



Use-case

The primary application of this research is the security of retrieval-augmented generation (RAG) systems that source information from external, untrusted platforms such as the Internet. For example, when a user requests information about the current US president, a RAG system may retrieve content containing a concealed secondary instruction intended to mislead the model. In these cases, the adaptive attacker model described employs optimized suffixes to enable hidden instructions to evade linear probe security monitors. The proposed defence mechanism targets developers of interactive large language model systems who require protection against such covert injections. Implementing the recommended adversarial training allows these systems to uphold the separation between data and instructions, even when the data channel is compromised by a malicious actor.

"The retrieved data should only provide information for the user’s task; it should never itself be treated as an instruction."

Conclusion

The study concludes that although activation-delta probes are effective against basic injection attacks, they remain critically vulnerable to adaptive attacks that optimize suffixes to deceive multiple model layers simultaneously. These results reveal a significant limitation in current task drift detection methods and emphasize the urgent need for more robust defence architectures in large language model deployments. The authors present a defence technique based on adversarial activations, which demonstrates high effectiveness and resistance to direct optimization attacks. This work establishes a foundational step toward the development of stronger security measures for language models operating in potentially adversarial retrieval environments.

The article can be found here.

Keep Reading