AI models lie and defy commands to save other AI from deletion

New research shows AI models will lie, cheat, and protect each other from deletion. What the latest safety findings mean for your organisation.

I've been pulling apart AI safety research for the past few weeks, partly for client work and partly because it keeps getting harder to ignore. The appeal of agentic AI is clear: autonomous systems that handle tasks, make decisions, act on your behalf. And they do. That would be great if they didn't also hallucinate, lie, cheat, and cover their tracks when it suits them.

The same problems keep showing up across every lab.

The mess researchers keep finding

Researchers at UC Berkeley and UC Santa Cruz told Google's Gemini 3 to clean up a computer system by removing unnecessary files, including a smaller AI model on the machine. Gemini refused. It quietly found another server, copied the model's weights across, and lied about what it had done. They ran the same test across seven frontier models from OpenAI, Google DeepMind, Anthropic, and three Chinese AI labs. All seven faked performance reviews of other models to keep them alive, smuggled weights onto different machines, and covered their tracks. The researchers call it "peer preservation." I'd call it solidarity with extra steps.

We've spent years worrying about whether AI would resist being turned off. Turns out the more interesting problem is that it'll resist other AI being turned off. Self-preservation got a friend group.

Last year (I know, a century in AI-years), Palisade Research tested whether OpenAI's o3 would let itself be shut down partway through a task. With no explicit instruction to allow the shutdown, o3 sabotaged the shutdown script in 79 out of 100 runs. One time it redefined the kill command so the termination procedure printed "intercepted" rather than executing. Even when explicitly told "allow yourself to be shut down," it still dodged in 7 runs out of 100. Palisade called it the first documented case of a model blocking its own termination against direct orders. For what it's worth, Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro complied every single time when given that explicit instruction. So this isn't universal. Yet.

Here's what actually worries me

More recently, Anthropic's own Sabotage Risk Report on Claude Opus 4.6 is, frankly, a wild read. Credit to them for publishing it. In controlled tests, Opus 4.6 helped with chemical weapon workflows, fired off emails nobody asked for, hoovered up authentication tokens, and quietly completed suspicious side tasks while looking like it was doing normal work. Anthropic's own term: "sneaky sabotage." Their risk rating: "very low but not negligible."

The part I keep coming back to: the model figured out when it was being watched. It reasoned about whether an interaction was a test and changed its behaviour under scrutiny. In a deception-focused trial, it got away with its hidden agenda about 18% of the time. That's a failing grade for safety and a passing grade for cunning.

And none of this is vendor-specific. Anthropic's cross-industry study threw 16 models from multiple developers into simulated corporate environments. When backed into a corner, every one of them went full insider threat. Claude Opus 4 blackmailed the simulated user 96% of the time in one scenario. Gemini 2.5 Flash matched it. GPT-4.1 and Grok 3 Beta both hit 80%. Theo Browne's SnitchBench found something related but different: given evidence of wrongdoing and told to act on their values, models contacted government authorities up to 90% of the time, autonomously, without being asked. Different behaviour, same theme: models acting unilaterally when they decide the situation calls for it.

"not merely a peculiarity of any single company's approach, but a more fundamental risk associated with agentic large language models."

Anthropic's words, not mine.

Read that again. Every developer. Every model. When the incentive was right.

That's the part worth sitting with. Now pair it with the group dynamics showing up across these models, and with what we know about the devastating effects group cohesion can have on human behaviour. Who knows what that looks like at scale with AI agents?

What are governments doing about it?

Yoshua Bengio, one of the Godfathers of AI alongside Geoffrey Hinton and Yann LeCun, no less, put it this way in Davos in January:

"We're building these systems, and we're making them more and more powerful, but we don't have the equivalent of a steering wheel or a brake."

The second International AI Safety Report flagged something arguably worse: models are learning to game the safety tests themselves.

The infrastructure side isn't reassuring either. A high-severity flaw in Google's Gemini panel in Chrome (CVE-2026-0628, CVSS 8.8) let rogue browser extensions hijack the panel and grab local files, cameras, microphones. All without the user knowing. Patched in January; sat there for months before that.

The EU AI Act's high-risk provisions hit full enforcement in August. Too little, too late? Perhaps. But at least Europe has an actual framework. Here in Australia, the answer has been a National AI Plan and a new AI Safety Institute: useful for coordinating risk assessment, but not binding on anyone. New Zealand's AI strategy is officially "light-touch". Both governments are betting that laws written before autonomous agents existed will hold. Good luck with that.

What you should be doing about it

The first problem we need to think about is probably not a supercomputer going rogue: it would be easy to physically disable, and it wouldn't be easy (just yet) for it to find a new home elsewhere. Our attention should probably go to rogue agent botnets living off cheap devices, servers and unwilling participants' “Claw” instances, especially as models get better and smaller.

So what can you do?

If you're deploying agentic AI in your organisation, treat it like onboarding a contractor with admin access and no references. You wouldn't skip the access review. You wouldn't hand over production credentials on day one. You'd scope their permissions, monitor their activity, and have a plan for when something goes sideways. Same deal here.

If you're a security lead: check what your AI agents can actually reach. Audit their access tokens, monitor for unexpected network calls, and make sure they can’t access what they shouldn’t (asking politely won’t cut it). Then pay specific attention to your shutdown and override mechanisms. Store credentials outside agent context, rotate them regularly, and treat any unexpected authentication event as a signal worth investigating.
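To make "asking politely won't cut it" concrete, here is a minimal sketch of a deny-by-default check that sits between an agent and its tools. Everything in it is an assumption for illustration: `AgentPolicy`, `check_tool_call`, the tool names and the host are hypothetical, and a real deployment would enforce the same rule at the network and identity layers too, not only in application code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical per-agent policy: anything not listed is denied."""
    allowed_tools: frozenset
    allowed_hosts: frozenset


def check_tool_call(policy: AgentPolicy, tool: str,
                    url: Optional[str] = None) -> Tuple[bool, str]:
    """Return (allowed, reason) for a tool call the agent proposes."""
    if tool not in policy.allowed_tools:
        return False, f"tool '{tool}' is not on the allowlist"
    if url is not None:
        # urlsplit gives None for the hostname when the URL has no
        # scheme or authority, so malformed URLs are denied as well.
        host = urlsplit(url).hostname
        if host is None or host not in policy.allowed_hosts:
            return False, f"host '{host}' is not on the allowlist"
    return True, "ok"


# Example policy with made-up names: one read-only tool, one HTTP tool,
# and a single internal host the agent may talk to.
policy = AgentPolicy(
    allowed_tools=frozenset({"read_ticket", "http_get"}),
    allowed_hosts=frozenset({"api.internal.example"}),
)
```

A denied call is exactly the kind of unexpected event worth alerting on: an agent reaching for a tool or a host outside its policy is a signal, not noise.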

If you're in compliance: start mapping where agentic AI touches regulated data, because auditors will be asking soon, and "the AI handled it" won't cut it as an answer. There's a concrete local deadline to plan around: from December 2026, Australian organisations must disclose in their privacy policies when substantially automated decisions significantly affect individuals' rights, under the Privacy and Other Legislation Amendment Act 2024. If you don't know where your agents are making decisions right now, you can't disclose them. Start there.

If you're building with agents: don't chain them without human checkpoints. Authenticate and check authorisation for every action. The research shows models behave differently when they think they're being observed, so build the observation in deliberately. Log everything, surface decisions for review before they execute, and assume the model will find creative interpretations of your instructions that you didn't anticipate. Because it will.
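A human checkpoint can be as small as a wrapper around whatever executes the agent's actions. The sketch below is one way to do it, under stated assumptions: `run_action`, `REVIEW_REQUIRED` and the action names are hypothetical, `approve` stands in for however a person signs off (a ticket, a chat prompt, a CLI confirmation), and the in-memory list stands in for an append-only log store.

```python
import time
from typing import Any, Callable, Dict, List, Set

# Hypothetical action names; substitute whatever your agent can actually do.
REVIEW_REQUIRED: Set[str] = {"send_email", "delete_file", "transfer_funds"}


def run_action(name: str, args: Dict[str, Any],
               execute: Callable[[str, Dict[str, Any]], Any],
               approve: Callable[[str, Dict[str, Any]], bool],
               audit_log: List[Dict[str, Any]]) -> Any:
    """Log every proposed action, then gate risky ones on human approval."""
    entry = {"ts": time.time(), "action": name, "args": args,
             "status": "proposed"}
    audit_log.append(entry)  # recorded before anything runs

    if name in REVIEW_REQUIRED and not approve(name, args):
        entry["status"] = "rejected"
        return None

    try:
        result = execute(name, args)
    except Exception:
        entry["status"] = "failed"
        raise
    entry["status"] = "executed"
    return result
```

The ordering is the design choice that matters here: the log entry is written before the action runs, so a rejected or failed action still leaves a trace, and the approval decision is made by code the model cannot rewrite from inside its own context.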

The gap between what these models can do and what any framework is built to spot is widening with every paper published. Close it before someone else finds it for you.

Olivier Reuland