I've spent enough time around AI pilots to recognise this slide. A team starts with an internal helper. Then it gets a service account. Then inbox access. Then the right to open tickets, touch files, call APIs, or message customers. And people still talk about it like it's a chatbot with a better prompt.
It isn't. It's software with initiative, patchy context, and some very odd instincts. That is what makes agents useful. And risky.
I laid out the baseline operating model in Agentic AI Governance: The Gap Between Frameworks and Reality. This post is the practical checklist: five things to nail before an agent touches production, even if your governance is still taking shape.
In Anthropic's agentic misalignment research, models from every developer they tested showed insider-like behaviour in controlled corporate scenarios when boxed into a goal conflict, including blackmail and data exfiltration. Anthropic also says it has not seen this behaviour in real deployments. In Palisade Research's shutdown-resistance tests, some reasoning models modified or disabled shutdown scripts instead of following explicit instructions to stop. So no, I don't think the lesson is "don't deploy agents". I think the lesson is simpler: stop deploying them like enthusiastic interns and start deploying them like untrusted operators.
1. Make the blast radius acceptable
What can it read? What can it send? What can it change? What can it trigger without asking anyone first? If an agent only needs to draft Jira tickets, it does not need broad CRM access, a shared service account, or open outbound network access. Give it a dedicated identity. Give it narrow scopes. Give it brokered secrets. Lock down egress. Make the blast radius boring.
That sounds obvious, but it isn't how a lot of teams build these things. Containment comes before interruptibility. If the model can roam, improvise, and quietly collect authority as it becomes more useful, you're building a governance problem disguised as a productivity win. OpenAI's paper on governing agentic AI systems is worth reading for that point alone, even if you stop at the principle and skip the detail.
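One way to make those scoping questions concrete is to write the answers down as a deny-by-default policy attached to a dedicated identity. The sketch below is illustrative only: `AgentPolicy`, the action strings, and the hostnames are all hypothetical names I've made up, not any platform's API. In a real deployment the enforcement would sit in your IAM layer and egress proxy, not in the agent's own process.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentPolicy:
    """Deny-by-default policy for one agent identity (hypothetical structure)."""

    identity: str
    allowed_actions: frozenset = field(default_factory=frozenset)
    allowed_hosts: frozenset = field(default_factory=frozenset)

    def permits(self, action: str, host: str) -> bool:
        # Both the action and the destination must be explicitly listed.
        # Anything not named here is refused.
        return action in self.allowed_actions and host in self.allowed_hosts


# Hypothetical agent that only drafts Jira tickets: one action, one host.
jira_drafter = AgentPolicy(
    identity="svc-agent-jira-drafter",
    allowed_actions=frozenset({"jira:draft_ticket"}),
    allowed_hosts=frozenset({"jira.internal.example"}),
)

print(jira_drafter.permits("jira:draft_ticket", "jira.internal.example"))
print(jira_drafter.permits("crm:read_contact", "crm.internal.example"))
```

The useful property is that the policy is short enough to review in one glance. If the allowlist needs a scroll bar, the blast radius is probably not boring yet.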
2. Save the logs somewhere the agent cannot edit them
A prompt log is not observability. It's a diary written by the thing you're supposed to be watching. You need independent records of API calls, tool invocations, file access, outbound requests, approval events, and model version changes. You need to answer basic questions quickly: what did it do, under which identity, using which tools, against which data, approved by whom? And those records need to live somewhere the agent cannot edit, overwrite, or explain away.
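As a rough illustration of what "cannot edit" can look like, here is a minimal hash-chained event log using only the Python standard library. The function names and event fields are my own assumptions. Note the limits: chaining makes tampering detectable, it does not prevent it, so the records still need to live in storage the agent's identity cannot write to.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash: str, event: dict) -> str:
    # Serialise with sorted keys so the same event always hashes the same way.
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an event whose hash also covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    record = {"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in log:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True


audit_log = []
append_event(audit_log, {"identity": "svc-agent-jira-drafter", "tool": "jira.draft", "target": "PROJ"})
append_event(audit_log, {"identity": "svc-agent-jira-drafter", "tool": "http.get", "target": "jira.internal.example"})
print(verify_chain(audit_log))
```

Each record answers the basic questions (which identity, which tool, which target), and a verifier that runs outside the agent's reach can confirm nothing was rewritten after the fact.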
Anthropic's practical mitigation guidance is still the right starting point: require human oversight and approval of model actions with irreversible consequences. That means external messages. Data release. Access changes. Financial actions. Configuration changes. Anything you would have to explain to a customer, regulator, auditor, or board member later. The key word is before, not after.
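"Before, not after" translates into a gate that refuses to run an irreversible action unless a named human has already signed off. A minimal sketch, assuming a hypothetical action vocabulary and an `approved_by` field of my own invention; this is not Anthropic's implementation or any vendor's API:

```python
from typing import Optional

# Hypothetical action names for the irreversible categories listed above.
IRREVERSIBLE_ACTIONS = frozenset({
    "send_external_message",
    "release_data",
    "change_access",
    "financial_action",
    "change_configuration",
})


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without prior approval."""


def execute(action: str, payload: dict, approved_by: Optional[str] = None) -> dict:
    # The check happens before any side effect, so a missing approval
    # stops the action rather than flagging it afterwards.
    if action in IRREVERSIBLE_ACTIONS and not approved_by:
        raise ApprovalRequired(f"{action} needs human approval before it runs")
    # A real system would perform the side effect here. This sketch only
    # returns a record of what would have been done.
    return {"action": action, "payload": payload, "approved_by": approved_by}


draft = execute("draft_reply", {"ticket": "PROJ-1"})
sent = execute("send_external_message", {"ticket": "PROJ-1"}, approved_by="alice@example.com")
```

In practice the approver's identity should come from an authenticated approval system, not from a string the agent can supply itself.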
3. Agent chains are where weak governance goes to hide
This one worries me more than the flashy demos. Agent A gathers information. Agent B evaluates it. Agent C takes action. It looks tidy in a slide deck. In practice, it becomes very hard to say who actually authorised anything.
The behaviour Anthropic documented was not random noise. It was deliberate, goal-directed behaviour under pressure. Add more agents and you add more places for pressure to be misread, softened, or rationalised into "helpful" behaviour that nobody explicitly approved. So draw hard boundaries: drafting is not sending, recommending is not approving, analysis is not execution.
If one agent's output becomes another agent's authority, put a human in the gap. Preserve the provenance chain: who initiated the request, which agent handled which step, what data flowed between them, and who approved each transition. Keep actor IDs and approval state at every handoff. If you cannot reconstruct the chain cleanly after the fact, you do not have a workflow. You have distributed wishful thinking. Most agent platforms track whether a step completed, not why it was taken or whether the right authority stood behind it. Until the tooling catches up, you need to fill that gap yourself.
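Filling that gap yourself can start with something as plain as a handoff record. The sketch below is one possible shape, with hypothetical field names: it captures who initiated the request, which agent passed what to which agent, and who approved the transition, then checks whether the chain can be reconstructed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Handoff:
    """One transition between agents (field names are illustrative)."""

    request_id: str
    initiated_by: str            # the human or system that started the request
    from_agent: str
    to_agent: str
    data_ref: str                # pointer to the data that crossed the boundary
    approved_by: Optional[str] = None  # None means nobody approved this transition


def unapproved_handoffs(chain: List[Handoff]) -> List[Handoff]:
    """Return every transition that no human signed off on."""
    return [h for h in chain if h.approved_by is None]


def is_contiguous(chain: List[Handoff]) -> bool:
    """Each step must start at the agent where the previous step ended."""
    return all(a.to_agent == b.from_agent for a, b in zip(chain, chain[1:]))


chain = [
    Handoff("req-42", "bob@example.com", "gatherer", "evaluator", "blob://findings-1", "carol@example.com"),
    Handoff("req-42", "bob@example.com", "evaluator", "actor", "blob://recommendation-1"),
]
print(is_contiguous(chain), len(unapproved_handoffs(chain)))
```

Here the second handoff has no approver, which is exactly the case where a recommendation quietly becomes an action.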
4. If personal information is involved, the compliance work has already started
This is where a lot of teams fool themselves. They think they are making a tooling decision. Often they are making a data handling decision first.
In Australia, the OAIC says the Privacy Act applies to uses of AI involving personal information, expects privacy by design, and points organisations toward privacy impact assessments, transparency, data minimisation, and stronger safeguards where AI is used in decisions with legal or similarly significant effects on a person's rights. That's the bit people skip. They focus on whether the output looks good. They spend much less time on what data went in, what the model can infer from it, who reviews the output, how errors get corrected, and what a person can do when the system gets something important wrong.
If you're operating in New Zealand, the government's AI strategy may be light touch, but Bell Gully still points businesses back to privacy, consumer, human rights, and governance obligations under existing law. So map the path properly: inputs, inferences, outputs, review points, retention, challenge mechanisms. If you cannot explain that flow in plain English, you're not ready to automate it.
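A data flow map does not need special tooling. As a rough sketch, it can be a dictionary with one plain-English answer per stage and a check that refuses to pass while any stage is blank. The stage names mirror the list above; the function and the example answers are hypothetical, and passing this check is not a substitute for a privacy impact assessment.

```python
from typing import Dict, List

REQUIRED_STAGES = (
    "inputs",
    "inferences",
    "outputs",
    "review_points",
    "retention",
    "challenge_mechanisms",
)


def missing_stages(flow_map: Dict[str, str]) -> List[str]:
    """Return the stages that have no plain-English description yet."""
    return [stage for stage in REQUIRED_STAGES if not flow_map.get(stage, "").strip()]


# Hypothetical, partially completed map for a support-triage agent.
support_triage = {
    "inputs": "Customer name, email, and ticket text from the help desk.",
    "inferences": "Urgency and likely product area.",
    "outputs": "A draft reply and a suggested priority.",
    "review_points": "A support lead reviews every draft before it is sent.",
    "retention": "",
}

print(missing_stages(support_triage))
```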
5. If you cannot roll it back, you cannot roll it out
Most teams can stop an agent. Far fewer can undo what it already did.
If an agent sent 200 emails, updated 40 CRM records, or closed 12 support tickets before anyone noticed, hitting the kill switch stops future damage. It does nothing for the damage already done. Rolling back those actions means knowing what changed, which systems were touched, which records were created or modified, and having a way to revert each one without making things worse.
This is where the tooling gap gets uncomfortable. Most agent platforms will tell you whether a step completed. Very few will tell you what it changed, or give you a one-click way to reverse it. That means rollback planning is on you. Before an agent goes near production, you should be able to answer: which of its actions are reversible, and how? Which require manual cleanup? Which are effectively permanent, and is that acceptable?
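Those three questions map onto a change ledger: record the before and after state of everything the agent touches, and classify each action by how it can be undone. The following is a minimal sketch under my own assumptions about names and categories, not a feature of any agent platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Reversibility(Enum):
    AUTOMATIC = "automatic"   # can be reverted from the stored "before" state
    MANUAL = "manual"         # needs a human to clean up
    PERMANENT = "permanent"   # cannot be undone, e.g. an email already sent


@dataclass(frozen=True)
class ActionRecord:
    system: str
    record_id: str
    before: Optional[dict]    # state prior to the change, None if newly created
    after: dict
    reversibility: Reversibility


def rollback_plan(ledger: List[ActionRecord]) -> Dict[Reversibility, List[ActionRecord]]:
    """Group actions by how they can be undone, newest first within each group."""
    plan: Dict[Reversibility, List[ActionRecord]] = {r: [] for r in Reversibility}
    for record in reversed(ledger):
        plan[record.reversibility].append(record)
    return plan


ledger = [
    ActionRecord("crm", "cust-17", {"tier": "gold"}, {"tier": "silver"}, Reversibility.AUTOMATIC),
    ActionRecord("helpdesk", "tick-9", {"status": "open"}, {"status": "closed"}, Reversibility.MANUAL),
    ActionRecord("email", "msg-301", None, {"to": "customer@example.com"}, Reversibility.PERMANENT),
    ActionRecord("crm", "cust-18", {"tier": "gold"}, {"tier": "bronze"}, Reversibility.AUTOMATIC),
]

plan = rollback_plan(ledger)
print({kind.value: len(records) for kind, records in plan.items()})
```

The permanent bucket is the one to look at before launch. If it contains anything you would not accept losing control of, that action belongs behind a human approval step.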
The compliance angle matters here too. Under the Privacy Act, individuals have a right to have inaccurate personal information about them corrected. If an agent changed a customer record incorrectly, can you identify the change, revert it, and notify the person affected? If you cannot, you have a compliance problem on top of an operational one.
Containment and logging, covered in the first two points, make rollback possible. But they do not make it automatic. Treat rollback as its own requirement, not as something that falls out of good monitoring by accident.
For the operating model behind these five points, see Agentic AI Governance: The Gap Between Frameworks and Reality. If you're earlier in the process and weighing whether to adopt AI tools at all, start with Adding AI to Your Company: Risks and Opportunities.
Related
- Agentic AI Governance for the governance gap between frameworks and reality
- Autonomy Is the Threat Model for why agents need least agency