Where to Draw the AI Line in Production Operations

How much of your production operations would you hand to an AI system, and where exactly would you draw the line?

That question is no longer theoretical. AI-driven observability is now standard kit across enterprise estates, and vendor pitch decks have moved on from "intelligent alerting" to autonomous remediation, agentic incident response, and AI-managed reliability. The decision sitting in front of engineering leaders is not whether to use AI in operations. It is where the cut line should sit, who owns the decisions that cross it, and what your organisation loses if that line moves to the wrong place.

This post is the leadership-track companion to our practitioner field notes on AI-augmented SRE. That piece is for engineers running on-call. This one is for the people deciding how on-call gets run.

Why this is now a leader's decision, not a vendor's

The default trajectory in the AIOps market is toward more autonomy. The vendors selling you observability tools have a commercial interest in your AI doing more, your humans doing less, and your renewal value going up. Every demo you sit through is calibrated to make autonomous remediation look like the obvious destination.

It is not. Or rather, it is for some classes of action and emphatically not for others. The choice of which is which sits with you, not with the vendor. Getting it wrong tends not to show up in quarterly reviews. It shows up the first time the AI confidently does the wrong thing in production and no human was in the loop to catch it.

So a clear framework helps. Here is the one we use with clients.

The framework: reversible versus irreversible

The cleanest cut line we have found is reversibility.

Reversible information flows belong to AI. Detecting anomalies across hundreds of thousands of metric series. Correlating deployments, feature flags, configuration changes, and traffic shifts during an incident to compress a forty-cause candidate list down to four. Summarising past incident patterns to give an on-call engineer relevant context in under a minute. Drafting a post-incident timeline from raw event data so senior engineers can focus on the systemic learning rather than the transcription.

The common feature is asymmetric cost. A wrong AI output in any of these categories costs you a few wasted minutes of investigation. A right AI output saves you hours and catches problems earlier. The expected value is positive and the downside is bounded.

Irreversible state changes in production belong to humans. Restarting pods, draining nodes, rolling back releases, cutting traffic, scaling production pools, disabling features. None of these is strictly impossible to undo, but each reaches into running systems and changes their behaviour in ways that can cascade. Here the cost asymmetry flips: a missed action is usually recoverable in minutes once a human notices, while a wrong action in a complex system can take hours to unwind and may produce side effects you do not understand until the next incident.

AI proposes. Humans dispose. The split is not a temporary limitation to be engineered away as the models improve. It is the correct architecture for systems where the cost of being confidently wrong is high.
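
As a minimal sketch of what "AI proposes, humans dispose" could look like as a policy gate, the Python below classifies each AI-proposed action as an information flow or a production state change and holds the second kind until a named human approves it. Every name here (Kind, ProposedAction, decide, the action labels) is hypothetical and for illustration only; it is not any vendor's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    INFORMATION = "information"    # detect, correlate, summarise, draft
    STATE_CHANGE = "state_change"  # restart, drain, roll back, cut traffic


@dataclass(frozen=True)
class ProposedAction:
    name: str   # hypothetical label for the action the AI wants to take
    kind: Kind


def decide(action: ProposedAction, approved_by: Optional[str] = None) -> str:
    """Return "run" or "hold_for_human" for an AI-proposed action."""
    if action.kind is Kind.INFORMATION:
        return "run"
    # State changes never run on the AI's say-so alone.
    return "run" if approved_by else "hold_for_human"


summary = ProposedAction("draft_incident_timeline", Kind.INFORMATION)
rollback = ProposedAction("roll_back_release", Kind.STATE_CHANGE)

print(decide(summary))                        # run
print(decide(rollback))                       # hold_for_human
print(decide(rollback, approved_by="alice"))  # run
```

The gate deliberately keys on the kind of action, not on how confident the model says it is. A confident proposal to roll back a release still waits for a person.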

The harder problem: novel failures

The framework above handles the obvious cases. The hard ones sit in the middle.

AI systems are well-calibrated on failure modes they have seen before. The trouble starts with novel incidents. We have watched AI root cause analysis confidently report a familiar pattern (database contention, network saturation, garbage collection) on incidents whose actual cause was something the model had no training signal for. The output is fluent, confident, and wrong. Teams chase it for thirty or forty minutes before a senior engineer says "this doesn't smell right" and breaks the chain.

That instinct, the practiced senior judgement that pattern-matches across years of incidents and notices when the AI explanation is too tidy, is the most valuable diagnostic sensor in your stack. It is also the one most easily eroded by AI confidence theatre.

The risk for engineering leaders is subtle. It is not that AI will catastrophically fail and bring down production. It is that AI will be plausibly wrong often enough that your junior engineers stop questioning it, your senior engineers stop being consulted on the strange incidents, and your team's diagnostic muscle quietly atrophies. You will not notice the deficit until you face an incident the AI cannot pattern-match and discover that nobody on the team knows how to think about it any more.

The mitigation is structural, not technical. Senior engineers must remain in the loop on novel incidents specifically, even when the AI has produced a confident-sounding summary. The on-call rotation should be designed so that experience flows into the strange ones, not just the familiar ones.
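
One way to make that structural rule explicit is an escalation check that does not depend on the AI's own confidence. The sketch below is illustrative only: the Incident fields, the needs_senior function, and the 15-minute threshold are all assumptions, not a description of any particular paging tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    matched_past_incidents: int   # similar incidents found in history (assumed signal)
    minutes_open: int
    hypothesis_confirmed: bool    # has a human verified the AI's explanation?
    human_flagged_doubt: bool = False


def needs_senior(incident: Incident, max_unconfirmed_minutes: int = 15) -> bool:
    """Decide whether a senior engineer should be pulled in.

    The AI's stated confidence is deliberately not an input.
    """
    if incident.human_flagged_doubt:
        return True   # someone said "this doesn't smell right"
    if incident.matched_past_incidents == 0:
        return True   # nothing like it in the incident history
    # The AI explanation has gone unverified for too long.
    return (not incident.hypothesis_confirmed
            and incident.minutes_open >= max_unconfirmed_minutes)


print(needs_senior(Incident(3, 5, hypothesis_confirmed=False)))   # False
print(needs_senior(Incident(3, 20, hypothesis_confirmed=False)))  # True
print(needs_senior(Incident(0, 2, hypothesis_confirmed=False)))   # True
```

The point of the design is that a fluent summary cannot suppress escalation. Only a human confirming the hypothesis, or the incident matching known history and resolving quickly, keeps the senior engineer out of it.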

What this means for how you build the team

The single most useful framing we offer engineering leaders is this: AI in operations is a force multiplier for your senior engineers, not a substitute for them.

The value of well-implemented AIOps is not that it replaces hiring. It is that it removes the toil that prevents your senior engineers from doing the work only they can do. They stop drowning in alert review and start spending time on architecture, on coaching, and on the genuinely hard incidents.

The teams that struggle are the ones that interpret the same AI capability as a hiring deferral. They cut SRE headcount on the assumption the platform now does the work. Then a novel incident hits, the AI is confidently wrong, and the team discovers that the seniors who would have caught it are no longer in the building. By the time anyone realises this, reliability has already degraded in ways that do not show up on a dashboard.

If you are an engineering leader sizing your reliability investment, the right question is not "can AI let us run with fewer senior engineers" but "can AI let our senior engineers spend more of their time on the work that matters." The answer to the first is mostly no. The answer to the second is mostly yes.

Practical questions to ask

When you are evaluating AIOps capability or auditing what you already have, the framework above translates into questions you can put directly to your team or your vendor:

For each AI-driven capability, what does failure look like? Reversible failures are fine. Anything that can take an irreversible action without a human in the loop needs a very high bar of justification.
What is the team's calibrated trust in the AI for novel incidents? If the answer is "we trust it", that is itself the problem. The right answer is "we treat it as a hypothesis generator."
Are senior engineers still seeing the strange incidents? Or has the AI become the first responder on everything, with seniors only pulled in when humans escalate?
What is the cost of a wrong automated action versus a missed automated action? Have your team and vendor both costed this honestly? Auto-remediation that fixes nine real incidents and breaks one production system in a way that costs you a day of customer trust is not a win.
What does the team actually do with the time AI saves them? If the answer is mostly "they do less", you bought the wrong tool. If the answer is "they invest the time in reliability work the AI cannot do", you bought the right one.
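
The wrong-versus-missed costing question above can be put into rough numbers. The sketch below uses the nine-fixed, one-broken example; the figures of 30 minutes saved per correct remediation and 480 minutes (one working day) lost to the bad one are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def net_minutes(n_correct: int, saved_each: float,
                n_wrong: int, lost_each: float) -> float:
    """Minutes saved by correct automated actions minus minutes lost to wrong ones."""
    return n_correct * saved_each - n_wrong * lost_each


def break_even_wrong_rate(saved_each: float, lost_each: float) -> float:
    """Fraction of automated actions that can be wrong before the net goes negative.

    Solves (1 - p) * saved_each - p * lost_each = 0 for p.
    """
    return saved_each / (saved_each + lost_each)


# Illustrative figures only: 30 minutes saved per good fix, 480 lost per bad one.
print(net_minutes(9, 30, 1, 480))                # -210
print(round(break_even_wrong_rate(30, 480), 3))  # 0.059
```

On those assumed figures, nine successes do not pay for one failure, and the automation only breaks even if fewer than about six in a hundred actions go wrong. Substitute your own numbers; the shape of the calculation is what matters.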

The bottom line

The AI line in your production operations should sit at reversibility, with humans on the irreversible side and senior judgement deliberately kept in the loop on novel failures. That cut line is not a stopgap until the models improve. It is the correct architecture for production systems where confidence and accuracy are not the same thing.

AI in operations is genuinely valuable. It is also genuinely dangerous to delegate too far. The difference is leadership.

If you are weighing how to draw that line in your own operations, get in touch. We have helped Australian enterprises sort the genuine wins from the vendor theatre, and we are happy to do the same for yours.