Skip to content

Your AI vendor is now a critical infrastructure provider. Are you treating them like one?

Most companies building on AI APIs have no SLA, no fallback, and no DR plan for their LLM provider. That needs to change.

There's a pattern that keeps coming up whenever I look at how engineering teams are managing their AI dependencies. Teams with tight vendor management programmes, formal SLAs with every SaaS tool they touch, and well-practised runbooks for cloud provider outages... have nothing for their LLM provider. I covered why that matters in Adding AI to your SaaS, but the gap is wider than most teams realise.

Not a fallback model. Not a circuit breaker. Not even a line in the incident response plan.

It's a strange blind spot. When your database goes down, you fail over. When your CDN falls over, you route around it. When your LLM goes down? Your product goes down with it. For however long the provider takes to recover. And you find out the same way your users do.

The uptime numbers are genuinely bad

OpenAI offers no uptime SLA on standard API access. You can get a 99.9% SLA, but only on their Scale Tier: reserved capacity, minimum 30-day commitment. Most teams building on the API are on pay-as-you-go. No commitment, no SLA.

That might feel theoretical until you look at the incident history. In December 2024, a botched telemetry rollout took down ChatGPT, the API, and Sora for over four hours by overwhelming the Kubernetes control plane. Then in June 2025, a global outage lasted 34 hours. All products. All regions. 34 hours.

OpenAI isn't alone. Anthropic's status page shows over 1,100 reported incidents in roughly two years. Claude API uptime for March 2025 (below) was 98.21% availability — over 13 hours of downtime in a single month. For comparison, a 99.9% SLA allows just 43 minutes of downtime per month. Claude was down 18× longer than what most teams consider acceptable for a critical vendor. (source)

Anthropic status page showing incident history

Google's Vertex Gemini API status history shows dozens of documented incidents over the past year affecting Gemini models and AI Studio.

The comparison that matters: a 99.9% SLA, which is what you get from AWS Bedrock or Azure OpenAI, translates to roughly 43 minutes of downtime per month. The measured uptime for direct API providers like OpenAI sits closer to 99.3%, which is five-plus hours of downtime per month. That's a seven-times gap against what your other critical vendors contractually guarantee.

Why this isn't normal third-party risk

In my experience, this is where teams get caught out. Traditional SaaS vendors go down in fragments. A region, a feature, a specific integration. LLM providers tend to fail globally: when the API is down, it's down for everyone, everywhere, at once. There's no geographic failover you can request. No redundant endpoint to switch to.

There's also no graceful degradation by default. If your CRM has a slow API, your sales tool gets sluggish. If your LLM returns a 503, your AI feature just stops. No partial functionality, no degraded mode, unless you built one deliberately.

And the blast radius keeps growing. AI isn't a nice-to-have layer for most teams building on these APIs. For a lot of them, it is the product. As I wrote in Compare LLM Model vs LLM Service, the risk profile of the model and the provider running it are different, and both need managing.

Then there's the risk that never shows up on a status page: model deprecation and behavioural drift. When a provider retires a model version or quietly ships an update, your integration might keep responding to requests just fine while producing subtly different outputs to the ones you designed for. OpenAI rolled back a GPT-4o update in April 2025 after users found the model had shifted to excessively agreeable, validating behaviour, endorsing harmful inputs that earlier versions had pushed back on. The root cause was a training change. The model was still up. The API was still responding. But what it was doing had changed in ways that mattered to users. There's no equivalent in traditional SaaS. Your CRM doesn't one day start responding to queries with a different personality.

Most teams have no plan

Most vendor management programmes have no entry for the LLM provider. No SLA negotiated, no incident communication expectations set, no fallback obligations defined. It doesn't appear on the asset register as a critical dependency. For a lot of teams, it barely appears in the architecture diagram. That's increasingly hard to justify.

Multi-AI is the new multi-cloud

This has a precedent. In the early 2010s, running everything on a single cloud provider was normal practice. Then the outages compounded, and the industry gradually accepted that single-provider dependency was an architecture problem, not just an operations nuisance. Multi-cloud became hygiene: distribute across providers, abstract the infrastructure layer, stop treating one vendor's availability as your own.

I think the same reckoning is coming for AI providers. The barrier is meaningfully lower than it was for cloud. Migrating a cloud provider means re-platforming infrastructure, re-validating configurations, often months of careful work. Swapping an LLM provider, if you've designed for it, is often just a config change. LLM gateways like LiteLLM and PortKey already provide the abstraction layer: a single API surface that routes to whichever provider is available and responding. It's the same pattern as cloud-agnostic tooling in Terraform and Kubernetes, just arriving much faster.

Multi-cloud took about a decade to become architecture hygiene. Multi-AI won't take that long. The outage history is already there. The tooling is already there. What's missing is the decision to treat AI providers as infrastructure and manage them accordingly. I laid out what that governance looks like in Agentic AI Governance: The Gap Between Frameworks and Reality.

What actually helps

Start with the governance layer. Treat your LLM provider as a critical vendor in your third-party risk programme. That means a formal assessment: What's their SLA (or confirmed absence of one)? What's their incident communication process? What's your contractual recourse if they go dark for a day and a half?

Then build the technical layer:

Circuit breaker on all LLM calls. When the provider starts returning errors above your threshold, stop hitting them and return a controlled failure. This prevents retry storms where user retries amplify load on an already-struggling service, turning a partial outage into a total one from your users' perspective.

A fallback model chain. Not necessarily full capability, but something. Claude Sonnet to Claude Haiku to a cached response, or a rule-based fallback that at least tells the user something useful. A defined degraded mode beats silent failure every time.

Pin model versions explicitly. When calling an LLM API, specify the exact model version rather than a 'latest' alias. This gives you time when a provider upgrades: your integration keeps running on the pinned version while you test and validate the behaviour of the new one. Treat model version bumps the same way you'd treat a major dependency upgrade in your codebase: deliberate, tested, not just absorbed. I wrote about the same discipline for MCP server dependencies in Everyone's Installing MCP Servers from GitHub. Nobody's Checking What They Do.

Provider diversification where the risk warrants it. Two independent providers at 99.3% uptime each give you a combined theoretical uptime of roughly 99.995%. The maths is hard to argue with. In practice: route tasks across Anthropic and OpenAI so a single provider outage doesn't take everything with it. Or use OpenRouter to access the same model from multiple providers through a single API endpoint, so switching providers under load is a routing decision, not an engineering project.

None of this requires a massive architectural effort. Most of it is incremental. The harder part is accepting that your LLM provider already is critical infrastructure, and that your vendor management programme needs to catch up to that reality.

Close that gap before a 34-hour outage closes it for you.


If you're evaluating the risks of depending on a single AI provider, start with Compare LLM Model vs LLM Service. For practical steps before deploying AI agents, see Five things to get right before deploying AI agents.

Olivier Reuland