SLA & Performance Baselines

Tier 2 DEPLOY

NIST MG-1

What This Requires

Define SLAs for AI systems (uptime, latency, error rate) and establish performance baselines. Monitor against SLAs and alert on violations. Review and adjust baselines quarterly based on actual performance.

Why It Matters

Without SLAs, teams lack objective criteria for success. Baselines enable detection of performance degradation (model drift, infrastructure issues).

How To Implement

Define SLAs

For each AI system, set SLA targets for uptime (e.g., 99.9%), p99 latency (e.g., <500 ms), error rate (e.g., <1%), and throughput (e.g., at least 100 QPS). Align the targets to business requirements.
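One way to keep these targets reviewable is to record them as data per system. The sketch below is illustrative only: the `SLA` class, its field names, and the system name `fraud-scoring` are hypothetical, and the default values simply mirror the example figures above. It also shows the downtime budget that an uptime target implies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLA:
    """SLA targets for one AI system (hypothetical structure, example values)."""
    system: str
    min_uptime_pct: float = 99.9
    max_latency_p99_ms: float = 500.0
    max_error_rate_pct: float = 1.0
    min_throughput_qps: float = 100.0

    def allowed_downtime_minutes(self, period_days: int = 30) -> float:
        """Downtime budget implied by the uptime target over the period."""
        total_minutes = period_days * 24 * 60
        return total_minutes * (1 - self.min_uptime_pct / 100)


# "fraud-scoring" is a made-up system name used only for illustration.
sla = SLA(system="fraud-scoring")
print(f"{sla.system}: {sla.allowed_downtime_minutes():.1f} min downtime per 30 days")
```

Expressing the uptime target as a downtime budget (about 43 minutes per 30 days at 99.9%) can make the trade-off easier to discuss with business owners.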

Baseline Establishment

Run load tests and collect 2 weeks of production metrics. Calculate the baseline for each metric: median (p50) and p99. Document the results in the runbook.
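The baseline calculation can be sketched as follows, assuming latency samples are available as a plain list of milliseconds. The function names are hypothetical, and the nearest-rank percentile definition is one of several common conventions; monitoring tools may interpolate and report slightly different values.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples provided")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to avoid floating-point rounding in the rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_baseline(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize a collection window as the median (p50) and p99."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic samples for illustration only, not real measurements.
print(latency_baseline([float(v) for v in range(1, 101)]))
```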

Monitoring & Alerting

Configure alerts for SLA violations: uptime <99.9%, latency p99 >500ms, error rate >1%. Alert on-call via PagerDuty/Opsgenie.
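The threshold checks above can be sketched as a small evaluation function. This is a minimal illustration, not a real alerting integration: the metric names and the `sla_violations` helper are hypothetical, treating missing data as a violation is an assumed policy, and delivery to PagerDuty or Opsgenie is deliberately left out.

```python
from typing import Dict, List, Tuple

# Thresholds taken from the example SLA values; "min" means the metric must
# stay at or above the limit, "max" means it must stay at or below it.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "uptime_pct": ("min", 99.9),
    "latency_p99_ms": ("max", 500.0),
    "error_rate_pct": ("max", 1.0),
}


def sla_violations(metrics: Dict[str, float]) -> List[str]:
    """Return a human-readable message for each breached threshold."""
    violations = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            # Assumption: a missing metric is treated as a violation.
            violations.append(f"{name}: no data")
            continue
        breached = value < limit if kind == "min" else value > limit
        if breached:
            violations.append(f"{name}={value} breaches {kind} {limit}")
    return violations


# Illustrative values only.
current = {"uptime_pct": 99.95, "latency_p99_ms": 620.0, "error_rate_pct": 0.4}
for message in sla_violations(current):
    print("ALERT:", message)
```

In practice the returned messages would be handed to whatever paging tool the team already uses.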

Quarterly Review

Review SLA performance against the baseline. Adjust the baseline if needed (e.g., baseline latency increased due to added features). Document each change in the change log.
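A quarterly comparison could be sketched as below: it flags metrics whose observed value has moved away from the documented baseline by more than a tolerance. The `baseline_shift` helper and the 10% tolerance are assumptions chosen for illustration, not values prescribed by this control.

```python
from typing import Dict


def baseline_shift(
    baseline: Dict[str, float],
    observed: Dict[str, float],
    tolerance_pct: float = 10.0,
) -> Dict[str, float]:
    """Return the percent change for each metric that moved beyond the tolerance."""
    shifts = {}
    for name, base in baseline.items():
        if name not in observed or base == 0:
            # Skip metrics that cannot be compared.
            continue
        change_pct = (observed[name] - base) / base * 100
        if abs(change_pct) > tolerance_pct:
            shifts[name] = round(change_pct, 1)
    return shifts


# Illustrative numbers only.
documented = {"p50": 120.0, "p99": 400.0}
this_quarter = {"p50": 126.0, "p99": 480.0}
print(baseline_shift(documented, this_quarter))
```

A flagged metric is a prompt for review, not an automatic baseline change; the reviewer still decides whether the shift is explained (for example, by added features) before updating the runbook and change log.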

Evidence & Audit

SLA definitions per AI system
Baseline establishment methodology and results
Monitoring dashboards showing SLA metrics
Alert configuration and incident history
Quarterly review records with baseline adjustments

Related Controls

AI Monitoring Dashboard
Model Drift Detection
Canary/Blue-Green Deployment