SLA & Performance Baseline Document
Purpose
Defines SLA targets, performance baselines, and monitoring thresholds for AI systems.
Related Controls: [LIST RELATED CONTROL IDs]
1. System Information
Identify the AI system and its operational context.
System Name: [SYSTEM NAME]
Version: [VERSION]
Owner: [NAME], [ROLE TITLE]
Document Date: [DATE]
Baseline Period: [START DATE] to [END DATE]
Environment: Production
Business Criticality: Low / Medium / High / Critical
2. SLA Definitions
Define service level agreements with measurable targets.
| Metric | Definition | Target | Measurement Method | Reporting Frequency |
|---|---|---|---|---|
| Availability | Percentage of time the system is operational | 99.9% (≤ 8.76h downtime/year) | Health check endpoint monitoring | Monthly |
| Response Time (p50) | Median response latency | < 500ms | Application performance monitoring | Daily |
| Response Time (p99) | 99th percentile response latency | < 2000ms | Application performance monitoring | Daily |
| Error Rate | Percentage of requests returning errors | < 0.5% | Error tracking / logging | Real-time |
| Output Quality | Percentage of outputs meeting accuracy threshold | > 95% | Automated evaluation + human sampling | Weekly |
| Throughput | Sustained request capacity | [X] req/min | Load testing / monitoring | Monthly |
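The latency, error-rate, and availability metrics above can be computed directly from raw request measurements. A minimal sketch using only the standard library (the function name and input shapes are illustrative, not part of any prescribed tooling):

```python
from statistics import quantiles


def sla_metrics(latencies_ms, error_count, total_requests, up_seconds, period_seconds):
    """Compute the SLA metrics from Section 2 out of raw measurements.

    latencies_ms:   per-request latencies observed in the period
    error_count:    requests that returned errors
    up_seconds:     seconds the health check reported the system as up
    period_seconds: total seconds in the measurement period
    """
    # quantiles(n=100) returns the 99 percentile cut points;
    # index 49 is p50 (median), index 98 is p99.
    cuts = quantiles(latencies_ms, n=100)
    return {
        "p50_ms": cuts[49],
        "p99_ms": cuts[98],
        "error_rate_pct": 100.0 * error_count / total_requests,
        "availability_pct": 100.0 * up_seconds / period_seconds,
    }
```

The same function can be reused for Section 3: run it over the baseline period's data to fill in the baseline table.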
3. Baseline Measurements
Record actual performance measurements during the baseline period.
| Metric | Baseline Value | Measurement Period | Method | Confidence |
|---|---|---|---|---|
| Availability | [X]% | [PERIOD] | [METHOD] | High / Medium |
| Response Time (p50) | [X]ms | [PERIOD] | [METHOD] | |
| Response Time (p99) | [X]ms | [PERIOD] | [METHOD] | |
| Error Rate | [X]% | [PERIOD] | [METHOD] | |
| Output Quality | [X]% | [PERIOD] | [METHOD] | |
| Throughput | [X] req/min | [PERIOD] | [METHOD] | |
Baseline Notes: [Any anomalies, data gaps, or caveats about baseline measurements]
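When comparing baseline availability against the Section 2 target, it helps to convert the percentage into a concrete downtime budget for the period. A small illustrative helper (the function name is an assumption, not part of any required tooling):

```python
def downtime_budget_hours(availability_pct, period_hours=8760):
    """Allowed downtime (in hours) implied by an availability target.

    period_hours defaults to one non-leap year (365 * 24 = 8760).
    """
    return period_hours * (1 - availability_pct / 100.0)


# e.g. a 99.9% target allows roughly 8.76 hours of downtime per year,
# or about 43.8 minutes per 30-day month (period_hours=720).
```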
4. Monitoring & Alerting Configuration
Define alert thresholds and notification channels for each metric.
| Metric | Warning Threshold | Critical Threshold | Alert Channel | Escalation |
|---|---|---|---|---|
| Availability | < 99.95% (5min window) | < 99.9% (5min window) | PagerDuty + Slack | On-call → Team Lead (15min) |
| Response Time (p99) | > 1500ms | > 3000ms | Slack | Team Lead (30min) |
| Error Rate | > 1% | > 5% | PagerDuty + Slack | On-call → Team Lead (10min) |
| Output Quality | < 93% | < 90% | Email + Slack | System Owner (1hr) |
| Throughput | < 80% of baseline | < 50% of baseline | Slack | On-call → Team Lead (15min) |
Alert Suppression: Scheduled maintenance windows suppress non-critical alerts. Critical alerts are never suppressed.
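The warning/critical logic in the table above can be expressed as a simple classifier. The sketch below covers the absolute thresholds (the throughput row, which is relative to baseline, would need the baseline value as an extra input); the dictionary keys and structure are illustrative assumptions:

```python
# Hypothetical threshold table mirroring Section 4. "direction" records
# which side of the threshold is bad: "low" means lower values breach
# (availability, quality); "high" means higher values breach (latency, errors).
THRESHOLDS = {
    "availability_pct":   {"warning": 99.95, "critical": 99.9, "direction": "low"},
    "p99_ms":             {"warning": 1500,  "critical": 3000, "direction": "high"},
    "error_rate_pct":     {"warning": 1.0,   "critical": 5.0,  "direction": "high"},
    "output_quality_pct": {"warning": 93.0,  "critical": 90.0, "direction": "low"},
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    t = THRESHOLDS[metric]
    breached = (lambda limit: value < limit) if t["direction"] == "low" \
        else (lambda limit: value > limit)
    if breached(t["critical"]):
        return "critical"
    if breached(t["warning"]):
        return "warning"
    return "ok"
```

In practice this logic would live in the monitoring system's rule configuration rather than application code; the sketch just makes the table's semantics explicit.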
5. Quarterly Review Log
Track quarterly SLA performance reviews and adjustments.
| Quarter | SLAs Met | SLAs Missed | Adjustments Made | Reviewer | Date |
|---|---|---|---|---|---|
| Q1 [YEAR] | [LIST] | [LIST] | [ADJUSTMENTS] | [NAME] | [DATE] |
| Q2 [YEAR] | | | | | |
| Q3 [YEAR] | | | | | |
| Q4 [YEAR] | | | | | |
Review Process
- Collect SLA performance data for the quarter
- Identify any breaches and root causes
- Assess whether targets are still appropriate (too lenient or too aggressive)
- Propose adjustments based on operational experience and business needs
- Document decisions and update this document
- Present summary to AI Governance Committee
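The first two review steps, collecting quarterly data and identifying breaches, can be sketched as a comparison of measured values against the Section 2 targets. The target values and metric names below are taken from this document; the function itself is an illustrative assumption:

```python
# Targets from Section 2 (throughput omitted: its target is deployment-specific).
TARGETS = {
    "availability_pct": 99.9,
    "p99_ms": 2000.0,
    "error_rate_pct": 0.5,
    "output_quality_pct": 95.0,
}
# Metrics where a lower measured value is better than the target.
LOWER_IS_BETTER = {"p99_ms", "error_rate_pct"}


def quarterly_review(measured):
    """Split metrics into (met, missed) lists for the review log."""
    met, missed = [], []
    for metric, target in TARGETS.items():
        value = measured[metric]
        ok = value <= target if metric in LOWER_IS_BETTER else value >= target
        (met if ok else missed).append(metric)
    return met, missed
```

The resulting lists map directly onto the "SLAs Met" and "SLAs Missed" columns of the quarterly review table.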