SLA & Performance Baseline Document

Document DEPLOY

Purpose

Defines SLA targets, performance baselines, and monitoring thresholds for AI systems.

Related Controls

NIST MG-1

1. System Information

Identify the AI system and its operational context.

System Name: [SYSTEM NAME]

Version: [VERSION]

Owner: [NAME], [ROLE TITLE]

Document Date: [DATE]

Baseline Period: [START DATE] to [END DATE]

Environment: Production

Business Criticality: Low / Medium / High / Critical

2. SLA Definitions

Define service level agreements with measurable targets.

| Metric | Definition | Target | Measurement Method | Reporting Frequency |
|---|---|---|---|---|
| Availability | Percentage of time system is operational | 99.9% (about 8.76 h downtime/year) | Health check endpoint monitoring | Monthly |
| Response Time (p50) | Median response latency | < 500 ms | Application performance monitoring | Daily |
| Response Time (p99) | 99th percentile response latency | < 2000 ms | Application performance monitoring | Daily |
| Error Rate | Percentage of requests returning errors | < 0.5% | Error tracking / logging | Real-time |
| Output Quality | Percentage of outputs meeting accuracy threshold | > 95% | Automated evaluation + human sampling | Weekly |
| Throughput | Sustained request capacity | [X] req/min | Load testing / monitoring | Monthly |
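The downtime allowance implied by an availability target is simple arithmetic: the period length multiplied by the permitted unavailable fraction. The sketch below shows that conversion for the 99.9% target. The function name `downtime_budget_hours` and the 365-day year and 30-day month are assumptions made for illustration, not part of this template.

```python
def downtime_budget_hours(availability_pct: float, period_hours: float = 365 * 24) -> float:
    """Return the downtime (in hours) allowed over a period for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


# Assumed period lengths: a 365-day year and a 30-day month.
yearly_hours = downtime_budget_hours(99.9)
monthly_hours = downtime_budget_hours(99.9, period_hours=30 * 24)

print(f"{yearly_hours:.2f} h/year, {monthly_hours * 60:.1f} min/month")
```

Because Availability is reported monthly, the monthly figure is usually the more practical budget to track.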

3. Baseline Measurements

Record actual performance measurements during the baseline period.

| Metric | Baseline Value | Measurement Period | Method | Confidence |
|---|---|---|---|---|
| Availability | [X]% | [PERIOD] | [METHOD] | High / Medium |
| Response Time (p50) | [X] ms | [PERIOD] | [METHOD] | |
| Response Time (p99) | [X] ms | [PERIOD] | [METHOD] | |
| Error Rate | [X]% | [PERIOD] | [METHOD] | |
| Output Quality | [X]% | [PERIOD] | [METHOD] | |
| Throughput | [X] req/min | [PERIOD] | [METHOD] | |

Baseline Notes: [Any anomalies, data gaps, or caveats about baseline measurements]
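As one possible way to derive the latency and error-rate baselines from raw request records, the sketch below uses the nearest-rank percentile method. The `Request` record, the `baseline` function, and the sample data are hypothetical. Real APM tools may interpolate percentiles differently, so the method actually used should be recorded in the Method column.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request observed during the baseline period."""
    latency_ms: float
    is_error: bool


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(requests):
    """Summarize p50/p99 latency and error rate for a list of Request records."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.is_error)
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate_pct": 100 * errors / len(requests),
    }


# Made-up sample: 100 requests with latencies 10, 20, ..., 1000 ms, two of them errors.
sample = [Request(latency_ms=10 * i, is_error=(i in (7, 42))) for i in range(1, 101)]
result = baseline(sample)
print(result)
```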

4. Monitoring & Alerting Configuration

Define alert thresholds and notification channels for each metric.

| Metric | Warning Threshold | Critical Threshold | Alert Channel | Escalation |
|---|---|---|---|---|
| Availability | < 99.95% (5 min window) | < 99.9% (5 min window) | PagerDuty + Slack | On-call → Team Lead (15 min) |
| Response Time (p99) | > 1500 ms | > 3000 ms | Slack | Team Lead (30 min) |
| Error Rate | > 1% | > 5% | PagerDuty + Slack | On-call → Team Lead (10 min) |
| Output Quality | < 93% | < 90% | Email + Slack | System Owner (1 hr) |
| Throughput | < 80% of baseline | < 50% of baseline | Slack | On-call → Team Lead (15 min) |

Alert Suppression: Scheduled maintenance windows suppress non-critical alerts. Critical alerts are never suppressed.
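The thresholds and the suppression rule can be expressed as a small classification routine. The sketch below is illustrative only: the threshold values are copied from the table in this section, while the metric keys, the `Threshold` class, and the functions `severity` and `should_notify` are hypothetical names, not the API of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool  # True for latency and error rate, False for availability etc.


# Values from the table in this section; the dictionary keys are made-up names.
THRESHOLDS = {
    "availability_pct": Threshold(99.95, 99.9, higher_is_worse=False),
    "p99_latency_ms": Threshold(1500, 3000, higher_is_worse=True),
    "error_rate_pct": Threshold(1, 5, higher_is_worse=True),
    "output_quality_pct": Threshold(93, 90, higher_is_worse=False),
    "throughput_pct_of_baseline": Threshold(80, 50, higher_is_worse=False),
}


def _breached(value, limit, higher_is_worse):
    return value > limit if higher_is_worse else value < limit


def severity(metric, value):
    """Return 'critical', 'warning', or 'ok' for a metric reading."""
    t = THRESHOLDS[metric]
    if _breached(value, t.critical, t.higher_is_worse):
        return "critical"
    if _breached(value, t.warning, t.higher_is_worse):
        return "warning"
    return "ok"


def should_notify(metric, value, in_maintenance):
    """Apply the suppression rule: maintenance hides warnings, never criticals."""
    level = severity(metric, value)
    if level == "critical":
        return True
    return level == "warning" and not in_maintenance


print(severity("p99_latency_ms", 1800), should_notify("error_rate_pct", 6, in_maintenance=True))
```

Checking the critical threshold first means a reading that breaches both limits is reported once, at the higher severity.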

5. Quarterly Review Log

Track quarterly SLA performance reviews and adjustments.

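| Quarter | SLAs Met | SLAs Missed | Adjustments Made | Reviewer | Date |
|---|---|---|---|---|---|
| Q1 [YEAR] | [LIST] | [LIST] | [ADJUSTMENTS] | [NAME] | [DATE] |
| Q2 [YEAR] | | | | | |
| Q3 [YEAR] | | | | | |
| Q4 [YEAR] | | | | | |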

Review Process

  1. Collect SLA performance data for the quarter
  2. Identify any breaches and root causes
  3. Assess whether targets are still appropriate (too lenient or too aggressive)
  4. Propose adjustments based on operational experience and business needs
  5. Document decisions and update this document
  6. Present summary to AI Governance Committee
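
Steps 1 and 2 of the review process can be partly automated by comparing each quarter's measured values against the targets in section 2. The sketch below is one hedged way to produce the "SLAs Met" and "SLAs Missed" lists. The measured values are invented for illustration, Throughput is omitted because its target is still a placeholder, and treating the availability target as "at least 99.9%" is an assumption.

```python
import operator

# Targets from section 2. Each entry is (comparison the measured value must satisfy, target).
SLA_TARGETS = {
    "Availability": (operator.ge, 99.9),        # assumed to mean "at least 99.9%"
    "Response Time (p50)": (operator.lt, 500),  # ms
    "Response Time (p99)": (operator.lt, 2000), # ms
    "Error Rate": (operator.lt, 0.5),           # percent
    "Output Quality": (operator.gt, 95),        # percent
}


def review(measured):
    """Split SLA names into (met, missed) lists for one quarter's measurements."""
    met, missed = [], []
    for name, (compare, target) in SLA_TARGETS.items():
        (met if compare(measured[name], target) else missed).append(name)
    return met, missed


# Hypothetical quarterly figures, not real data.
q1_measured = {
    "Availability": 99.95,
    "Response Time (p50)": 420,
    "Response Time (p99)": 2300,
    "Error Rate": 0.3,
    "Output Quality": 94.0,
}

met, missed = review(q1_measured)
print("Met:", met)
print("Missed:", missed)
```

Root-cause analysis and the decision on whether targets remain appropriate (steps 2 to 4) still require human judgment; the script only fills in the first two columns of the log.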