SRE Error Budgets: Turning Theory into Practice
Error budgets are the core mechanism that allows SRE teams to balance reliability with feature velocity. Here’s how to implement them effectively.
What Exactly Is an Error Budget?
An error budget is the amount of unreliability you can afford over a given period. It is derived from your reliability target, the SLO (not the externally contracted SLA):
Error Budget = 1 - SLO
Example: a 99.9% SLO leaves a 0.1% error budget
If your service gets 1,000,000 requests per month:
Monthly Error Budget = 1,000,000 × 0.001 = 1,000 allowed failed requests
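The same arithmetic as a quick Python sketch (the request volume and the 30-day month are illustrative values):

slo = 0.999                    # 99.9% reliability target
monthly_requests = 1_000_000   # illustrative traffic volume

error_budget_fraction = 1 - slo                                   # 0.001
allowed_failures = monthly_requests * error_budget_fraction       # 1,000 failed requests
allowed_downtime_minutes = 30 * 24 * 60 * error_budget_fraction   # ~43.2 minutes of full outage

print(f"Allowed failed requests per month: {allowed_failures:.0f}")
print(f"Allowed full-outage downtime per 30-day month: {allowed_downtime_minutes:.1f} minutes")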
Defining Meaningful SLIs (Service Level Indicators)
Common SLI Patterns
# Prometheus queries for common SLIs
SLI_QUERIES = {
    "availability": """
        # HTTP availability: share of requests answered with 2xx or 3xx
        sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
    """,
    "latency": """
        # 95th percentile latency
        histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        )
    """,
    "throughput": """
        # Requests per second
        sum(rate(http_requests_total[5m]))
    """,
    "error_rate": """
        # Error percentage (5xx share of all requests)
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
    """,
}
SLI Selection Framework
Ask these questions for each candidate SLI:
- Is it user-facing? (Internal metrics don’t count)
- Is it measurable? (Can you collect it reliably?)
- Is it actionable? (Can you improve it when it degrades?)
- Is it stable? (Doesn’t fluctuate wildly under normal conditions)
Implementing Error Budget Calculation
Real-Time Error Budget Dashboard
# error_budget_calculator.py
from datetime import datetime

import requests


class ErrorBudgetCalculator:
    def __init__(self, sli_target: float, period_days: int = 30):
        self.sli_target = sli_target  # e.g., 0.999 for 99.9%
        self.period_days = period_days
        self.error_budget = 1 - sli_target

    def calculate_consumed_budget(self, prometheus_url: str) -> dict:
        """Calculate how much of the error budget has been consumed."""
        window = f"{self.period_days}d"

        # Total requests over the period
        total_requests = self._query_prometheus(
            prometheus_url,
            f'sum(increase(http_requests_total[{window}]))',
        )

        # Failed (5xx) requests over the period
        error_requests = self._query_prometheus(
            prometheus_url,
            f'sum(increase(http_requests_total{{status=~"5.."}}[{window}]))',
        )

        # Fraction of the budget consumed
        error_rate = error_requests / total_requests if total_requests else 0.0
        consumed_budget = error_rate / self.error_budget

        return {
            'total_requests': int(total_requests),
            'error_requests': int(error_requests),
            'error_rate': error_rate,
            'error_budget_remaining': max(0, 1 - consumed_budget),
            'budget_status': 'HEALTHY' if consumed_budget < 1 else 'EXHAUSTED',
        }

    def _query_prometheus(self, prometheus_url: str, query: str) -> float:
        """Run an instant query against the Prometheus HTTP API and return a scalar."""
        response = requests.get(
            f"{prometheus_url}/api/v1/query",
            params={"query": query, "time": datetime.now().timestamp()},
        )
        response.raise_for_status()
        result = response.json()["data"]["result"]
        return float(result[0]["value"][1]) if result else 0.0
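Example usage, assuming a Prometheus instance reachable at the URL below (the numbers in the sample output are illustrative):

calculator = ErrorBudgetCalculator(sli_target=0.999, period_days=30)
status = calculator.calculate_consumed_budget("http://prometheus:9090")
print(status)
# {'total_requests': 1000000, 'error_requests': 800, 'error_rate': 0.0008,
#  'error_budget_remaining': 0.2, 'budget_status': 'HEALTHY'}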
Visualization with Grafana
{
  "dashboard": {
    "panels": [
      {
        "title": "Error Budget Burn Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "((1 - (sum(rate(http_requests_total{status!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])))) / (1 - 0.999))",
            "legendFormat": "Burn Rate"
          }
        ],
        "thresholds": [
          {"value": 1, "color": "red"},
          {"value": 0.5, "color": "yellow"},
          {"value": 0, "color": "green"}
        ]
      },
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "targets": [
          {
            "expr": "clamp_min(1 - ((1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999)), 0)",
            "format": "percent"
          }
        ]
      }
    ]
  }
}
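If you provision dashboards as code rather than by hand, here is a minimal sketch that pushes this JSON through Grafana's dashboard HTTP API (the Grafana URL, token, and file name are assumptions):

# provision_dashboard.py
import json

import requests

GRAFANA_URL = "http://grafana:3000"    # illustrative
API_TOKEN = "<service-account token>"  # needs dashboard write access

with open("error_budget_dashboard.json") as f:
    dashboard = json.load(f)["dashboard"]

# POST /api/dashboards/db creates or updates the dashboard
response = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
)
response.raise_for_status()
print("Dashboard provisioned:", response.json().get("url"))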
Error Budget Policies in Action
Policy Levels and Responses
# error_budget_policies.yaml
policies:
  - level: "Healthy (0-50% consumed)"
    actions:
      - "Normal feature development"
      - "Canary risky changes"
      - "Experiment with new infrastructure"

  - level: "Warning (50-80% consumed)"
    actions:
      - "Increased monitoring"
      - "Require senior engineer approval for deployments"
      - "Post-incident reviews for all outages"
      - "Consider pausing non-critical changes"

  - level: "Critical (80-100% consumed)"
    actions:
      - "Freeze all non-essential changes"
      - "Focus exclusively on reliability improvements"
      - "Daily reliability meetings"
      - "Executive visibility"

  - level: "Exhausted (100%+ consumed)"
    actions:
      - "Complete change freeze"
      - "All hands on deck for reliability"
      - "Mandatory postmortems for any incidents"
      - "Budget reset meeting with product leadership"
Integrating with Deployment Pipelines
# GitHub Actions workflow with error budget check
name: Deploy with Error Budget Check

on:
  push:
    branches: [main]

jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    outputs:
      can_deploy: ${{ steps.check-budget.outputs.can_deploy }}
    steps:
      - name: Check Error Budget Status
        id: check-budget
        run: |
          BUDGET_STATUS=$(curl -s https://metrics-api/error-budget/status)
          if [[ "$BUDGET_STATUS" == "EXHAUSTED" ]]; then
            echo "❌ Error budget exhausted. Deployment blocked."
            echo "can_deploy=false" >> "$GITHUB_OUTPUT"
            # Notify Slack
            curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
              -H 'Content-type: application/json' \
              -d '{"text": "Deployment blocked: Error budget exhausted"}'
          else
            echo "✅ Error budget healthy. Proceeding with deployment."
            echo "can_deploy=true" >> "$GITHUB_OUTPUT"
          fi

  deploy:
    runs-on: ubuntu-latest
    needs: check-error-budget
    if: needs.check-error-budget.outputs.can_deploy == 'true'
    steps:
      - name: Deploy Application
        run: ./deploy.sh
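The workflow assumes an internal endpoint (https://metrics-api/error-budget/status) that returns the budget status as plain text. Here is a minimal sketch of such an endpoint using Flask, reusing the calculator from earlier (the service names and URLs are illustrative):

# budget_status_api.py
from flask import Flask

from error_budget_calculator import ErrorBudgetCalculator

app = Flask(__name__)
calculator = ErrorBudgetCalculator(sli_target=0.999, period_days=30)

@app.route("/error-budget/status")
def budget_status():
    result = calculator.calculate_consumed_budget("http://prometheus:9090")
    # The deployment gate only needs the status string: HEALTHY or EXHAUSTED.
    return result["budget_status"], 200, {"Content-Type": "text/plain"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)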
Common Pitfalls and Solutions
Pitfall 1: Setting Unrealistic Targets
Problem: 99.99% sounds great but can cost 10× more to operate than 99.9%.
Solution: Start business-critical services at 99.9% and internal tools at 99%.
Pitfall 2: Measuring the Wrong Things
Problem: Measuring server uptime instead of user-experienced availability.
Solution: SLIs must reflect user experience (e.g., API success rate measured from user locations).
Pitfall 3: Ignoring Seasonality
Problem: The error budget resets monthly, but traffic peaks during holidays.
Solution: Use rolling windows or adjust budgets for known traffic patterns (see the rolling-window sketch below).
Pitfall 4: No Enforcement Mechanism
Problem: Teams ignore exhausted error budgets.
Solution: Automated deployment blocking and executive visibility.
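For Pitfall 3, a rolling-window sketch: instead of resetting the budget on the first of each month, compute it over a trailing 30-day window (the daily CSV and its column names are assumptions about how you store traffic data):

# rolling_budget.py
import pandas as pd

SLO = 0.999

# Expected columns: date, total_requests, error_requests
df = pd.read_csv("daily_traffic.csv", parse_dates=["date"]).set_index("date").sort_index()

# Trailing 30-day sums avoid the hard monthly reset and absorb holiday peaks
rolling = df.rolling("30D").sum()
rolling["error_rate"] = rolling["error_requests"] / rolling["total_requests"]
rolling["budget_consumed"] = rolling["error_rate"] / (1 - SLO)
rolling["budget_remaining"] = (1 - rolling["budget_consumed"]).clip(lower=0)

print(rolling[["error_rate", "budget_consumed", "budget_remaining"]].tail())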
Advanced: Multi-Service Error Budgets
For microservices architectures:
def calculate_composite_error_budget(services):
    """
    Calculate the error budget for a user journey across multiple services.

    For serial dependencies: P(total) = P(service1) × P(service2) × ...
    Parallel (redundant) dependencies require a more involved calculation.

    `services` is a list of dicts like {'name': 'checkout', 'availability': 0.999}.
    """
    total_availability = 1.0
    for service in services:
        total_availability *= service['availability']

    composite_sla = total_availability
    composite_budget = 1 - composite_sla

    return {
        'composite_sla': f"{composite_sla:.4%}",
        'composite_budget': f"{composite_budget:.4%}",
        'weakest_link': min(services, key=lambda s: s['availability'])['name'],
    }
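Example: a checkout journey that crosses three services in series (the availability numbers are illustrative):

services = [
    {"name": "api-gateway", "availability": 0.9995},
    {"name": "checkout",    "availability": 0.9990},
    {"name": "payments",    "availability": 0.9985},
]

print(calculate_composite_error_budget(services))
# Composite availability ≈ 99.70%, budget ≈ 0.30%, weakest link: payments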
Getting Organizational Buy-In
The Executive Presentation
- Frame it as risk management: “Here’s how much downtime we can afford”
- Show cost implications: “99.99% costs $X, 99.9% costs $Y”
- Be data-driven: show historical incidents and their budget impact
- Propose a pilot: Start with one critical service, expand gradually
Sample Executive Dashboard Metrics:
- Error budget consumption rate (how fast are we burning through it?)
- Cost of reliability (engineering hours spent on reliability vs features)
- Incident impact (budget consumed by each incident)
- Trend analysis (are we getting more or less reliable over time?)
Conclusion: Making Error Budgets Work
Successful error budget implementation requires:
- Start small: One service, simple SLIs
- Automate everything: Calculation, visualization, enforcement
- Educate constantly: Everyone from engineers to executives
- Iterate: Adjust SLIs, targets, and policies based on experience
- Balance: Remember the goal is enabling innovation, not preventing it
The most important metric? Time spent debating vs. time spent improving. Good error budgets reduce debate and increase action.