---
name: sre-reliability-engineering
user-invocable: false
description: Use when building reliable and scalable distributed systems.
allowed-tools: []
---
# SRE Reliability Engineering
Practices for building reliable and scalable distributed systems: SLOs, error budgets, reliability patterns, capacity planning, and chaos engineering.
## Service Level Objectives (SLOs)
### Defining SLOs
```
SLI: Availability = successful requests / total requests
SLO: 99.9% availability (measured over 30 days)
Error Budget: 0.1% = 43 minutes downtime per month
```
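The arithmetic behind the error budget line above can be sketched as follows. The helper name `errorBudgetMinutes` and its parameters are illustrative assumptions, not part of any library.

```javascript
// Hypothetical helper: converts an availability SLO into allowed downtime
// over the measurement window.
function errorBudgetMinutes(sloTarget, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60; // 30 days = 43,200 minutes
  return (1 - sloTarget) * windowMinutes;
}

// 99.9% over 30 days
console.log(errorBudgetMinutes(0.999).toFixed(1)); // "43.2"
```

A tighter target shrinks the budget quickly: each additional "nine" divides the allowed downtime by ten.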
### SLO Document Template
````markdown
# API Service SLO
## Availability SLO
**Target**: 99.9% of requests succeed (measured over 30 days)
**SLI Definition**:
- Success: HTTP 200-399 responses
- Failure: HTTP 500-599 responses, timeouts
- Excluded: HTTP 400-499 (client errors)
**Measurement**:
```prometheus
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total{status!~"4.."}[30d]))
```
**Error Budget**: 0.1% = ~43 minutes/month
**Consequences**:
- Budget remaining > 50%: Ship features fast
- Budget at or below 50%: Increase caution
- Budget exhausted: Feature freeze, focus on reliability
````
## Error Budgets
### Tracking
```prometheus
# Error budget remaining
error_budget_remaining = 1 - (
  (1 - current_sli) / (1 - slo_target)
)
# Example: 99.9% SLO, currently at 99.95%
# Error budget remaining = 1 - ((1 - 0.9995) / (1 - 0.999))
# = 1 - (0.0005 / 0.001) = 0.5 (50% remaining)
```
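The same formula can be written as a small function for dashboards or scripts. This is a minimal sketch; the name `errorBudgetRemaining` is an assumption made for illustration.

```javascript
// Hypothetical helper mirroring the formula above.
// Returns the fraction of the error budget still unspent
// (1 = untouched, 0 = exhausted, negative = overspent).
function errorBudgetRemaining(currentSli, sloTarget) {
  if (sloTarget >= 1) {
    throw new RangeError('sloTarget must be below 1 (a 100% SLO has no budget)');
  }
  return 1 - (1 - currentSli) / (1 - sloTarget);
}

console.log(errorBudgetRemaining(0.9995, 0.999).toFixed(2)); // "0.50"
console.log(errorBudgetRemaining(0.998, 0.999).toFixed(2)); // "-1.00"
```

A negative result means the SLO has already been missed for the window: in the second call the service spent twice its budget.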
### Burn Rate
```prometheus
# How fast are we consuming error budget?
# A burn rate of 1 spends exactly the whole budget over the SLO window.
error_budget_burn_rate =
  (1 - current_sli_1h) / (1 - slo_target)
# Alert if burning budget 10x faster than sustainable
- alert: FastErrorBudgetBurn
  expr: error_budget_burn_rate > 10
  for: 1h
```
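To see why a burn rate of 10 deserves an alert, it helps to convert it into time until the budget runs out. The sketch below assumes a 30-day SLO window and a constant burn rate; `burnRate` and `daysToExhaustion` are hypothetical names.

```javascript
// Hypothetical helper: the burn-rate formula above.
function burnRate(shortWindowSli, sloTarget) {
  return (1 - shortWindowSli) / (1 - sloTarget);
}

// Hypothetical helper: days until a full budget is spent,
// assuming the burn rate stays constant.
function daysToExhaustion(rate, windowDays = 30) {
  return rate > 0 ? windowDays / rate : Infinity;
}

// 1% of requests failing against a 0.1% budget
const rate = burnRate(0.99, 0.999);
console.log(rate.toFixed(1)); // "10.0"
console.log(daysToExhaustion(rate).toFixed(1)); // "3.0"
```

At 10x, a month's budget is gone in about three days, which is why the alert threshold sits well above 1.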
### Policy
```
Error Budget > 75%: Ship aggressively
Error Budget 25-75%: Normal velocity
Error Budget < 25%: Slow down, increase testing
Error Budget = 0%: Feature freeze, reliability only
```
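The policy table above can be encoded so that tooling (for example a deploy gate) applies it consistently. This is a sketch only: `releasePolicy` is a hypothetical name, and treating an overspent (negative) budget the same as 0% is an assumption.

```javascript
// Hypothetical helper: maps the remaining budget fraction (0..1)
// to the policy tiers listed above.
function releasePolicy(budgetRemaining) {
  if (budgetRemaining <= 0) return 'Feature freeze, reliability only'; // assumption: overspent counts as 0%
  if (budgetRemaining < 0.25) return 'Slow down, increase testing';
  if (budgetRemaining <= 0.75) return 'Normal velocity';
  return 'Ship aggressively';
}

console.log(releasePolicy(0.9)); // "Ship aggressively"
console.log(releasePolicy(0.5)); // "Normal velocity"
console.log(releasePolicy(0.1)); // "Slow down, increase testing"
console.log(releasePolicy(0)); // "Feature freeze, reliability only"
```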
## Reliability Patterns
### Circuit Breaker
```javascript
class CircuitBreaker {
  // threshold: consecutive failures before opening
  // timeout: how long (ms) to stay open before allowing trial calls
  constructor({ threshold = 5, timeout = 60000 } = {}) {
    this.state = 'CLOSED';
    this.failures = 0;
    this.threshold = threshold;
    this.timeout = timeout;
    this.openedAt = null;
  }
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt > this.timeout) {
        // Cooldown elapsed: allow trial calls through
        this.state = 'HALF_OPEN';
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }
  onFailure() {
    this.failures++;
    // A failure while HALF_OPEN also lands here, because the count
    // is only reset on success, so the breaker reopens immediately
    if (this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = Date.now();
    }
  }
}
```
### Retry with Exponential Backoff
```javascript
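// Resolves after ms milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// maxRetries is the total number of attempts, including the first
async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error; // Out of attempts
      // 1s, 2s, 4s, ... capped at 10s
      const delay = Math.min(1000 * Math.pow(2, i), 10000);
      // Up to 1s of random jitter so clients do not retry in lockstep
      const jitter = Math.random() * 1000;
      await sleep(delay + jitter);
    }
  }
}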
```
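To make the delay schedule concrete, the sketch below lists the base delays (before jitter) that the retry loop above would wait between attempts. `backoffDelays` is a hypothetical helper; its `baseMs` and `capMs` defaults mirror the constants 1000 and 10000 used above.

```javascript
// Hypothetical helper: base delays in ms between attempts, without jitter.
// maxRetries counts total attempts, so there are maxRetries - 1 waits.
function backoffDelays(maxRetries = 3, baseMs = 1000, capMs = 10000) {
  const delays = [];
  for (let i = 0; i < maxRetries - 1; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), capMs));
  }
  return delays;
}

console.log(backoffDelays(3).join(', ')); // "1000, 2000"
console.log(backoffDelays(6).join(', ')); // "1000, 2000, 4000, 8000, 10000"
```

With the default of three attempts, the worst case adds roughly 3 to 5 seconds of waiting (3s of base delay plus up to 1s of jitter per wait) before the final error surfaces.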
### Rate Limiting
```javascript
class TokenBucket {
  // capacity: maximum tokens held; refillRate: tokens added per second
  constructor({ capacity, refillRate }) {
    this.capacity = capacity;
    this.tokens = capacity; // Start full
    this.refillRate = refillRate;
    this.lastRefill = Date.now();
  }
  // Returns true and deducts the tokens if enough are available
  tryConsume(tokens = 1) {
    this.refill();
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }
    return false;
  }
  // Adds tokens for the time elapsed since the last refill, up to capacity
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // Seconds
    const tokensToAdd = elapsed * this.refillRate;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + tokensToAdd
    );
    this.lastRefill = now;
  }
}
```
### Bulkhead
```javascript
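class Bulkhead {
  // maxConcurrent: maximum number of calls allowed to run at once
  constructor({ maxConcurrent }) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.queue = []; // Resolvers for callers waiting on a free slot
  }
  async execute(fn) {
    // Wait for a slot; re-check after waking in case another caller took it
    while (this.current >= this.maxConcurrent) {
      await new Promise(resolve => this.queue.push(resolve));
    }
    this.current++;
    try {
      return await fn();
    } finally {
      this.current--;
      // Wake the next waiting caller, if any
      if (this.queue.length > 0) {
        const resolve = this.queue.shift();
        resolve();
      }
    }
  }
}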
```
## Graceful Degradation
```javascript
async function getRecommendations(userId) {
  try {
    // Try personalized recommendations
    return await recommendationService.getPersonalized(userId, {
      timeout: 500, // Fail fast
    });
  } catch (error) {
    logger.warn('Personalized recommendations failed, falling back', {
      userId,
      error: error.message,
    });
    try {
      // Fall back to popular items
      // (assumes a cache miss returns null/undefined rather than throwing)
      const popular = await cache.get('popular_items');
      if (popular) return popular;
    } catch (fallbackError) {
      logger.warn('Popular items cache failed, using defaults', {
        error: fallbackError.message,
      });
    }
    // Final fallback: static defaults
    return DEFAULT_RECOMMENDATIONS;
  }
}
```
## Capacity Planning
### Utilization Tracking
```prometheus
# Current utilization
current_utilization =
  sum(rate(http_requests_total[5m]))
  / capacity_requests_per_second
# Alert when approaching capacity
- alert: HighUtilization
  expr: current_utilization > 0.80
  for: 10m
```
### Growth Projection
```
Current QPS: 1,000
Growth rate: 20% per month
Capacity per instance: 100 QPS
Current instances: 12
In 6 months:
  Projected QPS: 1,000 * (1.20)^6 = 2,986
  Instances needed: ceil(2,986 / 100) = 30 (at 100% utilization, no headroom)
```
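The projection above can be reproduced with a short function, which also makes it easy to add headroom. This is a sketch under stated assumptions: `projectInstances` and its `targetUtilization` parameter are hypothetical, and growth is assumed to compound monthly at a constant rate.

```javascript
// Hypothetical helper: compound-growth capacity projection.
// targetUtilization = 1 reproduces the calculation above; a lower value
// (e.g. 0.8) leaves headroom so instances are not planned to run at 100%.
function projectInstances({
  currentQps,
  monthlyGrowth,
  months,
  qpsPerInstance,
  targetUtilization = 1,
}) {
  const projectedQps = currentQps * Math.pow(1 + monthlyGrowth, months);
  return {
    projectedQps: Math.round(projectedQps),
    instances: Math.ceil(projectedQps / (qpsPerInstance * targetUtilization)),
  };
}

const base = { currentQps: 1000, monthlyGrowth: 0.2, months: 6, qpsPerInstance: 100 };
console.log(projectInstances(base)); // projectedQps: 2986, instances: 30
console.log(projectInstances({ ...base, targetUtilization: 0.8 })); // projectedQps: 2986, instances: 38
```

Planning for 80% utilization raises the requirement from 30 to 38 instances.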
### Load Testing
```javascript
// k6 load test
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp up
    { duration: '5m', target: 100 }, // Steady state
    { duration: '2m', target: 200 }, // Spike
    { duration: '5m', target: 200 }, // Higher steady
    { duration: '2m', target: 0 }, // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% under 500ms
    http_req_failed: ['rate<0.01'], // Less than 1% errors
  },
};
export default function () {
  const res = http.get('https://api.example.com/endpoint');
  check(res, {
    'status is 200': (r) => r.status === 200,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });
  sleep(1);
}
```
## Chaos Engineering
### Fault Injection
```javascript
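// Resolves after ms milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
// Inject latency: delays a fraction of calls before running fn
function withLatencyInjection(fn, { probability = 0.1, delayMs = 1000 } = {}) {
  return async (...args) => {
    if (Math.random() < probability) {
      await sleep(delayMs);
    }
    return fn(...args);
  };
}
// Inject failures: makes a fraction of calls throw instead of running fn
function withFailureInjection(fn, { probability = 0.05 } = {}) {
  return async (...args) => {
    if (Math.random() < probability) {
      throw new Error('Injected failure');
    }
    return fn(...args);
  };
}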
```
## Best Practices
### Design for Failure
- Assume all dependencies can fail
- Have fallback options
- Fail fast and time out quickly
- Implement retries with backoff
### Measure User Impact
- SLOs should reflect user experience
- Don't alert on internal metrics alone
- Track user experience with real user monitoring (RUM)
### Balance Velocity and Reliability
- Use error budgets to make decisions
- Don't target 100% reliability
- Spend error budget on innovation
### Automate Everything
- Automate deployments
- Automate rollbacks
- Automate capacity scaling
- Automate incident response