sre-monitoring-and-observability
$ npx mdskill add TheBushidoCollective/han/sre-monitoring-and-observability
Build robust systems by tracking latency, traffic, errors, and saturation.
- Detects system health issues through Prometheus metrics analysis.
- Integrates with Prometheus for query execution and data retrieval.
- Calculates SLIs and SLO compliance from historical request patterns.
- Outputs actionable alerts and compliance calculations for operators.
---
name: sre-monitoring-and-observability
description: Use when building comprehensive monitoring and observability systems.
allowed-tools: []
---
# SRE Monitoring and Observability
Guidance for building monitoring and observability systems: the four golden signals, SLIs, alerting, dashboards, distributed tracing, and structured logging.
## Four Golden Signals
### Latency
Time to process requests:
```prometheus
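# Request duration histogram (base metric name)
http_request_duration_seconds
# p95 latency over the last 5 minutes, aggregated across all series
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)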
```
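To make the `histogram_quantile` query above less opaque, here is a minimal JavaScript sketch of how a quantile can be estimated from cumulative `le` buckets by linear interpolation. It is a simplified illustration rather than Prometheus's actual implementation; the function name `estimateQuantile` and the sample bucket bounds are assumptions made for this example.

```javascript
// Simplified sketch of quantile estimation over cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with the last bucket having le = Infinity. Assumes 0 < q <= 1.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  // The lowest bucket is assumed to start at 0.
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  // Interpolate linearly inside the bucket that contains the rank.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, buckets)); // 0.75
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.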
### Traffic
Demand on the system:
```prometheus
# Requests per second
rate(http_requests_total[5m])
# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
```
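The `rate()` queries above turn an ever-increasing counter into requests per second. The sketch below shows the core idea in JavaScript, including counter resets, under the assumption that samples arrive as `{ t, v }` pairs; `counterRate` is a hypothetical name. Prometheus additionally extrapolates to the edges of the range window, which this sketch leaves out.

```javascript
// Simplified sketch of a per-second rate over counter samples.
// samples: [{ t: unixSeconds, v: counterValue }], sorted by time.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (for example a process restart):
    // count the new value as the increase since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return increase / elapsed;
}

const samples = [
  { t: 0, v: 100 },
  { t: 60, v: 160 },
  { t: 120, v: 20 }, // counter reset between scrapes
  { t: 180, v: 80 },
];
console.log(counterRate(samples)); // 140 / 180, roughly 0.78 requests per second
```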
### Errors
Rate of failed requests:
```prometheus
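# Error ratio: 5xx responses as a fraction of all responses.
# Aggregate both sides first; dividing the raw series would match
# each 5xx series only against itself.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Error budget remaining (error_rate and slo_target are placeholders,
# e.g. recording rules; slo_target is a fraction such as 0.999)
1 - (error_rate / (1 - slo_target))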
```
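The error ratio above and its relation to an SLO target can be worked through with plain numbers. This sketch assumes the SLO target is a fraction (for example 0.999) and that the error budget is `1 - target`; the function names and the sample counts are illustrative, not taken from the source.

```javascript
// Fraction of requests that failed.
function errorRate(failed, total) {
  return total === 0 ? 0 : failed / total;
}

// Fraction of the error budget still unspent:
// 1 means untouched, 0 means exhausted, negative means the SLO is violated.
function errorBudgetRemaining(rate, sloTarget) {
  const budget = 1 - sloTarget;
  return 1 - rate / budget;
}

const rate = errorRate(25, 100000); // 0.00025
console.log(errorBudgetRemaining(rate, 0.999)); // approximately 0.75
```

With a 99.9% target, an error rate of 0.025% has used about a quarter of the budget.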
### Saturation
Resource utilization:
```prometheus
# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
```
## Service Level Indicators (SLIs)
### Availability SLI
```prometheus
# Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d]))
/
sum(rate(http_requests_total[30d]))
```
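An availability target over a 30-day window, as in the query above, translates into a concrete downtime allowance. The following is a small arithmetic sketch; `allowedDowntimeMinutes` and the targets shown are assumptions chosen for illustration.

```javascript
// Minutes of full unavailability that a target permits over the window.
function allowedDowntimeMinutes(sloTarget, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

console.log(allowedDowntimeMinutes(0.999, 30).toFixed(1)); // "43.2"
console.log(allowedDowntimeMinutes(0.99, 30).toFixed(1)); // "432.0"
```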
### Latency SLI
```prometheus
# Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
```
### Throughput SLI
```prometheus
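# Requests processed within capacity, as a fraction capped at 1.0.
# capacity_requests_per_second is a placeholder for a single-series gauge you define.
clamp_max(
  sum(rate(http_requests_total[5m])) / scalar(capacity_requests_per_second),
  1.0
)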
```
## Alerting
### Alert Severity Levels
- **P0 - Critical**: Service down or severe degradation
- **P1 - High**: Significant impact, error budget at risk
- **P2 - Medium**: Degradation, not user-facing yet
- **P3 - Low**: Awareness, no immediate action needed
### Example Alerts
```yaml
groups:
  - name: sre
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning
      # sli_availability and error_budget_remaining are placeholder series
      # (for example recording rules) that are not defined in this document.
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
```
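The `ErrorBudgetBurn` rule above relies on series (`sli_availability`, `error_budget_remaining`) that this document does not define. One common way to reason about budget burn is the burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is an assumption-laden illustration of that arithmetic, with hypothetical function names and a 30-day window as the default.

```javascript
// Burn rate: how many times faster than "exactly on budget" errors are occurring.
function burnRate(observedErrorRate, sloTarget) {
  return observedErrorRate / (1 - sloTarget);
}

// Fraction of the whole window's budget consumed by burning at `rate` for `hours`.
function budgetConsumed(rate, hours, windowDays = 30) {
  return (rate * hours) / (windowDays * 24);
}

const rate = burnRate(0.0144, 0.999); // approximately 14.4
console.log(budgetConsumed(rate, 1)); // approximately 0.02, i.e. 2% of the budget in one hour
```

A burn rate of 1 spends the budget exactly over the window; higher values exhaust it proportionally sooner, which is why fast burns usually warrant paging and slow burns a ticket.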
## Dashboards
### Overview Dashboard
- Service health (red/yellow/green)
- Request rate
- Error rate
- Latency percentiles (p50, p95, p99)
- Saturation metrics
### Detailed Dashboard
- Per-endpoint metrics
- Dependency health
- Database performance
- Cache hit rates
- Queue depths
## Distributed Tracing
### OpenTelemetry
```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// processRequest is the application's own handler and is assumed to exist.
async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);
    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Record the failure on the span before rethrowing.
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}
```
## Structured Logging
```javascript
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
```
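The call above assumes a `logger` object that the document does not define. As a minimal sketch of what such a logger might emit, the following writes one JSON object per line; `formatLogLine` and the field layout (`timestamp`, `level`, `event`) are assumptions for illustration, not a specific logging library's API.

```javascript
// Hypothetical minimal structured logger: one JSON object per line.
function formatLogLine(level, event, fields = {}, now = new Date()) {
  // JSON.stringify drops keys whose value is undefined (e.g. `error` on success).
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...fields });
}

const logger = {
  info: (event, fields) => console.log(formatLogLine('info', event, fields)),
  error: (event, fields) => console.error(formatLogLine('error', event, fields)),
};

logger.info('request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 12,
  error: undefined,
});
```

Keeping the event name fixed and putting the variable data in fields is what makes such lines easy to filter and aggregate downstream.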
## Best Practices
### USE Method
For resources:
- **Utilization**: % time resource is busy
- **Saturation**: Work queued but not serviced
- **Errors**: Error count
### RED Method
For requests:
- **Rate**: Requests per second
- **Errors**: Failed requests per second
- **Duration**: Request latency distribution
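The three RED measurements can be derived from a window of raw request records. The sketch below assumes records shaped as `{ status, durationMs }`, treats status codes of 500 and above as failures, and uses nearest-rank percentiles; `redSummary` and these conventions are illustrative choices, not something the source prescribes.

```javascript
// requests: [{ status: number, durationMs: number }] observed during the window.
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the raw durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil((p / 100) * durations.length) - 1)];
  const failed = requests.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: failed / windowSeconds,
    durationMs: { p50: percentile(50), p95: percentile(95), p99: percentile(99) },
  };
}

// Ten requests over ten seconds, one of which failed.
const requests = Array.from({ length: 10 }, (_, i) => ({
  status: i === 9 ? 500 : 200,
  durationMs: (i + 1) * 10,
}));
console.log(redSummary(requests, 10));
```

In production these values would normally come from counters and histograms rather than raw records, but the definitions are the same.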
### Alert on Symptoms, Not Causes
```yaml
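# Good - alert on user impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 1
# Bad - alert on potential cause
- alert: HighCPU
  expr: |
    100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80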
```
### Runbook Links
```yaml
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"
```