gcp-waf-reliability

Name: gcp-waf-reliability
Author: TerminalSkills/skills
$npx mdskill add TerminalSkills/skills/gcp-waf-reliability
Apply Google Cloud Reliability best practices for production workloads
Ensures systems meet user-centric reliability goals with SLOs and redundancy
Leverages gcloud CLI, Cloud Monitoring, Cloud Logging, and Backup and DR Service
Evaluates architecture for multi-zone/region redundancy and autoscaling
Provides recommendations for observability, DR testing, and postmortems

---
name: gcp-waf-reliability
description: |
  Apply the Google Cloud Well-Architected Framework's Reliability pillar — define
  user-centric SLOs, eliminate single points of failure with multi-zone/region
  redundancy, configure horizontal autoscaling, observability via Cloud
  Monitoring, graceful degradation patterns, DR testing, and blameless
  postmortems. Use for production readiness reviews and SRE practices.
license: Apache-2.0
compatibility: 'gcloud-cli, Cloud Monitoring, Cloud Logging, Backup and DR Service'
metadata:
  author: google-cloud
  version: 1.0.0
  category: devops
  tags:
    - gcp
    - reliability
    - sre
    - well-architected
    - slo
---

# GCP Well-Architected Framework — Reliability

## Overview

Reliability is measured by user experience, not infrastructure uptime. A system can be 100% green on the infra dashboard while users see broken checkout — that's the gap SLOs close. This skill applies the Google Cloud Well-Architected Framework's Reliability pillar to design, evaluate, and harden production workloads.

## Instructions

### Core Principles

| Principle | What it means |
|---|---|
| **Define reliability via user experience** | Measure what users feel (request success rate, latency p99, page-load time), not just CPU/disk |
| **Set realistic SLO targets** | 99.9% leaves roughly 43 min of error budget per 30-day month; 99.99% leaves about 4 min. Pick what your business actually needs |
| **Build redundancy** | No single zone, region, or service should take down the user experience |
| **Scale horizontally** | More instances, not bigger instances; this is also fault tolerance |
| **Detect via observability** | Metrics + logs + traces; alert on user-facing symptoms, not on causes |
| **Degrade gracefully** | Read-only mode, cached responses, queue-and-retry beat hard failures |
| **Test failure recovery** | Practice failover, restore-from-backup, regional evacuation |
| **Blameless postmortems** | Document the system flaw that allowed the human error |
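
The error-budget figures in the table follow from simple arithmetic. A minimal Python sketch, assuming a 30-day window by default and treating the budget as minutes of total unavailability; the function name `error_budget_minutes` is illustrative and not part of any Google Cloud API:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Compare a few common targets over a 30-day window.
    for target in (0.999, 0.9995, 0.9999):
        print(f"{target:.2%}: {error_budget_minutes(target):.1f} min per 30 days")
```

Passing `window_days=28` gives the budget for a 28-day rolling window, which is slightly smaller than the 30-day figure.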

### Defining SLIs and SLOs

```yaml
# Example SLO definition (managed via gcloud or Terraform)
# SLI: HTTP success rate from the load balancer
# SLO: 99.9% of requests succeed over rolling 28 days
displayName: "Web frontend availability"
serviceLevelIndicator:
  requestBased:
    goodTotalRatio:
      goodServiceFilter: |
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        resource.type="https_lb_rule" AND
        resource.labels.url_map_name="web-frontend" AND
        metric.labels.response_code_class="200"
      totalServiceFilter: |
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        resource.type="https_lb_rule" AND
        resource.labels.url_map_name="web-frontend"
goal: 0.999
rollingPeriod: 2419200s  # 28 days
```

```bash
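# Apply via gcloud. Command availability varies by gcloud release; confirm with --help first.
# The same SLO can also be created through the Monitoring API
# (services.serviceLevelObjectives.create) or Terraform (google_monitoring_slo).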
gcloud monitoring services create --service-id=web-frontend \
  --display-name="Web frontend"

gcloud alpha monitoring slos create \
  --service=web-frontend \
  --slo-from-file=availability-slo.yaml
```
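
For a request-based SLI like the one above, compliance and budget consumption can be checked from two counters. The sketch below is plain Python. `evaluate_slo` and `SloStatus` are hypothetical names, and the counts are assumed to come from the good and total filters over the same window:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    sli: float              # good / total over the window
    budget_consumed: float  # fraction of the error budget used (can exceed 1.0)
    met: bool               # True while the SLI is at or above the goal


def evaluate_slo(good: int, total: int, goal: float = 0.999) -> SloStatus:
    """Evaluate a request-based SLO from good and total request counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1.0 - goal) * total  # bad requests the budget permits
    actual_bad = total - good
    return SloStatus(sli=sli,
                     budget_consumed=actual_bad / allowed_bad,
                     met=sli >= goal)


if __name__ == "__main__":
    # 500 failures out of 1,000,000 requests uses about half of a 99.9% budget.
    print(evaluate_slo(good=999_500, total=1_000_000))
```

A `budget_consumed` value above 1.0 means the SLO is already violated for the window.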

### Multi-Zone and Multi-Region Redundancy

```bash
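# Regional GKE cluster (control plane + nodes across 3 zones).
# Autopilot clusters are always regional; with Standard `clusters create`,
# pass --region rather than --zone, which creates a single-zone cluster.
gcloud container clusters create-auto prod \
  --region=us-central1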

# Regional Cloud SQL (synchronous standby in another zone)
gcloud sql instances create orders \
  --availability-type=REGIONAL \
  --region=us-central1

# Regional persistent disks (replicated synchronously across two zones)
gcloud compute disks create app-data \
  --type=pd-balanced --size=500GB \
  --region=us-central1 --replica-zones=us-central1-a,us-central1-b
```

```bash
# Multi-region Cloud Storage (geo-redundant by default)
gcloud storage buckets create gs://my-prod-data \
  --location=US --default-storage-class=STANDARD
```

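For workloads with multi-region SLOs, deploy to two regions behind a global external Application Load Balancer (backend services created with `--load-balancing-scheme=EXTERNAL_MANAGED`) and let its backend health checks steer traffic away from an unhealthy region. Cloud DNS failover routing policies, which also rely on health checks, are an option for traffic that does not pass through the global load balancer. Cloud Spanner is the right database when you need synchronous multi-region writes.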
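
As a rough way to reason about what a second region buys, the sketch below computes the availability of N redundant regions. It assumes regional failures are independent and failover is instantaneous, neither of which holds exactly in practice, so treat the result as an upper bound. `combined_availability` is an illustrative name:

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of `regions` redundant copies, each with `per_region` availability.

    Assumes independent failures and perfect failover (an optimistic model):
    the system is down only when every region is down at the same time.
    """
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    return 1.0 - (1.0 - per_region) ** regions


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} region(s): {combined_availability(0.999, n):.7f}")
```

Correlated failures (a bad global config push, a shared dependency) are why measured availability usually falls short of this number.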

### Horizontal Autoscaling

```yaml
# GKE HPA — scale on CPU + custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: api }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3              # never below 3 (one per zone)
  maxReplicas: 50
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector: { matchLabels: { resource.labels.subscription_id: events-sub } }
        target: { type: AverageValue, averageValue: "100" }
```

```bash
# Cloud Run autoscaling: minimum instances avoids cold-start pain on critical paths
gcloud run services update api \
  --min-instances=2 --max-instances=100 \
  --concurrency=80 --cpu-boost
```
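
The Kubernetes documentation describes the HPA's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (10% by default). The Python sketch below reproduces that formula with the min/max bounds from the manifest above. It is a simplification: stabilization windows, unready pods, and the "take the largest result across metrics" step are left out, and `desired_replicas` is an illustrative name:

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int = 3,
                     max_replicas: int = 50,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a single metric."""
    if current_replicas < 1 or target_value <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # CPU at 90% against a 70% target with 4 pods.
    print(desired_replicas(4, current_value=90, target_value=70))
    # Pub/Sub backlog of 1200 messages across 4 pods (300 per pod) against 100 per pod.
    print(desired_replicas(4, current_value=1200 / 4, target_value=100))
```

For an `AverageValue` target, `current_value` is the metric divided by the current pod count, which is why the backlog is divided by 4 in the second call.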

### Health Checks and Graceful Degradation

```yaml
# Kubernetes liveness + readiness — readiness gates traffic
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /ready, port: 8080 }   # checks DB connection, deps
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
```

```python
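# Circuit breaker: degrade gracefully when downstream is slow/failing.
# recommendations_service, cached_default_recommendations, and template are
# application-specific placeholders.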
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
def fetch_recommendations(user_id):
    return recommendations_service.get(user_id, timeout=2)

def render_homepage(user_id):
    try:
        recs = fetch_recommendations(user_id)
    except Exception:  # also catches CircuitBreakerError raised while the circuit is open
        recs = cached_default_recommendations()  # graceful fallback
    return template.render(recommendations=recs)
```

### Backup, Restore, and DR Testing

```bash
# Backup and DR Service for VMs / GKE / databases
gcloud backup-dr backup-plans create web-tier-plan \
  --location=us-central1 \
  --backup-vault=projects/my-project/locations/us-central1/backupVaults/prod \
  --resource-type=compute.googleapis.com/Instance \
  --backup-rule=rule-id=daily,recurrence=DAILY,retention-days=30 \
  --backup-rule=rule-id=monthly,recurrence=MONTHLY,retention-days=365
```

```bash
# Cloud SQL: enable PITR and test restore quarterly
gcloud sql instances clone orders orders-restore-test \
  --point-in-time='2026-04-15T10:00:00Z'
# Validate the clone, then delete it — proves backups are actually restorable
```

DR testing is not optional. Schedule quarterly:
- **Game days**: simulate a regional outage; force traffic to the secondary region
- **Restore drills**: clone a production DB to a non-prod project from backup, validate row counts and known queries
- **Failure injection**: kill a random pod / zone / dependency in staging; verify SLO holds
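
The row-count step of a restore drill can be automated with a small comparison. The sketch below assumes you have already collected per-table counts from the source (at the restore point) and from the restored clone; how you collect them is up to your database client. `row_count_mismatches` is a hypothetical helper name:

```python
from __future__ import annotations


def row_count_mismatches(expected: dict[str, int],
                         restored: dict[str, int],
                         tolerance: float = 0.0) -> dict[str, tuple[int | None, int | None]]:
    """Return {table: (expected, restored)} for tables that are missing on
    either side or whose counts differ by more than `tolerance` (a fraction
    of the expected count)."""
    mismatches: dict[str, tuple[int | None, int | None]] = {}
    for table in sorted(expected.keys() | restored.keys()):
        want = expected.get(table)
        got = restored.get(table)
        if want is None or got is None:
            mismatches[table] = (want, got)  # table missing on one side
        elif abs(got - want) > tolerance * max(want, 1):
            mismatches[table] = (want, got)
    return mismatches


if __name__ == "__main__":
    source = {"orders": 1000, "users": 50}
    clone = {"orders": 1000, "users": 49}
    print(row_count_mismatches(source, clone))
```

An empty result means the counts agree. Known-query checks (for example, a checksum over a stable table) still need to run separately.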

### Alerting on Symptoms, Not Causes

```yaml
# Bad: alerts on CPU usage. Good: alerts on user-facing error rate.
displayName: "High error rate — web-frontend"
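combiner: OR
conditions:
  - displayName: "Error rate > 1% for 5 minutes"
    conditionThreshold:
      # Ratio condition: filter is the numerator, denominatorFilter the denominator
      filter: |
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        resource.type="https_lb_rule" AND
        resource.labels.url_map_name="web-frontend" AND
        metric.labels.response_code_class!="200"
      aggregations: [{ alignmentPeriod: 60s, perSeriesAligner: ALIGN_DELTA, crossSeriesReducer: REDUCE_SUM }]
      denominatorFilter: |
        metric.type="loadbalancing.googleapis.com/https/request_count" AND
        resource.type="https_lb_rule" AND
        resource.labels.url_map_name="web-frontend"
      denominatorAggregations: [{ alignmentPeriod: 60s, perSeriesAligner: ALIGN_DELTA, crossSeriesReducer: REDUCE_SUM }]
      comparison: COMPARISON_GT
      thresholdValue: 0.01
      duration: 300s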
notificationChannels:
  - projects/my-project/notificationChannels/oncall-pagerduty
```

Page on:
- SLO burn rate (e.g., 14.4× over 1h, 6× over 6h — Google's multi-window strategy)
- User-facing error rate
- Latency p99 above SLO

Don't page on:
- CPU above 80% (autoscaling handles this)
- Disk above 80% (alert someone, but not on-call)
- Single-instance health (the load balancer handles this)
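
Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. The Python sketch below shows where 14.4 and 6 come from (2% of a 30-day budget in 1 hour, 5% in 6 hours) and the multi-window rule of paging only when both a long and a short window exceed the threshold. Function names are illustrative, and the 30-day period is an assumption:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn faster than the threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    print(burn_rate_threshold(0.02, 1))   # fast-burn threshold for the 1h window
    print(burn_rate_threshold(0.05, 6))   # slower threshold for the 6h window
    # 2% errors over the last hour and 3% over the last 5 minutes, 99.9% SLO.
    print(should_page(0.02, 0.03, slo_target=0.999, threshold=14.4))
```

The short window is what stops the page once the incident is over: if the last five minutes are healthy, `should_page` returns `False` even though the one-hour window still looks bad.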

### Validation Checklist

- [ ] **User-focused SLIs/SLOs** explicitly defined and dashboarded
- [ ] **No single zone** — every tier is regional (GKE regional cluster, Cloud SQL HA, regional PDs, multi-region GCS)
- [ ] **Autoscaling enabled** with min ≥ 3 (one per zone) and a concrete max
- [ ] **Liveness + readiness + startup probes** configured for all critical pods
- [ ] **Health checks trigger automated failover** at the load balancer level
- [ ] **PodDisruptionBudgets** for every Deployment serving traffic
- [ ] **Backups are scheduled AND restored** at least quarterly
- [ ] **Graceful degradation patterns** in place (circuit breakers, retries with exponential backoff, rate limiting)
- [ ] **Game days / chaos engineering** run regularly
- [ ] **Blameless postmortem template** + tracking system exists and is used

## Examples

### Example 1 — Production readiness review for a new service

User wants to ship a payments API. Walk through: defined SLOs (99.95% success, p99 < 500ms), regional GKE Autopilot cluster, Cloud SQL with `availability-type=REGIONAL`, HPA min=3 max=30, PDB minAvailable=2, readiness probe checking DB connectivity, alerting on SLO burn rate (not CPU), Backup-and-DR daily snapshots, and a quarterly restore drill on the calendar. Block ship if any of those are missing.

### Example 2 — Diagnose unreliability complaints despite green dashboards

User reports "users say checkout is broken but our infra dashboards are all green." Audit: alerts are on CPU and disk usage (causes), not on HTTP 5xx rate (symptom). Wire up an SLO on payment success rate, set up multi-window burn-rate alerts (14.4× / 6× / 3× / 1×), and discover a 0.4% baseline error from a flaky third-party that was invisible until measured. Recommend retries with jitter + circuit breaker for graceful degradation.
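
A minimal sketch of the "retries with jitter" recommendation, using exponential backoff with full jitter in plain Python. The helper name `retry_with_backoff`, its defaults, and the injectable `sleep` and `rng` parameters are assumptions made for illustration and testability, not part of any library:

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T],
                       max_attempts: int = 4,
                       base_delay: float = 0.2,
                       max_delay: float = 5.0,
                       retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Callable[[], float] = random.random) -> T:
    """Call `call()` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential cap: base, 2*base, 4*base, ... bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the cap so that many
            # clients retrying at once do not synchronize.
            sleep(rng() * cap)
    raise AssertionError("unreachable")
```

Retry only idempotent calls, or calls protected by an idempotency key; for a payments API that distinction matters more than the backoff parameters. Pair this with a circuit breaker so retries stop when the dependency is clearly down.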

## Guidelines

- **SLOs measure user experience** — pick metrics from the load balancer or app, not from the kernel
- Default to **regional** everything; single-zone is dev-only
- **Min replicas ≥ 3** for any tier behind a regional load balancer
- **Alert on symptoms**, not causes — a CPU alert wakes the wrong person
- Use **multi-window burn-rate alerts** (Google SRE workbook approach), not simple thresholds
- **Test failure recovery** quarterly — backups you've never restored aren't backups
- **PodDisruptionBudgets** limit how many pods voluntary disruptions (node upgrades, cluster-autoscaler scale-downs) can evict at once, so routine maintenance doesn't break SLOs
- For **graceful degradation**, ship with circuit breakers, retries with exponential backoff + jitter, and read-only fallbacks
- **Blameless postmortems** focus on the system flaw — humans will keep making mistakes; the system shouldn't let them cause outages
- For multi-region SLOs, use Cloud Spanner / multi-region GCS / global HTTPS LB; resist the urge to shard yourself