exercise_cluster
$
npx mdskill add multigres/multigres-operator/exercise_clusterTests and validates MultigresCluster health through real-world scenarios
- Solves the problem of finding bugs in operator and upstream multigres
- Uses kubectl, observer, and CRD structures to manage cluster states
- Decides actions based on cluster configuration and mutation scenarios
- Reports health status and errors using the observer for accurate validation
SKILL.md
.github/skills/exercise_clusterView on GitHub ↗
---
name: exercise_cluster
description: Deploy MultigresCluster fixtures, run mutation scenarios, and validate health using the observer. Finds bugs in the operator and upstream multigres by exercising real cluster operations and verifying end-to-end health beyond CRD phase status. Use this skill whenever the user wants to test the operator, exercise the cluster, run exerciser scenarios, validate cluster health after changes, find bugs through mutation testing, or deploy and mutate fixtures.
---
# Exercise Cluster Skill
**Goal:** Find bugs in the multigres operator and upstream multigres by deploying real MultigresCluster configurations, mutating them through operator-driven workflows, and using the observer to verify true end-to-end cluster health.
**Core principles:**
- The **observer is the single source of truth** for cluster health. CRD phase `Healthy` is necessary but NOT sufficient — it misses broken replication, connection failures, and multi-primary states.
- **You drive kubectl directly.** Read the live CR, understand its structure, construct correct patches. Fixtures have different structures (`.spec.pools` vs `.overrides.pools`).
- **Every post-grace-period error is potentially a real bug.** NEVER dismiss errors — operator logs, RBAC warnings, webhook warnings, kubectl output. Investigate everything, report everything, including transient findings that resolved.
## Phase 0: Cluster Setup
Verify the kind cluster and observer are running:
```bash
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl cluster-info
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl get pods -l app.kubernetes.io/name=multigres-observer -n multigres-operator
```
If cluster is down: `make kind-deploy`. If only observer is missing: `make kind-deploy-observer`.
Define the observer helper:
```bash
observer() {
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl exec -n multigres-operator deploy/multigres-observer -- curl -sf "http://localhost:9090$1"
}
```
Verify: `observer /api/status | jq '.summary'`
## Phase 1: Deploy Fixture & Baseline
Pick a fixture from **Fixture Selection** below. For topology fixtures, read `references/topology-awareness.md` first.
1. Deploy prerequisites if they exist, wait for pods to be Running:
```bash
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl apply -f fixtures/<fixture>/prerequisites.yaml
```
2. Deploy the cluster. **Read the kubectl output** — webhook warnings mean real problems. Stop and fix before proceeding.
```bash
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl apply -f fixtures/<fixture>/cluster.yaml
```
3. Run the **Stability Verification Protocol** (below) with tier `lifecycle`. Baseline must be fully clean — any error is a bug.
4. For template/override fixtures: run `references/template-verification.md`.
## Phase 2: Mutation Testing
Consult `references/scenarios/index.md` for the full scenario catalog. For each scenario:
1. Read the live CR: `KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl get multigrescluster <name> -n <ns> -o yaml`
2. Save pre-mutation state for teardown.
3. Construct and apply the correct patch based on the actual CR.
4. Verify using the appropriate protocol:
- Fast-path eligible → `references/fast-path-verification.md`
- Concurrent mutations → `references/concurrent-mutations.md`
- Negative assertions → `references/negative-assertions.md`
- All others → **Stability Verification Protocol** below
5. Log results, teardown if applicable, verify stability again. Proceed only after confirmed stable.
## Stability Verification Protocol
Run after EVERY cluster change: deploy, mutation, teardown.
### Tiers
| Tier | When | CRD Timeout | Min Observation |
|------|------|-------------|-----------------|
| `quick` | Config-only (annotations, PVC policy) | 3 min | 60s |
| `standard` | Scale, resources, images | 5 min | 60s |
| `lifecycle` | Deploy, delete-recreate, template switches | 10 min | 90s |
### Step 1 — CRD Phase Gate
Poll `.status.phase` every 5s until `Healthy`:
```bash
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl get multigrescluster <name> -n <ns> -o jsonpath='{.status.phase}'
```
If `Degraded`/`Failed` persists >2 min or timeout reached → STOP, investigate.
### Step 2 — Grace Period
The observer suppresses pool pod errors for 2 min after creation. Wait until ALL pool pods are at least **150s old**:
```bash
KUBECONFIG=$(pwd)/kubeconfig.yaml kubectl get pods -n <ns> -l app.kubernetes.io/component=shard-pool \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
```
### Step 3 — Stability Observation
Poll every 10s, tracking: `consecutive_clean` (polls with 0 errors/fatals), `all_findings` (every error/fatal seen), `elapsed`.
```bash
observer /api/status | jq '{
summary: .summary,
errors: [(.findings // [])[] | select(.level == "error" or .level == "fatal")],
warns: [(.findings // [])[] | select(.level == "warn")]
}'
```
**Exit conditions:**
| Condition | Action |
|-----------|--------|
| `consecutive_clean >= 3` AND `elapsed >= min_observation` | **STABLE** |
| Error persists > 3 min | **UNSTABLE** — investigate |
| `elapsed >= 5 min` without stability | **TIMEOUT** — investigate |
A finding "persists" when the same check+component appears in consecutive polls. Use `/api/history` to classify.
### Step 4 — Result Classification
- **STABLE (clean)**: No findings during observation.
- **STABLE (transients observed)**: Errors appeared then resolved. List each with check, component, message, duration. Report these — they may indicate intermittent bugs.
- **UNSTABLE**: Persistent errors. Investigate via `references/deep-investigation.md`.
### Step 5 — Post-Stability Checks
Run all three after stability is confirmed:
**Warn review:**
```bash
observer /api/status | jq '[(.findings // [])[] | select(.level == "warn")]'
```
Note persistent warnings (replication lag, WAL replay paused).
**Primary verification** — catches stale podRoles bugs:
```bash
# Exactly 1 PRIMARY per pool per cell in CRD
kubectl get shard -n <ns> -o json | jq '[.items[].status.podRoles | to_entries[] | select(.value == "PRIMARY")] | length'
# Cross-reference with actual PG state
for pod in $(kubectl get pods -n <ns> -l app.kubernetes.io/component=shard-pool -o name); do
echo -n "$pod: "
kubectl exec -n <ns> $pod -c postgres -- psql -h 127.0.0.1 -p 5432 -U postgres -tAc "SELECT CASE WHEN pg_is_in_recovery() THEN 'REPLICA' ELSE 'PRIMARY' END"
done
```
CRD PRIMARY must match SQL `pg_is_in_recovery() = false`. Mismatch → trigger reconcile, report as "error".
**Observer history assertion:**
```bash
observer /api/history | jq '{persistent: .persistent, flapping: .flapping, transientCount: (.transient | length)}'
```
Assert `persistent == []` and `flapping == []`. Investigate before proceeding if not.
## Execution Modes
| Mode | What it does | When to use |
|------|-------------|-------------|
| **smoke** | Deploy → baseline verification. Template fixtures include TVP. | Quick sanity check |
| **core** | smoke + scale-up, scale-down, update-resources, delete-pool-pod | Standard coverage |
| **full** | All applicable scenarios including concurrent, webhooks, negatives | Thorough testing |
Default to **core** when unspecified.
## Fixture Selection
| Fixture | Kind-Ready | TVP | Tests |
|---------|-----------|-----|-------|
| `minimal-retain` | Yes | — | Core logic, PVC retention |
| `minimal-delete` | Yes | — | PVC deletion paths |
| `templated-full` | Yes (prereqs) | Full | Template resolution |
| `overrides-complex` | Yes (prereqs) | Override | Override merging |
| `external-etcd-mixed` | Yes (prereqs) | — | External topology server |
| `s3-backup` | Needs real S3 | — | Backup with S3 |
| `multi-cell-quorum` | Yes (heavy) | — | Multi-cell, quorum |
| `postgres-config-ref` | Yes (prereqs) | — | ConfigMap-based postgresql.conf, rolling update on content change |
| `external-adminweb` | Yes | — | External admin web IPs, annotations, status |
| `multi-cell-topology` | `kind-deploy-topology` | — | Zone-aware scheduling |
| `observability-custom` | Yes (prereqs) | — | Custom observability |
Prerequisites are self-contained (except `s3-backup`). Deploy prereqs first, wait for pods Running.
**Recommended order:** `minimal-retain` → `minimal-delete` → `templated-full` → `overrides-complex` → `external-etcd-mixed` → `multi-cell-topology`
## Reporting
Create report at `agent-docs/exerciser/exercise-run-<YYYY-MM-DD-HHMMSS>.md`:
- Environment (cluster, observer, operator image, multigres images)
- Per-fixture: baseline, each scenario (mutation, stability result, all findings, teardown)
- Summary: fixtures tested, scenarios run, bugs found, transients observed
## Reference Documents
| Reference | When to Read |
|-----------|-------------|
| `references/scenarios/index.md` | Before mutations — master scenario lookup with files, tiers, fixtures |
| `references/scenarios/core.md` | Core mode scenarios (scale, resources, delete-pod) |
| `references/scenarios/*.md` | Load specific scenario files as directed by the index |
| `references/operator-knowledge.md` | When investigating bugs |
| `references/fast-path-verification.md` | For fast-path eligible scenarios |
| `references/template-verification.md` | After deploying template/override fixtures |
| `references/negative-assertions.md` | For deletion, rejection, cleanup scenarios |
| `references/deep-investigation.md` | When UNSTABLE |
| `references/concurrent-mutations.md` | Full mode concurrent testing |
| `references/topology-awareness.md` | Topology fixtures or topology warnings |
| `patches/` | Reusable mutation scripts |
More from multigres/multigres-operator
- diagnose_with_observerUse the multigres observer to diagnose cluster health issues. Fetch structured diagnostics from the /api/status endpoint, triage findings by severity, correlate root causes, and produce actionable bug reports. Use this whenever the user reports cluster problems, wants to investigate observer findings, needs to debug multigres issues, asks about cluster health, or sees errors in operator or data plane logs.
- generate_commit_messagegenerate semantic git commit messages
- pin_upstream_imagesPin multigres container image tags in image_defaults.go for operator releases. Compares upstream multigres code changes between the current and new SHA, highlights breaking changes and new features, then updates the tags. Triggered by user requests like "prepare images for release", "pin image tags", "pin upstream images", or "upgrade multigres images".