test-driven-development
$
npx mdskill add cloudflare/workspace/test-driven-developmentEnsures code correctness through automated test execution
- Validates logic before implementation to prevent bugs.
- Depends on testing frameworks and assertion libraries.
- Analyzes requirements to determine if tests are needed.
- Generates failing tests that guide minimal correct implementations.
SKILL.md
.github/skills/test-driven-developmentView on GitHub ↗
---
name: test-driven-development
description: Drives development with tests. Use when implementing any logic, fixing any bug, or changing any behavior. Use when you need to prove that code works, when a bug report arrives, or when you're about to modify existing functionality.
---
# Test-Driven Development
## Overview
Write a failing test before writing the code that makes it pass. For bug fixes, reproduce the bug with a test before attempting a fix. Tests are proof — "seems right" is not done. A codebase with good tests is an AI agent's superpower; a codebase without tests is a liability.
## When to Use
- Implementing any new logic or behavior
- Fixing any bug (the Prove-It Pattern)
- Modifying existing functionality
- Adding edge case handling
- Any change that could break existing behavior
**When NOT to use:** Pure configuration changes, documentation updates, or static content changes that have no behavioral impact.
## The TDD Cycle
```
RED GREEN REFACTOR
Write a test Write minimal code Clean up the
that fails ──→ to make it pass ──→ implementation ──→ (repeat)
│ │ │
▼ ▼ ▼
Test FAILS Test PASSES Tests still PASS
```
### Step 1: RED — Write a Failing Test
Write the test first. It must fail. A test that passes immediately proves nothing.
```typescript
// RED: This test fails because createTask doesn't exist yet
describe('TaskService', () => {
it('creates a task with title and default status', async () => {
const task = await taskService.createTask({ title: 'Buy groceries' });
expect(task.id).toBeDefined();
expect(task.title).toBe('Buy groceries');
expect(task.status).toBe('pending');
expect(task.createdAt).toBeInstanceOf(Date);
});
});
```
### Step 2: GREEN — Make It Pass
Write the minimum code to make the test pass. Don't over-engineer:
```typescript
// GREEN: Minimal implementation
export async function createTask(input: { title: string }): Promise<Task> {
const task = {
id: generateId(),
title: input.title,
status: 'pending' as const,
createdAt: new Date(),
};
await db.tasks.insert(task);
return task;
}
```
### Step 3: REFACTOR — Clean Up
With tests green, improve the code without changing behavior:
- Extract shared logic
- Improve naming
- Remove duplication
- Optimize if necessary
Run tests after every refactor step to confirm nothing broke.
## The Prove-It Pattern (Bug Fixes)
When a bug is reported, **do not start by trying to fix it.** Start by writing a test that reproduces it.
```
Bug report arrives
│
▼
Write a test that demonstrates the bug
│
▼
Test FAILS (confirming the bug exists)
│
▼
Implement the fix
│
▼
Test PASSES (proving the fix works)
│
▼
Run full test suite (no regressions)
```
**Example:**
```typescript
// Bug: "Completing a task doesn't update the completedAt timestamp"
// Step 1: Write the reproduction test (it should FAIL)
it('sets completedAt when task is completed', async () => {
const task = await taskService.createTask({ title: 'Test' });
const completed = await taskService.completeTask(task.id);
expect(completed.status).toBe('completed');
expect(completed.completedAt).toBeInstanceOf(Date); // This fails → bug confirmed
});
// Step 2: Fix the bug
export async function completeTask(id: string): Promise<Task> {
return db.tasks.update(id, {
status: 'completed',
completedAt: new Date(), // This was missing
});
}
// Step 3: Test passes → bug fixed, regression guarded
```
## The Test Pyramid
Invest testing effort according to the pyramid — most tests should be small and fast, with progressively fewer tests at higher levels:
```
╱╲
╱ ╲ E2E Tests (~5%)
╱ ╲ Full user flows, real browser
╱──────╲
╱ ╲ Integration Tests (~15%)
╱ ╲ Component interactions, API boundaries
╱────────────╲
╱ ╲ Unit Tests (~80%)
╱ ╲ Pure logic, isolated, milliseconds each
╱──────────────────╲
```
**The Beyonce Rule:** If you liked it, you should have put a test on it. Infrastructure changes, refactoring, and migrations are not responsible for catching your bugs — your tests are. If a change breaks your code and you didn't have a test for it, that's on you.
### Test Sizes (Resource Model)
Beyond the pyramid levels, classify tests by what resources they consume:
| Size | Constraints | Speed | Example |
|------|------------|-------|---------|
| **Small** | Single process, no I/O, no network, no database | Milliseconds | Pure function tests, data transforms |
| **Medium** | Multi-process OK, localhost only, no external services | Seconds | API tests with test DB, component tests |
| **Large** | Multi-machine OK, external services allowed | Minutes | E2E tests, performance benchmarks, staging integration |
Small tests should make up the vast majority of your suite. They're fast, reliable, and easy to debug when they fail.
### Decision Guide
```
Is it pure logic with no side effects?
→ Unit test (small)
Does it cross a boundary (API, database, file system)?
→ Integration test (medium)
Is it a critical user flow that must work end-to-end?
→ E2E test (large) — limit these to critical paths
```
## Writing Good Tests
### Test State, Not Interactions
Assert on the *outcome* of an operation, not on which methods were called internally. Tests that verify method call sequences break when you refactor, even if the behavior is unchanged.
```typescript
// Good: Tests what the function does (state-based)
it('returns tasks sorted by creation date, newest first', async () => {
const tasks = await listTasks({ sortBy: 'createdAt', sortOrder: 'desc' });
expect(tasks[0].createdAt.getTime())
.toBeGreaterThan(tasks[1].createdAt.getTime());
});
// Bad: Tests how the function works internally (interaction-based)
it('calls db.query with ORDER BY created_at DESC', async () => {
await listTasks({ sortBy: 'createdAt', sortOrder: 'desc' });
expect(db.query).toHaveBeenCalledWith(
expect.stringContaining('ORDER BY created_at DESC')
);
});
```
### DAMP Over DRY in Tests
In production code, DRY (Don't Repeat Yourself) is usually right. In tests, **DAMP (Descriptive And Meaningful Phrases)** is better. A test should read like a specification — each test should tell a complete story without requiring the reader to trace through shared helpers.
```typescript
// DAMP: Each test is self-contained and readable
it('rejects tasks with empty titles', () => {
const input = { title: '', assignee: 'user-1' };
expect(() => createTask(input)).toThrow('Title is required');
});
it('trims whitespace from titles', () => {
const input = { title: ' Buy groceries ', assignee: 'user-1' };
const task = createTask(input);
expect(task.title).toBe('Buy groceries');
});
// Over-DRY: Shared setup obscures what each test actually verifies
// (Don't do this just to avoid repeating the input shape)
```
Duplication in tests is acceptable when it makes each test independently understandable.
### Prefer Real Implementations Over Mocks
Use the simplest test double that gets the job done. The more your tests use real code, the more confidence they provide.
```
Preference order (most to least preferred):
1. Real implementation → Highest confidence, catches real bugs
2. Fake → In-memory version of a dependency (e.g., fake DB)
3. Stub → Returns canned data, no behavior
4. Mock (interaction) → Verifies method calls — use sparingly
```
**Use mocks only when:** the real implementation is too slow, non-deterministic, or has side effects you can't control (external APIs, email sending). Over-mocking creates tests that pass while production breaks.
### Use the Arrange-Act-Assert Pattern
```typescript
it('marks overdue tasks when deadline has passed', () => {
// Arrange: Set up the test scenario
const task = createTask({
title: 'Test',
deadline: new Date('2025-01-01'),
});
// Act: Perform the action being tested
const result = checkOverdue(task, new Date('2025-01-02'));
// Assert: Verify the outcome
expect(result.isOverdue).toBe(true);
});
```
### One Assertion Per Concept
```typescript
// Good: Each test verifies one behavior
it('rejects empty titles', () => { ... });
it('trims whitespace from titles', () => { ... });
it('enforces maximum title length', () => { ... });
// Bad: Everything in one test
it('validates titles correctly', () => {
expect(() => createTask({ title: '' })).toThrow();
expect(createTask({ title: ' hello ' }).title).toBe('hello');
expect(() => createTask({ title: 'a'.repeat(256) })).toThrow();
});
```
### Name Tests Descriptively
```typescript
// Good: Reads like a specification
describe('TaskService.completeTask', () => {
it('sets status to completed and records timestamp', ...);
it('throws NotFoundError for non-existent task', ...);
it('is idempotent — completing an already-completed task is a no-op', ...);
it('sends notification to task assignee', ...);
});
// Bad: Vague names
describe('TaskService', () => {
it('works', ...);
it('handles errors', ...);
it('test 3', ...);
});
```
## Test Anti-Patterns to Avoid
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Testing implementation details | Tests break when refactoring even if behavior is unchanged | Test inputs and outputs, not internal structure |
| Flaky tests (timing, order-dependent) | Erode trust in the test suite | Use deterministic assertions, isolate test state |
| Testing framework code | Wastes time testing third-party behavior | Only test YOUR code |
| Snapshot abuse | Large snapshots nobody reviews, break on any change | Use snapshots sparingly and review every change |
| No test isolation | Tests pass individually but fail together | Each test sets up and tears down its own state |
| Mocking everything | Tests pass but production breaks | Prefer real implementations > fakes > stubs > mocks. Mock only at boundaries where real deps are slow or non-deterministic |
## Common Rationalizations
| Rationalization | Reality |
|---|---|
| "I'll write tests after the code works" | You won't. And tests written after the fact test implementation, not behavior. |
| "This is too simple to test" | Simple code gets complicated. The test documents the expected behavior. |
| "Tests slow me down" | Tests slow you down now. They speed you up every time you change the code later. |
| "I tested it manually" | Manual testing doesn't persist. Tomorrow's change might break it with no way to know. |
| "The code is self-explanatory" | Tests ARE the specification. They document what the code should do, not what it does. |
| "It's just a prototype" | Prototypes become production code. Tests from day one prevent the "test debt" crisis. |
| "Let me run the tests again just to be extra sure" | After a clean test run, repeating the same command adds nothing unless the code has changed since. Run again after subsequent edits, not as reassurance. |
## Red Flags
- Writing code without any corresponding tests
- Tests that pass on the first run (they may not be testing what you think)
- "All tests pass" but no tests were actually run
- Bug fixes without reproduction tests
- Tests that test framework behavior instead of application behavior
- Test names that don't describe the expected behavior
- Skipping tests to make the suite pass
- Running the same test command twice in a row without any intervening code change
## Verification
After completing any implementation:
- [ ] Every new behavior has a corresponding test
- [ ] All tests pass: `npm test`
- [ ] Bug fixes include a reproduction test that failed before the fix
- [ ] Test names describe the behavior being verified
- [ ] No tests were skipped or disabled
- [ ] Coverage hasn't decreased (if tracked)
**Note:** Run each test command after a change that could affect the result. After a clean run, don't repeat the same command unless the code has changed since — re-running on unchanged code adds no confidence.
More from cloudflare/workspace
- capnweb|
- debugging-wsd-fuseDebug wsd in real-FUSE mode end-to-end without workerd, vitest-pool-workers, or wrangler in the loop. Boot the linux-x64 binary in a privileged docker container, drive its capnweb /ws endpoint from Node, simulate DO-side sync from a SQLiteTestStorage, and isolate FUSE-related deadlocks. Load when a real-FUSE bug reproduces locally but unit tests pass, when the harness vitest tests hang against a real container, or when you need to attribute a wedge to FUSE vs sync vs exec.
- prose|
- pull-requestsDescribes how to write pull/merge requests. Use when asked to write or edit a pull request or merge request description. This skill is not relevant to commit messages.
- triageHow the TriageAgent should approach a GitHub issue. Load this before deciding whether to attempt a fix or to write up findings.