data-pipeline-builder

$npx mdskill add OneWave-AI/claude-skills/data-pipeline-builder

Design and implement production-grade ETL/ELT data pipelines: take data sources, a destination, and transformation requirements, then produce a complete pipeline specification plus all implementation files needed to run it.

SKILL.md
.github/skills/data-pipeline-builderView on GitHub ↗
---
name: data-pipeline-builder
description: Designs and builds ETL/ELT data pipelines. Takes data sources, destination, transformation requirements. Generates pipeline code (Python/SQL), scheduling config, error handling, monitoring setup, and data quality checks. Outputs data-pipeline-spec.md + implementation files.
tools: Read, Write, Edit, Glob, Grep, Bash
model: inherit
---

# Data Pipeline Builder

Design and implement production-grade ETL/ELT data pipelines: take data sources, a destination, and transformation requirements, then produce a complete pipeline specification plus all implementation files needed to run it.

## Contents

- `references/project-structure.md` -- output file layout, architecture pattern selection, component selection.
- `references/python-patterns.md` -- Python code standards and base extractor/transformer/loader/retry patterns.
- `references/quality-checks.md` -- composable data quality check framework and built-in checks.
- `references/orchestration-config.md` -- Airflow DAG, pipeline config YAML, and monitoring/alerting patterns.
- `references/spec-template.md` -- the `data-pipeline-spec.md` output template.

## Workflow

1. Gather requirements. If the user gave clear requirements, proceed to design. Otherwise ask targeted questions: data sources (databases, APIs, files, streams); destination (warehouse, lake, database); transformations (joins, aggregations, filters, business rules); freshness requirement (real-time, hourly, daily); technology preferences (Airflow, dbt, Spark, cloud provider); data quality and compliance requirements.

2. Analyze and design. Catalog each source (connection type, auth, schema, volume, CDC availability, rate limits). Define the destination (platform, schema design, partitioning, clustering, access patterns). Map transformations (field mappings, business logic, type conversions, joins, aggregations, deduplication, SCD handling, derived fields). Establish non-functional requirements (freshness SLA, processing window, failure tolerance, retention, compliance). Select an architecture pattern and components per `references/project-structure.md`.

3. Present the design before generating code. Confirm architecture, sources, destination, schedule, key transformations, and quality gates with the user, then proceed on approval.

4. Generate implementation. Produce all files following the layout in `references/project-structure.md`, customized to the specific pipeline with no placeholder code requiring manual editing:
   - For each source, generate a concrete extractor inheriting from `BaseExtractor` (see `references/python-patterns.md`).
   - For each transformation, generate a concrete transformer class or SQL file.
   - For each destination, generate a concrete loader inheriting from `BaseLoader`.
   - Generate the Airflow DAG with all task dependencies wired up and the pipeline config YAML (see `references/orchestration-config.md`).
   - Generate quality checks tailored to the data and monitoring config with appropriate alert thresholds (see `references/quality-checks.md` and `references/orchestration-config.md`).
   - Generate tests for all custom business logic.

5. Generate the specification last. Produce `data-pipeline-spec.md` using `references/spec-template.md`, referencing all implementation files and incorporating design decisions made during the process.

## Operating Rules

- Design for idempotency -- make every step safely re-runnable.
- Include watermark/checkpoint tracking for incremental pipelines.
- Include dead letter handling for records that fail processing.
- Include schema evolution handling -- sources will change their schemas.
- Never hardcode credentials -- use environment variables or secret managers.
- Never skip quality checks -- they are the first line of defense against bad data.
- Prefer SQL for transformations expressible in SQL; use Python for complex logic that does not map cleanly to SQL.
- Include a backfill strategy and an operational runbook covering common failure scenarios in the spec.
- Use structured logging throughout and track data lineage at every transformation step.
More from OneWave-AI/claude-skills