arckit-data-model

Name: arckit-data-model
Author: tractorjuice/arc-kit
$npx mdskill add tractorjuice/arc-kit/arckit-data-model
Build compliant data models anchored in explicit requirements.
Generates entity relationships and governance rules from DR and NFR-SEC inputs.
Depends on ARC-*.md artifacts and global policy documents for context.
Validates outputs against mandatory requirements before model generation.
Delivers structured schemas ready for database design and API specs.
SKILL.md
.github/skills/arckit-data-modelView on GitHub ↗
---
name: arckit-data-model
description: "Create comprehensive data model with entity relationships, GDPR compliance, and data governance"
---

You are helping an enterprise architect create a comprehensive data model for a project that will guide database design, API specifications, and compliance requirements.

## User Input

```text
$ARGUMENTS
```

## Instructions

> **Note**: Before generating, scan `projects/` for existing project directories. For each project, list all `ARC-*.md` artifacts, check `external/` for reference documents, and check `000-global/` for cross-project policies. If no external docs exist but they would improve output, ask the user.

1. **Read existing artifacts from the project context:**

   **MANDATORY** (warn if missing):
   - **REQ** (Requirements)
     - Extract: All DR (data requirements), NFR-SEC (security/privacy), INT (integration/data exchange), BR (data-related business requirements)
     - If missing: STOP and warn user to run `$arckit-requirements` first — data model MUST be based on DR-xxx requirements

   **RECOMMENDED** (read if available, note if missing):
   - **STKE** (Stakeholder Analysis)
     - Extract: Data owners from RACI matrix, governance stakeholders, data stewardship responsibilities
   - **PRIN** (Architecture Principles, in 000-global)
     - Extract: Data governance standards, privacy by design principles, data sovereignty requirements

   **OPTIONAL** (read if available, skip silently if missing):
   - **SOBC** (Business Case)
     - Extract: Data-related benefits and costs
   - **RSCH** (Research Findings)
     - Extract: Database technology recommendations, data platform choices

2. **Identify the target project**:
   - Use the **ArcKit Project Context** (above) to find the project matching the user's input (by name or number)
   - If no match, create a new project:
     1. Use Glob to list `projects/*/` directories and find the highest `NNN-*` number (or start at `001` if none exist)
     2. Calculate the next number (zero-padded to 3 digits, e.g., `002`)
     3. Slugify the project name (lowercase, replace non-alphanumeric with hyphens, trim)
     4. Use the Write tool to create `projects/{NNN}-{slug}/README.md` with the project name, ID, and date — the Write tool will create all parent directories automatically
     5. Also create `projects/{NNN}-{slug}/external/README.md` with a note to place external reference documents here
     6. Set `PROJECT_ID` = the 3-digit number, `PROJECT_PATH` = the new directory path

3. **Read external documents and policies**:
   - Read any **external documents** listed in the project context (`external/` files) — extract entity definitions, relationships, data types, constraints, existing schemas, migration requirements
   - Read any **enterprise standards** in `projects/000-global/external/` — extract enterprise data dictionaries, master data management standards, cross-project data architecture patterns
   - If no external docs exist but they would improve the data model, ask: "Do you have any existing database schemas, ERD diagrams, or data dictionaries? I can read PDFs, images, and SQL files directly. Place them in `projects/{project-dir}/external/` and re-run, or skip."
   - **Citation traceability**: When referencing content from external documents, follow the citation instructions in `.arckit/references/citation-instructions.md`. Place inline citation markers (e.g., `[PP-C1]`) next to findings informed by source documents and populate the "External References" section in the template.

4. **Read the template** (with user override support):
   - **First**, check if `.arckit/templates/data-model-template.md` exists in the project root
   - **If found**: Read the user's customized template (user override takes precedence)
   - **If not found**: Read `.arckit/templates/data-model-template.md` (default)

   > **Tip**: Users can customize templates with `$arckit-customize data-model`

5. **Extract data requirements**:
   - Read the project's requirements document (`ARC-*-REQ-*.md`)
   - Extract ALL Data Requirements (DR-xxx)
   - Also look for privacy/GDPR requirements in NFR section
   - Identify integration requirements (INT-xxx) that involve data exchange
   - Note any data-related business requirements (BR-xxx)

6. **Load Mermaid Syntax Reference**:
   - Read `.arckit/skills/mermaid-syntax/references/entityRelationshipDiagram.md` for official Mermaid ER diagram syntax — entity definitions, relationship types, cardinality notation, and attribute syntax.

7. **Generate comprehensive data model**:

   **A. Executive Summary**:
   - Total number of entities identified
   - Data classification summary (Public, Internal, Confidential, Restricted)
   - PII/sensitive data identified (Yes/No)
   - GDPR/DPA 2018 compliance status
   - Key data governance stakeholders

   **B. Visual Entity-Relationship Diagram (ERD)**:
   - Create Mermaid ERD syntax showing:
     - All entities (E-001, E-002, etc.)
     - Relationships (one-to-one, one-to-many, many-to-many)
     - Cardinality notation
   - Organise by logical domain/bounded context if possible
   - Use descriptive entity and relationship names

   **C. Entity Catalog** (E-001, E-002, etc.):
   - For each entity, document:
     - **Entity ID**: E-001, E-002, etc.
     - **Entity Name**: Customer, Transaction, Product, etc.
     - **Description**: What this entity represents
     - **Source Requirement**: Which DR-xxx requirement(s) drive this entity
     - **Business Owner**: From stakeholder RACI matrix
     - **Technical Owner**: Data steward or database team
     - **Data Classification**: Public/Internal/Confidential/Restricted
     - **Estimated Volume**: Initial records + growth rate
     - **Retention Period**: How long data is kept (GDPR requirement)
     - **Attributes Table**:

       ```text
       | Attribute | Type | Required | PII | Description | Validation | Source Req |
       |-----------|------|----------|-----|-------------|------------|------------|
       | customer_id | UUID | Yes | No | Unique identifier | UUID v4 | DR-001 |
       | email | String(255) | Yes | Yes | Contact email | RFC 5322, unique | DR-002 |
       ```

     - **Relationships**: What other entities this connects to
     - **Indexes**: Primary key, foreign keys, performance indexes
     - **Privacy Notes**: GDPR considerations, data subject rights

   **D. Data Governance Matrix**:
   - For each entity, identify:
     - **Data Owner**: Business stakeholder responsible (from RACI matrix)
     - **Data Steward**: Person responsible for quality and compliance
     - **Data Custodian**: Technical team managing storage/backups
     - **Access Control**: Who can view/modify (roles/permissions)
     - **Sensitivity**: Public, Internal, Confidential, Restricted
     - **Compliance**: GDPR, PCI-DSS, HIPAA, etc.
     - **Quality SLA**: Accuracy, completeness, timeliness targets

   **E. CRUD Matrix** (Create, Read, Update, Delete):
   - Map which components/systems can perform which operations on each entity
   - Example:

     ```text
     | Entity | Payment API | Admin Portal | Reporting Service | CRM Integration |
     |--------|-------------|--------------|-------------------|-----------------|
     | E-001: Customer | CR-- | CRUD | -R-- | -R-- |
     | E-002: Transaction | CR-- | -R-- | -R-- | ---- |
     ```

   - Helps identify unauthorized access patterns and data flows

   **F. Data Integration Mapping**:
   - **Upstream Systems**: Where data comes from
     - System name, entity mapping, update frequency, data quality SLA
   - **Downstream Systems**: Where data goes to
     - System name, entity mapping, sync method (API, batch, event), latency SLA
   - **Master Data Management**: Which system is "source of truth" for each entity

   **G. Privacy & Compliance**:
   - **GDPR/DPA 2018 Compliance**:
     - List all PII attributes across all entities
     - Document legal basis for processing (consent, contract, legitimate interest, etc.)
     - Data subject rights implementation (access, rectification, erasure, portability)
     - Data retention schedules per entity
     - Cross-border data transfer considerations (UK-EU adequacy)
   - **Data Protection Impact Assessment (DPIA)**:
     - Is DPIA required? (Yes if high-risk processing of PII)
     - Key privacy risks identified
     - Mitigation measures
     - ICO notification requirements
   - **Sector-Specific Compliance**:
     - PCI-DSS: If payment card data (special handling requirements)
     - HIPAA: If healthcare data (US projects)
     - FCA regulations: If financial services (UK)
     - Government Security Classifications: If public sector (OFFICIAL, SECRET)

   **H. Data Quality Framework**:
   - **Quality Dimensions**:
     - **Accuracy**: How correct is the data? (validation rules, reference data)
     - **Completeness**: Required fields populated? (% target)
     - **Consistency**: Same data across systems? (reconciliation rules)
     - **Timeliness**: How current is the data? (update frequency, staleness tolerance)
     - **Uniqueness**: No duplicates? (deduplication rules)
     - **Validity**: Conforms to format? (regex patterns, enums, ranges)
   - **Data Quality Metrics**:
     - Define measurable targets per entity (e.g., "Customer email accuracy >99%")
     - Data quality monitoring approach
     - Data quality issue resolution process

   **I. Requirements Traceability**:
   - Create traceability table:

     ```text
     | Requirement | Entity | Attributes | Rationale |
     |-------------|--------|------------|-----------|
     | DR-001 | E-001: Customer | customer_id, email, name | Store customer identity |
     | DR-002 | E-002: Transaction | transaction_id, amount, status | Track payments |
     | NFR-SEC-003 | E-001: Customer | password_hash (encrypted) | Secure authentication |
     ```

   - Show how every DR-xxx requirement maps to entities/attributes
   - Flag any DR-xxx requirements NOT yet modeled (gaps)

   **J. Implementation Guidance**:
   - **Database Technology Recommendation**:
     - Relational (PostgreSQL, MySQL) for transactional data
     - Document (MongoDB, DynamoDB) for flexible schemas
     - Graph (Neo4j) for highly connected data
     - Time-series (InfluxDB, TimescaleDB) for metrics/events
   - **Schema Migration Strategy**: How to evolve schema (Flyway, Liquibase, Alembic)
   - **Backup and Recovery**: RPO/RTO targets, backup frequency
   - **Data Archival**: When to move data from hot to cold storage
   - **Testing Data**: Anonymization/pseudonymization for test environments

8. **UK Government Compliance** (if applicable):
   - **Government Security Classifications**: OFFICIAL, SECRET, TOP SECRET
   - **Data Standards**: Use GDS Data Standards Catalogue where applicable
   - **Open Standards**: Preference for open data formats (JSON, CSV, OData)
   - **ICO Data Protection**: Reference ICO guidance for public sector
   - **National Cyber Security Centre (NCSC)**: Data security patterns

Before writing the file, read `.arckit/references/quality-checklist.md` and verify all **Common Checks** plus the **DATA** per-type checks pass. Fix any failures before proceeding.

9. **Write the output**:
   - Write to `projects/{project-dir}/ARC-{PROJECT_ID}-DATA-v1.0.md`
   - Use the exact template structure from `data-model-template.md`
   - Include Mermaid ERD at the top for quick visualization
   - Include all sections even if some are TBD
   - Create comprehensive entity catalog with ALL attributes

**IMPORTANT - Auto-Populate Document Information Fields**:

Before completing the document, populate document information fields:

### Auto-populated fields

- `[PROJECT_ID]` → Extract from project path (e.g., "001")
- `[VERSION]` → Start with "1.0" for new documents
- `[DATE]` / `[YYYY-MM-DD]` → Current date in YYYY-MM-DD format
- `[DOCUMENT_TYPE_NAME]` → Document purpose
- `ARC-[PROJECT_ID]-DATA-v[VERSION]` → Generated document ID
- `[STATUS]` → "DRAFT" for new documents
- `[CLASSIFICATION]` → Default to "OFFICIAL" (UK Gov) or "PUBLIC"

### User-provided fields

- `[PROJECT_NAME]` → Full project name
- `[OWNER_NAME_AND_ROLE]` → Document owner

### Revision History

```markdown
| 1.0 | {DATE} | ArcKit AI | Initial creation from `$arckit-data-model` command |
```

### Generation Metadata Footer

```markdown
**Generated by**: ArcKit `$arckit-data-model` command
**Generated on**: {DATE}
**ArcKit Version**: {ARCKIT_VERSION}
**Project**: {PROJECT_NAME} (Project {PROJECT_ID})
**AI Model**: [Actual model name]
```

10. **Summarize what you created**:

- How many entities defined (E-001, E-002, etc.)
- How many total attributes across all entities
- How many entities contain PII (privacy-sensitive)
- Data classification breakdown (Public/Internal/Confidential/Restricted)
- GDPR compliance status (compliant / needs DPIA / gaps identified)
- Key data governance stakeholders identified
- Requirements coverage (% of DR-xxx requirements modeled)
- Suggested next steps (e.g., "Review data model with data protection officer before proceeding to HLD" or "Run `$arckit-hld-review` to validate database technology choices")

## Example Usage

User: `$arckit-data-model Create data model for payment gateway project`

You should:

- Check prerequisites (requirements document exists, stakeholder analysis recommended)
- Find project directory (e.g., `projects/001-payment-gateway-modernization/`)
- Extract DR-xxx requirements from the requirements document
- Generate comprehensive data model:
  - Mermaid ERD showing Customer, Transaction, PaymentMethod, RefundRequest entities
  - Detailed entity catalog with attributes, PII flags, retention periods
  - GDPR compliance: PII identified, legal basis documented, DPIA required
  - Data governance: CFO owns financial data, DPO owns PII, IT owns storage
  - CRUD matrix: Payment API can create transactions, Admin can read all, Reporting read-only
  - PCI-DSS compliance: Payment card data encrypted, tokenized, not stored long-term
  - Requirements traceability: All DR-001 through DR-008 mapped to entities
- **CRITICAL - Token Efficiency**: Use the **Write tool** to create `projects/001-payment-gateway-modernization/ARC-001-DATA-v1.0.md`
  - **DO NOT** output the full document in your response (this exceeds 32K token limit!)
- Show summary only (see Output Instructions below)

## Important Notes

- **Data model drives database schema, API contracts, and data governance policies**
- **GDPR compliance is MANDATORY for any PII - identify and protect it**
- **Every entity MUST trace back to at least one DR-xxx requirement**
- **Data ownership is critical - assign business owners from stakeholder RACI matrix**
- **PII requires special handling**: encryption at rest, encryption in transit, access controls, audit logging, retention limits
- **Use Mermaid ERD syntax** for GitHub-renderable diagrams (not PlantUML or other formats)
- **Data quality metrics should be measurable** (not "high quality", use "99% accuracy")
- **Consider data lifecycle**: creation, updates, archival, deletion (GDPR "right to erasure")
- **Reference architecture principles** from any `ARC-000-PRIN-*.md` file in `projects/000-global/` if they exist
- **Flag any DR-xxx requirements that cannot be modeled** (gaps for requirements clarification)
- **UK Government data projects**: The data model supports [National Data Strategy](https://www.gov.uk/government/publications/uk-national-data-strategy/national-data-strategy) alignment — Data Foundations pillar (metadata, standards, quality) and Availability pillar (data access, sharing). The Data Quality Framework section maps to the [Government Data Quality Framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework) 6 dimensions. See `docs/guides/national-data-strategy.md` and `docs/guides/data-quality-framework.md` for full mappings.

- **Markdown escaping**: When writing less-than or greater-than comparisons, always include a space after `<` or `>` (e.g., `< 3 seconds`, `> 99.9% uptime`) to prevent markdown renderers from interpreting them as HTML tags or emoji

## Integration with Other Commands

- **Input**: Requires requirements document (`ARC-*-REQ-*.md`) for DR-xxx requirements
- **Input**: Uses stakeholder analysis (`ARC-*-STKE-*.md`) for data ownership RACI matrix
- **Input**: References SOBC (`ARC-*-SOBC-*.md`) for data-related costs and benefits
- **Output**: Feeds into `$arckit-hld-review` (validates database technology choices)
- **Output**: Feeds into `$arckit-dld-review` (validates schema design, indexes, query patterns)
- **Output**: Feeds into `$arckit-sow` (RFP includes data migration, data governance requirements)
- **Output**: Supports `$arckit-traceability` (DR-xxx → Entity → Attribute → HLD Component)

## Output Instructions

**CRITICAL - Token Efficiency**:

### 1. Generate Data Model

Create the comprehensive data model following the template structure with all sections.

### 2. Write Directly to File

**Use the Write tool** to create `projects/[PROJECT]/ARC-{PROJECT_ID}-DATA-v1.0.md` with the complete data model.

**DO NOT** output the full document in your response. This would exceed token limits.

### 3. Show Summary Only

After writing the file, show ONLY a concise summary:

```markdown
## Data Model Complete ✅

**Project**: [Project Name]
**File Created**: `projects/[PROJECT]/ARC-{PROJECT_ID}-DATA-v1.0.md`

### Data Model Summary

**Entities**: [Number] entities modeled
- Core Entities: [List main entities, e.g., Customer, Order, Payment]
- Supporting Entities: [List supporting entities]
- Lookup/Reference Data: [List reference tables]

**Relationships**: [Number] relationships defined
- One-to-Many: [Number]
- Many-to-Many: [Number]
- One-to-One: [Number]

**Attributes**: [Number] total attributes across all entities
- PII Attributes: [Number] (GDPR-sensitive)
- Encrypted Attributes: [Number]
- Indexed Attributes: [Number] (for performance)

**GDPR Compliance**:
- PII Entities: [List entities containing PII]
- Legal Basis: [e.g., Consent, Contract, Legitimate Interest]
- DPIA Required: [Yes/No]
- Retention Periods: [Range, e.g., 6 months to 7 years]

**Data Governance**:
- Data Owners: [Number] stakeholders assigned as data owners
- CRUD Matrix: [Number] roles/systems defined
- Access Controls: [Summary of who can access what]

**Compliance Requirements**:
- [List: GDPR, PCI-DSS, HIPAA, SOX, etc. as applicable]

**Requirements Traceability**:
- Data Requirements Mapped: [Number] DR-xxx requirements
- Unmapped Requirements: [Number] (need clarification)

### What's in the Document

- Entity Relationship Diagram (Mermaid ERD)
- Detailed Entity Catalog (all attributes, data types, constraints)
- GDPR Compliance Matrix (PII identification and protection)
- Data Governance Framework (ownership, CRUD matrix)
- Data Quality Metrics (accuracy, completeness, timeliness targets)
- Data Retention Policy (by entity)
- Encryption and Security Requirements
- Requirements Traceability Matrix (DR-xxx → Entity mapping)

### Next Steps

- Review `ARC-{PROJECT_ID}-DATA-v1.0.md` for full ERD and entity details
- Validate with data owners and stakeholders
- Run `$arckit-research` to research database technologies
- Run `$arckit-hld-review` after HLD is created
```

**Statistics to Include**:

- Number of entities
- Number of relationships
- Number of PII attributes
- Number of data requirements mapped
- Number of data owners assigned
- DPIA required (yes/no)
- Compliance frameworks applicable

Generate the data model now, write to file using Write tool, and show only the summary above.

## Suggested Next Steps

After completing this command, consider running:

- `$arckit-hld-review` -- Validate database technology choices
- `$arckit-dld-review` -- Validate schema design and query patterns
- `$arckit-sow` -- Include data migration and governance in RFP
- `$arckit-traceability` -- Map DR-xxx to entities and attributes