knowledge-graph-construction

Name: knowledge-graph-construction
Author: lyndonkl/claude

$ npx mdskill add lyndonkl/claude/knowledge-graph-construction

Constructs knowledge graphs from unstructured data using a layered architecture.

Selects optimal graph models like LPG, RDF, or hypergraphs for specific domains.
Designs entity and relation extraction pipelines to process raw inputs.
Aligns ontologies and validates graph quality before deployment.
Delivers structured schemas and ready-to-use graphs for RAG or reasoning.

SKILL.md

.github/skills/knowledge-graph-construction

---
name: knowledge-graph-construction
description: Designs and builds knowledge graphs from unstructured or semi-structured data sources. Guides through data model selection (LPG, RDF, hypergraph, temporal), schema design, entity/relation extraction pipelines, and layered architecture construction. Use when designing knowledge graphs, choosing between LPG vs RDF, planning entity extraction, designing graph schemas, aligning ontologies, building a KG for RAG, or when user mentions knowledge graph construction.
---

## Table of Contents
- [Workflow](#workflow)
- [Architecture Selection Guide](#architecture-selection-guide)
- [Schema Patterns](#schema-patterns)
- [Output Template](#output-template)

# Knowledge Graph Construction

## Workflow

**Copy this checklist** and work through each step:

```
KG Construction Progress:
- [ ] Step 1: Identify data sources and domain scope
- [ ] Step 2: Select graph data model
- [ ] Step 3: Design schema and ontology
- [ ] Step 4: Configure extraction pipeline
- [ ] Step 5: Define layered architecture
- [ ] Step 6: Validate and quality-check the graph
```

**Step 1: Identify data sources and domain scope**

Catalog the input data: document types (papers, clinical notes, web pages, logs), volume, update frequency, and language. Define the domain boundary: which entity types and relation types matter for the target use case. Determine whether the KG will serve RAG retrieval, reasoning/inference, analytics, or a combination. This scoping step prevents over-extraction and keeps the schema focused.

**Step 2: Select graph data model**

Choose the underlying data model using the [Architecture Selection Guide](#architecture-selection-guide). Key trade-offs: LPG for flexibility and rapid prototyping, RDF/OWL for standards-based interoperability and inference, hypergraphs for complex N-ary relations, and temporal graphs for time-evolving knowledge. Also weigh query language, tooling maturity, and vector integration needs. For detailed model comparisons, see [Data Models Reference](./resources/data-models.md).

**Step 3: Design schema and ontology**

Define node types (entity classes), edge types (relation classes), and property schemas. Apply patterns from [Schema Patterns](#schema-patterns): entity-relation for simple domains, event reification for N-ary relations, layered tiers for multi-source integration. Decide on controlled vocabularies, cardinality constraints, and whether to adopt or extend an existing ontology (e.g., Schema.org, UMLS, SNOMED). For methodology details, see [Methodology Reference](./resources/methodology.md).
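A declared schema is easier to enforce when it lives in code rather than only in prose. The following is a minimal sketch, assuming a plain in-memory representation; `NodeType`, `EdgeType`, `missing_properties`, and `edge_allowed` are hypothetical names, and the two node types simply mirror the Drug/Disease example in Schema Patterns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    label: str
    required: frozenset  # property names every node of this type must carry


@dataclass(frozen=True)
class EdgeType:
    label: str
    source: str  # label of the allowed source node type
    target: str  # label of the allowed target node type


# Hypothetical schema; replace with the node and edge types of your domain.
NODE_TYPES = {
    "Drug": NodeType("Drug", frozenset({"name"})),
    "Disease": NodeType("Disease", frozenset({"name", "icd_code"})),
}
EDGE_TYPES = {
    "TREATS": EdgeType("TREATS", source="Drug", target="Disease"),
}


def missing_properties(label, props):
    """Return the required property names that are absent from props."""
    return set(NODE_TYPES[label].required) - set(props)


def edge_allowed(edge_label, source_label, target_label):
    """True when the edge type is declared and connects the permitted node types."""
    edge = EDGE_TYPES.get(edge_label)
    return edge is not None and (edge.source, edge.target) == (source_label, target_label)
```

Cardinality constraints and controlled vocabularies could be added as further fields on these records; they are left out here to keep the sketch short.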

**Step 4: Configure extraction pipeline**

Build the pipeline that populates the graph. Core components: LLM-assisted entity extraction with multi-round verification, relation extraction via prompt-based or dependency-parsing methods, entity normalization (synonym merging, ontology linking), and schema enforcement through post-processing validation. Use few-shot examples in prompts to improve extraction consistency. Run a second LLM pass over the same text to catch entities the first pass missed. For full pipeline design, see [Methodology Reference](./resources/methodology.md).
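The normalization and schema-enforcement stages can be sketched independently of any particular LLM. In the sketch below the extractor is passed in as a plain callable, so no model API is assumed; `SYNONYMS`, `ALLOWED_RELATIONS`, and all function names are hypothetical. Repeating the extractor and merging the results is only a crude stand-in for multi-round verification.

```python
# Hypothetical synonym table and relation whitelist; real ones come from the target ontology.
SYNONYMS = {"acetylsalicylic acid": "aspirin", "asa": "aspirin"}
ALLOWED_RELATIONS = {"TREATS", "CAUSES"}


def normalize_entity(name):
    """Lowercase, collapse whitespace, then map known synonyms to a canonical name."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, key)


def enforce_schema(triples):
    """Normalize entities, drop relations outside the schema, deduplicate in order."""
    seen = set()
    kept = []
    for subj, rel, obj in triples:
        triple = (normalize_entity(subj), rel.upper(), normalize_entity(obj))
        if triple[1] in ALLOWED_RELATIONS and triple not in seen:
            seen.add(triple)
            kept.append(triple)
    return kept


def run_pipeline(text, extract, rounds=2):
    """Call the extractor `rounds` times, merge all triples, then enforce the schema.

    `extract` is any callable mapping text to a list of
    (subject, relation, object) tuples, for example a wrapper around an LLM prompt.
    """
    merged = []
    for _ in range(rounds):
        merged.extend(extract(text))
    return enforce_schema(merged)
```

Keeping enforcement as a separate post-processing function means the same checks apply whether triples come from an LLM, a dependency parser, or a manual import.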

**Step 5: Define layered architecture**

Structure the KG into tiers for maintainability and trust. A common pattern: Layer 1 (instance data) holds user-specific or case-specific entities and relations; Layer 2 (domain knowledge) holds curated facts from literature or domain experts; Layer 3 (canonical ontology) holds the formal schema and upper ontology. Add provenance and evidence layering so every fact traces back to its source document, extraction method, and confidence score. Temporal subgraphs capture time-indexed state for domains where knowledge evolves.
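One way to make the layers and provenance concrete is to attach them to every fact as fields. This is a minimal sketch under that assumption; the `Layer` enum, the `Fact` record, and the `trusted` filter are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Layer(IntEnum):
    INSTANCE = 1  # user-specific or case-specific extractions
    DOMAIN = 2    # curated facts from literature or domain experts
    ONTOLOGY = 3  # formal schema and upper ontology


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    layer: Layer
    source_doc: str                   # document the fact was extracted from
    method: str                       # extraction method, e.g. "llm" or "manual"
    confidence: float                 # 0.0 to 1.0
    valid_from: Optional[str] = None  # ISO date for time-indexed facts


def trusted(facts, min_layer=Layer.INSTANCE, min_confidence=0.0):
    """Keep facts at or above the given layer and confidence threshold."""
    return [f for f in facts if f.layer >= min_layer and f.confidence >= min_confidence]
```

Ordering the layers as integers lets a query ask for "curated knowledge or better" with a single comparison.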

**Step 6: Validate and quality-check the graph**

Run validation at multiple levels: schema conformance (do all nodes and edges match declared types?), coverage (are expected entity types populated?), consistency (do any edges contradict each other?), and completeness (does a sample-based human review find missing facts?). Use a second LLM as a validator to fact-check extracted triples against source documents. Compute graph statistics (node degree distribution, connected components, orphan nodes) to identify extraction gaps. Quality criteria are defined in [Quality Rubric](./resources/evaluators/rubric_kg_construction.json).
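The graph statistics named above need no graph database to compute on a sample. The sketch below uses only the standard library and treats edges as undirected; `graph_stats` is a hypothetical helper, and node ids are assumed to be mutually comparable (for example, all strings).

```python
from collections import Counter, defaultdict


def graph_stats(nodes, edges):
    """Summarize a graph given node ids and (source, target) edge pairs."""
    nodes = set(nodes)
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    nodes |= set(neighbors)  # include endpoints that were not listed as nodes

    # Degree here counts distinct neighbors, ignoring edge direction.
    degree = {n: len(neighbors.get(n, ())) for n in nodes}

    # Connected components via iterative depth-first search.
    seen = set()
    component_sizes = []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors.get(node, set()) - component)
        seen |= component
        component_sizes.append(len(component))

    return {
        "degree_histogram": dict(Counter(degree.values())),
        "component_sizes": sorted(component_sizes, reverse=True),
        "orphans": sorted(n for n, d in degree.items() if d == 0),
    }
```

A long tail of single-node components or a large orphan list usually points to entities that were extracted without any relation, which is a cue to revisit the relation-extraction prompts.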

## Architecture Selection Guide

### By Use Case

| Model | Flexibility | Standardization | Reasoning | Vector Integration | Query Language | Best For |
|-------|-------------|-----------------|-----------|-------------------|----------------|----------|
| LPG | High | Low | Limited | Native (Neo4j) | Cypher, Gremlin | Rapid development, RAG pipelines |
| RDF/OWL | Medium | High | Full (OWL-DL) | Via extensions | SPARQL | Interoperability, ontology-heavy domains |
| Hypergraph | High | Low | Limited | Custom | Custom APIs | N-ary relations, multi-entity events |
| Temporal | Medium | Low | Time-based | Via extensions | Temporal Cypher | Evolving knowledge, episodic memory |

### By Domain

| Domain | Recommended Model | Rationale |
|--------|-------------------|-----------|
| Biomedical / Clinical | RDF/OWL | UMLS/SNOMED ontologies, reasoning needed |
| Enterprise / RAG | LPG | Fast iteration, vector search integration |
| Event-centric (news, logs) | Hypergraph or Temporal | Multi-participant events, time evolution |
| Legal / Compliance | RDF/OWL | Formal reasoning, provenance chains |
| Scientific Literature | LPG + Layered | Flexible extraction, layered trust |
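The domain table above can be encoded as a lookup so that a specification generator starts from a consistent default. This is only a sketch: the keys are hypothetical normalizations of the table rows, and falling back to LPG for unlisted domains is an assumption, not a recommendation from the table.

```python
# Hypothetical keys derived from the "By Domain" table rows.
RECOMMENDED_MODEL = {
    "biomedical": "RDF/OWL",
    "clinical": "RDF/OWL",
    "enterprise": "LPG",
    "rag": "LPG",
    "event-centric": "Hypergraph or Temporal",
    "legal": "RDF/OWL",
    "compliance": "RDF/OWL",
    "scientific literature": "LPG + Layered",
}


def recommend_model(domain, default="LPG"):
    """Return the table's recommendation, or `default` (an assumption) if unlisted."""
    return RECOMMENDED_MODEL.get(domain.strip().lower(), default)
```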

## Schema Patterns

### Entity-Relation Pattern

The simplest pattern. Nodes represent entities, edges represent binary relations. Properties on nodes hold attributes; properties on edges hold relation metadata (confidence, source, timestamp).

```
(:Person {name, role}) -[:WORKS_AT {since}]-> (:Organization {name, type})
(:Drug {name, class})  -[:TREATS {efficacy}]-> (:Disease {name, icd_code})
```

Best for: domains with primarily binary relationships and moderate complexity.

### Event Reification Pattern

Model N-ary relations and complex events as first-class nodes. An event node connects to all participants via typed role edges. This avoids information loss from forcing N-ary relations into binary edges.

```
(:ClinicalTrial {id, phase, start_date})
  -[:HAS_DRUG]->     (:Drug {name})
  -[:HAS_CONDITION]-> (:Disease {name})
  -[:HAS_OUTCOME]->   (:Outcome {measure, value})
  -[:CONDUCTED_BY]->  (:Organization {name})
```

Best for: events with multiple participants, clinical data, news events, financial transactions.
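As a sketch of how reification might look in a loader, the function below turns one N-ary record into an event node plus one typed role edge per participant. `reify_event` and the dictionary layout are hypothetical, and the trial id in the usage example is a placeholder rather than a real identifier.

```python
def reify_event(event_type, event_id, attributes, participants):
    """Build an event node and its role edges from one N-ary record.

    `participants` maps a role edge label to a (node_label, node_key) pair.
    """
    event_node = {"label": event_type, "id": event_id, **attributes}
    edges = [
        {"type": role, "source": event_id, "target": f"{label}:{key}"}
        for role, (label, key) in participants.items()
    ]
    return event_node, edges


node, edges = reify_event(
    "ClinicalTrial",
    "trial-001",  # placeholder id
    {"phase": 2},
    {
        "HAS_DRUG": ("Drug", "aspirin"),
        "HAS_CONDITION": ("Disease", "migraine"),
        "CONDUCTED_BY": ("Organization", "example-university"),
    },
)
```

Because every participant hangs off the event node, adding a fourth or fifth role later requires a new edge type but no change to existing edges.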

### Layered Tier Pattern

Separate the graph into trust-differentiated layers that can be queried independently or together.

```
Layer 3 (Canonical Ontology): Formal class hierarchy, relation definitions, constraints
Layer 2 (Domain Knowledge):   Curated facts from literature, expert-validated
Layer 1 (Instance Data):      Extracted from user documents, case-specific, lower confidence
```

Cross-layer edges link instances to domain concepts and domain concepts to ontology classes. Provenance metadata on every edge records: source document, extraction method, confidence score, and timestamp.

Best for: multi-source integration, RAG with trust scoring, enterprise knowledge management.
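Cross-layer edges can be illustrated by walking from an instance up to its ontology class. The sketch below assumes two hypothetical edge maps, `INSTANCE_OF` (Layer 1 to Layer 2) and `SUBCLASS_OF` (Layer 2 upward to Layer 3); the ids are made up for illustration.

```python
# Hypothetical cross-layer edges: instance -> domain concept -> ontology class.
INSTANCE_OF = {"patient_42_med_1": "aspirin"}
SUBCLASS_OF = {"aspirin": "NSAID", "NSAID": "Drug"}


def ontology_path(instance_id):
    """Walk from a Layer 1 instance through Layer 2 concepts up to the Layer 3 root."""
    path = [instance_id]
    seen = set()  # guards against accidental cycles in the class hierarchy
    current = INSTANCE_OF.get(instance_id)
    while current is not None and current not in seen:
        path.append(current)
        seen.add(current)
        current = SUBCLASS_OF.get(current)
    return path
```

A path of length one means the instance was never linked to a domain concept, which is a useful signal during validation.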

## Output Template

```
KNOWLEDGE GRAPH CONSTRUCTION SPECIFICATION
============================================

Domain: [Target domain and scope]
Use Case: [RAG / Reasoning / Analytics / Hybrid]
Data Sources: [List of input data types and volumes]

Data Model: [LPG / RDF / Hypergraph / Temporal]
Query Language: [Cypher / SPARQL / Gremlin / Custom]
Storage Backend: [Neo4j / Amazon Neptune / Virtuoso / etc.]

Schema Definition:
  Node Types:
  1. [EntityType] - [description]
     Properties: [list with types]
  2. [EntityType] - [description]
     Properties: [list with types]
  3. [Continue for each node type...]

  Edge Types:
  1. [RelationType] (source -> target) - [description]
     Properties: [list with types]
  2. [Continue for each edge type...]

  Constraints:
  - [Cardinality, uniqueness, required properties]

Extraction Pipeline:
  1. Entity Extraction
     - Method: [LLM-assisted / NER / Hybrid]
     - Prompt template: [summary or reference]
     - Verification: [Multi-round / Second-LLM / Manual sample]
  2. Relation Extraction
     - Method: [Prompt-based / Dependency parsing / Hybrid]
     - Few-shot examples: [count and source]
  3. Normalization
     - Deduplication: [method]
     - Ontology linking: [target ontology]
     - Synonym resolution: [approach]

Layered Architecture:
  Layer 1 (Instance): [description of instance-level data]
  Layer 2 (Domain):   [description of curated domain knowledge]
  Layer 3 (Ontology): [description of formal schema]
  Provenance: [How source/confidence/timestamp are tracked]

Validation Plan:
  - Schema conformance: [automated checks]
  - Coverage: [expected entity/relation counts]
  - Consistency: [contradiction detection method]
  - Human review: [sampling strategy]

Estimated Scale: [node count, edge count, properties per node]
Key Dependencies: [libraries, APIs, ontologies]

NEXT STEPS:
- Implement extraction pipeline on sample data
- Populate graph and run validation suite
- Iterate schema based on extraction results
- Integrate with downstream application (RAG, reasoning, etc.)
```
