construct-validity-assessment
$ npx mdskill add yogsoth-ai/de-anthropocentric-research-engine/construct-validity-assessment
Assesses if a benchmark accurately measures its claimed capability
- Evaluates alignment between benchmark tasks and intended capability
- Uses psychometric validity frameworks adapted for AI evaluation
- Analyzes task examples for content, convergent, and discriminant validity
- Produces a structured validity verdict with evidence for each dimension
SKILL.md
.github/skills/construct-validity-assessment
---
name: construct-validity-assessment
description: Evaluate whether benchmark measures its claimed capability
execution: subagent
prompt: ./prompt.md
input: benchmark_name, claimed_capability, task_examples
used-by: benchmark-archaeology
---

# Construct Validity Assessment SOP

Evaluate whether a benchmark actually measures the capability it claims to measure, using psychometric validity frameworks adapted for AI evaluation.

## Input

- **benchmark_name**: Name of the benchmark
- **claimed_capability**: What the benchmark authors claim it measures
- **task_examples**: Representative examples from the benchmark

## Procedure

1. Define the construct (claimed capability) precisely
2. Analyze task requirements: what skills are actually needed to solve the examples?
3. Assess content validity: do items representatively sample the construct?
4. Check convergent validity: correlation with other measures of the same construct
5. Check discriminant validity: independence from unrelated constructs
6. Identify construct-irrelevant variance (confounds)

## Output

Validity verdict with evidence for each validity dimension.
More from yogsoth-ai/de-anthropocentric-research-engine
- abductive-hypothesis-generation: Strategy: inference to the best explanation when facing anomalies
- ablation-brainstorm: Remove components one by one, observe system changes to reveal hidden dependencies and generate ideas from structural gaps.
- ablation-component-mapping: Map system architecture to ablatable units for ablation studies
- ablation-design: Design ablation studies to isolate component contributions in ML systems
- ablation-execution: Remove components one by one from a system, record the response/impact of each removal.
- abp-vulnerability-classification: Classify assumptions on 2 axes: load-bearing (how much the conclusion depends on it) × vulnerable (how likely it is to be false). Focuses attention on the High-Load × High-Vulnerable quadrant.
- abstraction-extraction: Extract abstract principles from concrete domain cases. Strips domain-specific details to reveal transferable mechanisms.
- abstraction-ladder: Perform bisociation at multiple abstraction levels
- abstraction-laddering: Move between concrete and abstract framings, 3 levels up (Why?) and 3 levels down (How?), to find the most productive research level.
- abstraction-to-design: Abstract biological principle to design principle. Bridge from biology to engineering.