generate-idea
$
npx mdskill add GRIND-Lab-Core/night_owl_research_agent/generate-ideaGenerate publishable research ideas for: $ARGUMENTS
SKILL.md
.github/skills/generate-ideaView on GitHub ↗
---
name: generate-idea
description: Generate and rank research ideas given a broad direction. Use when user input "brainstorm ideas", "generate research ideas", "what can we work on", or wants to explore a research area for publishable directions.
argument-hint: [research-direction]
allowed-tools: Bash(*), Read, Write, Grep, Glob, WebSearch, WebFetch, Agent
---
# Research Idea Generator
Generate publishable research ideas for: $ARGUMENTS
## Overview
Given a broad research direction from the user, systematically generate, validate, and rank concrete research ideas. This skill composes with `/lit-review`, `/novelty-check`, and `/research-review` to form a complete idea discovery pipeline.
## Constants
- **PILOT_MAX_HOURS = 2** — Skip any pilot estimated to take > 2 hours per GPU. Flag as "needs manual pilot".
- **PILOT_TIMEOUT_HOURS = 3** — Hard timeout: kill pilots exceeding 3 hours. Collect partial results if available.
- **MAX_PILOT_IDEAS = 3** — Pilot at most 3 ideas in parallel. Additional ideas are validated on paper only.
- **MAX_TOTAL_GPU_HOURS = 8** — Total GPU budget for all pilots combined.
- **REVIEWER_MODEL = `gpt-5.4`** — Model used via Codex MCP for brainstorming and review. Must be an OpenAI model (e.g., `gpt-5.4`, `o3`, `gpt-4o`).
> 💡 Override via argument, e.g., `/generate-idea "topic" — pilot budget: 4h per idea, 20h total`.
## Workflow
### Phase 1: Get Literature Reivew Results
Map the research area to understand what exists and where the gaps are.
If `output/LIT_REVIEW_REPORT.md` exists (with Synthesis and Gap Analysis sections), read it and skip Phase 1 silently. If not, do the following:
1. **Scan local paper library first**: Check `papers/` in the project directory for existing PDFs. Read first 3 pages of relevant papers to build a baseline understanding before searching online. This avoids re-discovering what the user already knows.
2. **Search recent literature** using WebSearch:
- Top venues in the last 2 years
- Recent arXiv preprints (last 6 months)
- Use 5+ different query formulations
- Read abstracts and introductions of the top 10-15 papers
2. **Build a landscape map**:
- Group papers by sub-direction / approach
- Identify what has been tried and what hasn't
- Note recurring limitations mentioned in "Future Work" sections
- Flag any open problems explicitly stated by multiple papers
3. **Identify structural gaps**:
- Methods that work in domain A but haven't been tried in domain B
- Contradictory findings between papers (opportunity for resolution)
- Assumptions that everyone makes but nobody has tested
- Scaling regimes that haven't been explored
- Diagnostic questions that nobody has asked
### Phase 2: Idea Generation (brainstorm with external LLM)
Use the external LLM via Codex MCP for divergent thinking. If external LLM is not configured properly, use subagent with the most powerful model instead.
```
mcp__codex__codex:
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are a senior ML researcher brainstorming research ideas.
Research direction: [user's direction]
Here is the current landscape:
[paste landscape map from Phase 1]
Key gaps identified:
[paste gaps from Phase 1]
Generate 8-12 concrete research ideas. For each idea:
1. One-sentence summary
2. Core hypothesis (what you expect to find and why)
3. Minimum viable experiment (what's the cheapest way to test this?)
4. Expected contribution type: empirical finding / new method / theoretical result / diagnostic
5. Risk level: LOW (likely works) / MEDIUM (50-50) / HIGH (speculative)
6. Estimated effort: days / weeks / months
Prioritize ideas that are:
- Testable with moderate compute (8x RTX 3090 or less)
- Likely to produce a clear positive OR negative result (both are publishable)
- Not "apply X to Y" unless the application reveals genuinely surprising insights
- Differentiated from the 10-15 papers above
Be creative but grounded. A great idea is one where the answer matters regardless of which way it goes.
```
Save the threadId for follow-up.
### Phase 3: First-Pass Filtering
For each generated idea, quickly evaluate:
1. **Feasibility check**: Can we actually run this experiment with available resources?
- Compute requirements (estimate GPU-hours)
- Data availability
- Implementation complexity
- Skip ideas requiring > 1 week of GPU time or unavailable datasets
2. **Novelty quick-check**: For each idea, do 2-3 targeted searches to see if it's already been done. Full `/novelty-check` comes later for survivors.
3. **Impact estimation**: Would a reviewer care about the result?
- "So what?" test: if the experiment succeeds, does it change how people think?
- Is the finding actionable or just interesting?
Eliminate ideas that fail any of these. Typically 8-12 ideas reduce to 4-6.
### Phase 4: Deep Validation (for top ideas)
For each surviving idea, run a deeper evaluation:
1. **Novelty check**: Use the `/novelty-check` workflow (multi-source search + GPT-5.4 cross-verification) for each idea
2. **Critical review**: Use GPT-5.4 via `mcp__codex__codex-reply` (same thread):
```
Here are our top ideas after filtering:
[paste surviving ideas with novelty check results]
For each, play devil's advocate:
- What's the strongest objection a reviewer would raise?
- What's the most likely failure mode?
- How would you rank these for a top venue submission?
- Which 2-3 would you actually work on?
```
3. **Combine rankings**: Merge your assessment with GPT-5.4's ranking. Select top 2-3 ideas for pilot experiments.
### Phase 5: Parallel Pilot Experiments (for top 2-3 ideas)
Before committing to a full research effort, run cheap pilot experiments to get empirical signal. This is the key differentiator from paper-only validation.
**Phase 5.0: Mandatory Local GPU Availability Check (REQUIRED before any pilot launch)**
Before designing or launching any pilot, run the local GPU presence check. The result decides whether pilots run locally on GPU, route to remote/Modal, or skip with a clear note.
```bash
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader
```
Fallback to MPS only if `nvidia-smi` is missing or returns zero rows:
```bash
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
```
Decision rules:
- `nvidia-smi` exits 0 with ≥ 1 GPU row → set `LOCAL_GPU=cuda`. **All pilots MUST run on the local GPU(s)**; do not launch on CPU and do not silently route to remote.
- Else MPS available → set `LOCAL_GPU=mps`. Pilots that fit MPS run locally; flag any pilot that requires CUDA-only ops as "needs CUDA" instead of falling back to CPU.
- Else `LOCAL_GPU=none` → only then route to remote/Modal as configured in `CLAUDE.md`, or mark the pilots as "needs pilot validation" and skip launching.
Record `LOCAL_GPU` and the device IDs assigned to each pilot in the `Pilot Experiment Results` table of `output/IDEA_REPORT.md`.
1. **Design pilots**: For each top idea, define the minimal experiment that would give a positive or negative signal:
- Single seed, small scale (e.g., small dataset subset, fewer epochs)
- Target: 30 min - PILOT_MAX_HOURS per pilot on 1 GPU
- **Estimate GPU-hours BEFORE launching.** If estimated time > PILOT_MAX_HOURS, reduce scale (fewer epochs, smaller subset) or flag as "needs manual pilot"
- Clear success metric defined upfront (e.g., "if metric improves by > 1%, signal is positive")
2. **Deploy in parallel**: Use `/deploy-experiment` to launch pilots. **When `LOCAL_GPU=cuda` or `mps`, pilots MUST be launched on the local GPU** — never CPU, never silently routed to remote unless `LOCAL_GPU=none`. With multiple local GPUs, distribute pilots across them:
```
GPU 0: Pilot for Idea 1
GPU 1: Pilot for Idea 2
GPU 2: Pilot for Idea 3
```
Use `run_in_background: true` to launch all at once. The downstream `/deploy-experiment` Step 0 will re-verify `LOCAL_GPU` — both checks must agree.
3. **Collect results**: Use `/monitor-experiment` to check progress. If any pilot exceeds PILOT_TIMEOUT_HOURS, kill it and collect partial results. Once all pilots complete (or timeout), compare:
- Which ideas showed positive signal?
- Which showed null/negative results? (eliminate or deprioritize)
- Any surprising findings that suggest a pivot?
- Total GPU-hours consumed (track against MAX_TOTAL_GPU_HOURS budget)
4. **Re-rank based on empirical evidence**: Update the idea ranking using pilot results. An idea with strong pilot signal jumps ahead of a theoretically appealing but untested idea.
Note: Skip this phase if the ideas are purely theoretical, or if Phase 5.0 returned `LOCAL_GPU=none` AND no remote/Modal GPU is configured. Flag skipped ideas as "needs pilot validation" in the report. **Never run a pilot on CPU when a local GPU is available.**
### Phase 6: Output — Ranked Idea Report
Write a structured report to `output/IDEA_REPORT.md` in the project root:
```markdown
# Research Idea Report
**Direction**: [user's research direction]
**Generated**: [date]
**Ideas evaluated**: X generated → Y survived filtering → Z piloted → W recommended
## Landscape Summary
[3-5 paragraphs on the current state of the field]
## Recommended Ideas (ranked)
### Idea 1: [title]
- **Hypothesis**: [one sentence]
- **Minimum experiment**: [concrete description]
- **Expected outcome**: [what success/failure looks like]
- **Novelty**: X/10 — closest work: [paper]
- **Feasibility**: [compute, data, implementation estimates]
- **Risk**: LOW/MEDIUM/HIGH
- **Contribution type**: empirical / method / theory / diagnostic
- **Pilot result**: [POSITIVE: metric +X% / NEGATIVE: no signal / SKIPPED: needs GPU]
- **Reviewer's likely objection**: [strongest counterargument]
- **Why we should do this**: [1-2 sentences]
### Idea 2: [title]
...
## Eliminated Ideas (for reference)
| Idea | Reason eliminated |
|------|-------------------|
| ... | Already done by [paper] |
| ... | Requires > 1 week GPU time |
| ... | Result wouldn't be interesting either way |
## Pilot Experiment Results
| Idea | GPU | Time | Key Metric | Signal |
|------|-----|------|------------|--------|
| Idea 1 | GPU 0 | 45 min | +2.3% CE | POSITIVE |
| Idea 2 | GPU 1 | 30 min | -0.1% CE | NEGATIVE |
| Idea 3 | GPU 2 | 1.5 hr | +0.8% CE | WEAK POSITIVE |
## Suggested Execution Order
1. Start with Idea 1 (positive pilot signal, lowest risk)
2. Idea 3 as backup (weak signal, may need larger scale to confirm)
3. Idea 2 eliminated by pilot — negative result documented
## Next Steps
- [ ] Scale up Idea 1 to full experiment (multi-seed, full dataset)
- [ ] If confirmed, invoke /auto-review-loop for full iteration
```
## Key Rules
- **Large file handling**: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- The user provides a DIRECTION, not an idea. Your job is to generate the ideas.
- Quantity first, quality second: brainstorm broadly, then filter ruthlessly.
- A good negative result is just as publishable as a positive one. Prioritize ideas where the answer matters regardless of direction.
- Don't fall in love with any idea before validating it. Be willing to kill ideas.
- Always estimate compute cost. An idea that needs 1000 GPU-hours is not actionable for most researchers.
- "Apply X to Y" is the lowest form of research idea. Push for deeper questions.
- Include eliminated ideas in the report — they save future time by documenting dead ends.
- **If the user's direction is too broad (e.g., "NLP", "computer vision", "reinforcement learning"), STOP and ask them to narrow it.** A good direction is 1-2 sentences specifying the problem, domain, and constraint — e.g., "factorized gap in discrete diffusion LMs" or "sample efficiency of offline RL with image observations". Without sufficient specificity, generated ideas will be too vague to run experiments on.
## Composing with Other Skills
After this skill produces the ranked report:
```
/generate-idea "direction" → current step. Ranked ideas
/novelty-check "top idea" → deep novelty verification (already done in Phase 4, but user can re-run)
/research-review "top idea" → external critical feedback
implement → write code
/deploy-experiment → deploy to GPU
/auto-review-loop → iterate until submission-ready
```
More from GRIND-Lab-Core/night_owl_research_agent
- data-downloadDiscover, evaluate, and download publicly available datasets from the internet. Infers data needs from a research question or task, selects authoritative sources, downloads reproducibly, validates file integrity, and documents provenance. Pauses for user input when authentication, API keys, or major tradeoffs require a decision. Use when user says "download data", "get data", "find a dataset", "I need boundary files", "download census data", or needs any external dataset for analysis.
- deploy-experimentDeploy and run experiments for ML/DL training (local, remote, or Modal GPU) AND spatial data science / GIScience experiments (local, data-driven). Reads from output/refine-logs/EXPERIMENT_PLAN.md and output/refine-logs/FINAL_PROPOSAL.md, writes to output/experiment/. Use when user says "run experiment", "deploy experiment", "execute experiment plan", or needs to launch training / spatial analysis jobs.
- experiment-design-pipelineRun an end-to-end workflow that chains the skills `refine-research` and `experiment-design`. Use when the user wants a one-shot pipeline from vague research direction to focused final proposal plus detailed experiment roadmap, or asks to build a pipeline, do it end-to-end, or generate both the method and experiment plan together.
- full-pipelineComplete 4-stage end-to-end research pipeline. Orchestrates idea-discovery-pipeline → deploy-experiment → auto-review-loop → generate-report. Reads RESEARCH_PLAN.md (or BRIEF.md as fallback) for context that overrides $ARGUMENTS.
- idea-discovery-pipelineThe full pipeline for idea generation. It generates 8-12 novel research ideas from literature gaps and evaluates each on novelty, feasibility, and domain fit. Orchestrates lit-review → generate-idea → novelty-check → idea-review → experiment-design-pipeline to go from a broad research direction to a validated, pilot-tested idea with a refined proposal and experiment plan. Produces output/IDEA_REPORT.md plus refinement and experiment artifacts.
- lit-reviewRetrieves papers from local folder or ArXiv and Semantic Scholar using domain-aware keyword expansion, builds synthesis matrix, identifies gaps. Calls tools/arxiv_fetch.py and tools/semantic_scholar_fetch.py. Writes to output/paper-cache/ and output/LIT_REVIEW_REPORT.md.
- paper-covertConverts the final Markdown manuscript from `paper-draft` / `paper-review-loop` into a submission package for the target venue — modular LaTeX (one file per section), compiled PDF, and Word `.docx`. Venue is read from `output/PAPER_PLAN.md` (or argument) and routed through a small YAML profile. Does not rewrite prose, score, or invent citations.
- paper-draftTransforms output/PAPER_PLAN.md into a journal-quality Markdown manuscript draft for GIScience, GeoAI, spatial data science, and remote sensing venues (IJGIS, ISPRS JPRS, RSE, TGIS, AAG Annals). Consults referenced literature, experiment, figure, and claim artifacts; supports full drafts, partial drafts, and skeleton drafts depending on readiness. Never fabricates results, metrics, or citations — produces a claim-to-evidence map and coverage-gap report alongside the manuscript.
- paper-figure-generateGenerates publication-quality figures and diagrams from output/PAPER_PLAN.md for GIScience, GeoAI, and remote sensing journals (IJGIS, ISPRS JPRS, RSE, TGIS). Decides per-figure whether to produce reproducible code-generated plots/maps or structured prompts for external image-generation models (nano banana, ChatGPT image). Produces figure files, source scripts, captions, manifest, and prompt artifacts. Never fabricates results — uses only evidence from project files.
- paper-review-loopReviews the manuscript produced by `paper-draft` (in `output/manuscript/`) as a demanding IJGIS / ISPRS JPRS reviewer-editor, cross-checks it against `output/PAPER_PLAN.md` and its evidence artifacts, then revises it into a stronger draft. Produces a reviewed manuscript, a revised manuscript, a structured review report, a prioritized issue log, a revision log, claim-risk notes, journal-fit notes, and next-loop priorities. Supports full, section-scoped, and mode-scoped review (structural / argument / novelty / methods / results-discussion / journal-fit / language / integrated). Safe on partial or skeletal drafts. Never fabricates results, citations, or figures.