farmer-setup

Name: farmer-setup
Author: huggingface/swarm-sweeper
$ npx mdskill add huggingface/swarm-sweeper/farmer-setup
Configure slop-farmer end-to-end with GitHub and Hugging Face.
Creates reproducible repo configs, auth, datasets, and dashboards.
Integrates GitHub CLI, Hugging Face CLI, and Python scripts.
Audits repo patterns and suggests automated cleanup strategies.
Delivers a fully functional scraping pipeline ready for deployment.
SKILL.md

.github/skills/farmer-setup
---
name: farmer-setup
description: Set up slop-farmer for a GitHub repository end-to-end. Use this whenever the user mentions slop-farmer setup, creating a repo config, choosing scrape limits, checking GitHub rate limits, cleaning PR template boilerplate, inspecting issue/PR title patterns with gh, installing or authenticating the Hugging Face hf CLI, publishing datasets, reproducing dashboards, or deploying the static dashboard to a Hugging Face Space. Be proactive even if the user only asks for one piece of the setup, because config, auth, dataset publishing, and dashboard deployment are tightly coupled here.
compatibility:
  tools: [Read, Write, Edit, Bash]
---

# Slop Farmer Setup

Help the user go from “I want to run slop-farmer on repo X” to a reproducible setup with:

- a working YAML config
- valid GitHub and Hugging Face authentication
- sensible scrape defaults
- repo-specific PR cleanup and suppression rules where possible
- a publishable dataset
- a reproducible static dashboard deployment to a Hugging Face Space

Use the bundled resources:

- Read `references/config-template.yaml` when drafting a config.
- Run `scripts/repo_setup_audit.py <owner/name>` when `gh` is available and the user wants repo-specific sizing or cleanup advice.
- Run `scripts/suggest_setup_config.py <owner/name>` when you want a stronger automated pass over repeated PR/issue body patterns, heuristic cleanup candidates, and an optional fast-agent synthesis branch.

## Important product truths

Be accurate about the current codebase:

- The preferred GitHub env var is `GITHUB_TOKEN`.
- `GH_TOKEN` and `GRAPHQL_TOKEN` are also accepted in code paths.
- If no env var is set, slop-farmer can often fall back to `gh auth token`.
- Hugging Face auth uses `HF_TOKEN` when it is set; otherwise it relies on the credentials stored by an existing `hf auth login`.
- Repo-specific YAML config support already exists via `--config`.
- Built-in template cleanup is currently **PR-focused** via `pull-requests.template_cleanup`.
- There is **not** an equivalent first-class issue-body template cleanup setting in the current config. If the user wants issue boilerplate removed, identify likely patterns and clearly label them as manual guidance or future work, not as a currently supported config knob, unless you verify otherwise.
- `deploy-dashboard` publishes a **static dashboard** to a Hugging Face Space. That is different from running the whole pipeline continuously inside a Space.
- `gh api rate_limit` is valid because `gh api` can call arbitrary GitHub REST endpoints, including `/rate_limit`.
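The GitHub token lookup described above can be sketched in Python as follows. This is a minimal illustration, not slop-farmer's actual code: the function names are hypothetical, and the order among `GH_TOKEN` and `GRAPHQL_TOKEN` is an assumption (the notes above only say that `GITHUB_TOKEN` is preferred).

```python
import os
import subprocess
from typing import Callable, Mapping, Optional

# Assumed precedence: only "GITHUB_TOKEN first" is stated in this document.
ENV_VARS = ("GITHUB_TOKEN", "GH_TOKEN", "GRAPHQL_TOKEN")


def gh_auth_token() -> Optional[str]:
    """Return the token printed by `gh auth token`, or None if gh is missing or logged out."""
    try:
        result = subprocess.run(
            ["gh", "auth", "token"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def resolve_github_token(
    env: Mapping[str, str] = os.environ,
    fallback: Callable[[], Optional[str]] = gh_auth_token,
) -> Optional[str]:
    """Return the first non-empty env token, else whatever the fallback yields."""
    for name in ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return fallback()
```

Passing `env` and `fallback` explicitly keeps the lookup easy to exercise without touching the real environment or the `gh` CLI.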

## Default response shape

Unless the user asks for something narrower, structure your answer like this:

1. **What I checked**
2. **Recommended config**
3. **Auth setup**
4. **Boilerplate / suppression recommendations**
5. **Run commands**
6. **Space + robustness plan**
7. **Open questions / follow-ups**

Always include a copy-pasteable YAML config and exact shell commands when possible.

## Workflow

### 1) Capture the minimum setup intent

Quickly determine:

- target repo (`owner/name`)
- whether this is a smoke test, recent-history bootstrap, or recurring production-ish run
- whether the user wants only local setup, or also dataset publishing and dashboard deployment
- whether `gh`, `hf`, `uv`, and `npm` are available
- whether they already have GitHub and Hugging Face tokens/logins

If the user already named a repo, move directly into inspection instead of asking generic questions.

### 2) Verify prerequisites

Prefer short, concrete checks.

Recommended checks:

```bash
uv --version
gh --version
hf version
npm --version
```
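If you would rather check all four tools in one pass, a small Python helper can report which ones are on `PATH`. This is only a sketch; `find_tools` and `missing_tools` are hypothetical names, and the check says nothing about tool versions.

```python
import shutil
from typing import Dict, Iterable, List

REQUIRED_TOOLS = ("uv", "gh", "hf", "npm")


def find_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, bool]:
    """Map each tool name to whether an executable with that name is on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def missing_tools(names: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the sorted names of tools that were not found."""
    return sorted(name for name, found in find_tools(names).items() if not found)
```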

If `hf` is missing or old, recommend:

```bash
python -m pip install -U huggingface_hub
hf version
```

If the user prefers uv-managed tools, it is also reasonable to suggest:

```bash
uv tool install --upgrade huggingface_hub
```

Then confirm auth with one of:

```bash
hf auth whoami
test -n "$HF_TOKEN" && echo "HF_TOKEN is set"   # avoids printing the secret
```

If neither works, recommend:

```bash
hf auth login
```

For headless environments, prefer `HF_TOKEN` over interactive login.

### 3) Set up GitHub auth correctly

If the user says “GITHUB_ACCESS_TOKEN” or similar, clarify that the main supported variable here is:

```bash
export GITHUB_TOKEN=...
```

Useful verification commands:

```bash
test -n "$GITHUB_TOKEN" && echo "GITHUB_TOKEN is set"   # avoids printing the secret
gh auth status
gh auth token >/dev/null && echo "gh token available"
gh api rate_limit
```
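To turn the `gh api rate_limit` output into a quick headroom summary, something like the following sketch may help. It reads the `resources.core` object that the GitHub REST `/rate_limit` endpoint returns; the function name and the shape of the returned summary are illustrative choices.

```python
import json
from datetime import datetime, timezone


def core_headroom(rate_limit_json: str) -> dict:
    """Summarize the `core` bucket from `gh api rate_limit` output."""
    core = json.loads(rate_limit_json)["resources"]["core"]
    reset_at = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
    return {
        "remaining": core["remaining"],
        "limit": core["limit"],
        # Guard against a zero limit so the division cannot fail.
        "fraction_left": core["remaining"] / core["limit"] if core["limit"] else 0.0,
        "resets_at": reset_at.isoformat(),
    }
```

Feed it the captured stdout of `gh api rate_limit`, then decide whether a larger scrape should wait until the reset time.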

When advising on token creation:

- Prefer a fine-grained PAT when possible.
- For public repos, read access to repo contents/metadata, issues, and pull requests is usually the right starting point.
- For private repos, say the token must cover that repo.
- Do not imply write access is required for GitHub scraping.

### 4) Inspect the target repo before choosing defaults

If `gh` is available, audit the repo before finalizing config.

Preferred method:

```bash
uv run skills/farmer-setup/scripts/repo_setup_audit.py owner/name
```

For a richer automation pass that also inspects recent PR/issue bodies and can optionally prepare or run a fast-agent synthesis prompt:

```bash
uv run skills/farmer-setup/scripts/suggest_setup_config.py owner/name --prepare-fast-agent
```

If you do not use the script, gather equivalent information manually:

```bash
gh api rate_limit
gh api repos/OWNER/REPO
gh pr list -R OWNER/REPO --state all --limit 100 --json number,title,createdAt,url
gh issue list -R OWNER/REPO --state all --limit 100 --json number,title,createdAt,url
gh api repos/OWNER/REPO/contents/.github
```

For template discovery, also inspect the standard GitHub template locations on the default branch:

```bash
gh api repos/OWNER/REPO/contents/.github/PULL_REQUEST_TEMPLATE.md
gh api repos/OWNER/REPO/contents/.github/pull_request_template.md
gh api repos/OWNER/REPO/contents/PULL_REQUEST_TEMPLATE.md
gh api repos/OWNER/REPO/contents/docs/PULL_REQUEST_TEMPLATE.md
gh api repos/OWNER/REPO/contents/.github/PULL_REQUEST_TEMPLATE
gh api repos/OWNER/REPO/contents/.github/ISSUE_TEMPLATE
gh api repos/OWNER/REPO/contents/.github/ISSUE_TEMPLATE/config.yml
```

Also remember GitHub may inherit community health files from the owner-level `.github` repo. If repo-local templates are absent, check:

```bash
gh api repos/OWNER/.github/contents/.github/PULL_REQUEST_TEMPLATE.md
gh api repos/OWNER/.github/contents/.github/ISSUE_TEMPLATE
gh api repos/OWNER/.github/contents/.github/ISSUE_TEMPLATE/config.yml
```

Look for:

- remaining GitHub core rate limit
- how fast recent issues/PRs are arriving
- whether PR templates or issue templates exist locally or via owner-level community health defaults
- repeated PR title prefixes like `[foo]`, `release`, `docs`, `bump version`, `post-release`
- boilerplate headings in PR templates that should become cleanup patterns

### 5) Choose scrape defaults conservatively

Do **not** jump straight to an unbounded historical scrape unless the user explicitly wants it and has plenty of rate-limit headroom.

Use a staged recommendation:

#### Stage A: smoke test

Use this when auth is unverified, the repo is high-volume, or the user wants a quick first pass.

Typical recommendation:

```bash
uv run slop-farmer scrape \
  --repo OWNER/REPO \
  --output-dir runs/REPO/data \
  --max-issues 200 \
  --max-prs 50 \
  --fetch-timeline
```

#### Stage B: recent bootstrap

Use this for high-volume repos after the smoke test succeeds.

Typical recommendation:

- `fetch-timeline: true`
- `issue-max-age-days: 30` to `60`
- `pr-max-age-days: 14` to `30`
- optionally cap `max-issues` and `max-prs` for the first real run
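To sanity-check an age window before committing to it, you can estimate how many items it is likely to cover from the `createdAt` values in a recent `gh issue list` or `gh pr list` JSON listing. This is a rough sketch with hypothetical function names; it assumes arrivals are roughly uniform over the sampled span.

```python
import json
from datetime import datetime


def items_per_day(gh_json: str) -> float:
    """Estimate arrival rate from the createdAt span of a `gh ... --json createdAt` listing."""
    stamps = sorted(
        # gh prints timestamps like 2024-01-01T00:00:00Z; make the offset explicit.
        datetime.fromisoformat(item["createdAt"].replace("Z", "+00:00"))
        for item in json.loads(gh_json)
    )
    if len(stamps) < 2:
        return 0.0
    span_days = (stamps[-1] - stamps[0]).total_seconds() / 86400
    return (len(stamps) - 1) / span_days if span_days > 0 else 0.0


def expected_items(gh_json: str, max_age_days: int) -> int:
    """Project the arrival rate over an age window such as issue-max-age-days."""
    return round(items_per_day(gh_json) * max_age_days)
```

If the projection for a 30-day window is far above what the remaining rate limit can support, start with a shorter window or explicit caps.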

#### Stage C: recurring incremental runs

Once the user has a healthy snapshot directory, recommend resuming from the watermark and dropping most manual caps unless the repo is extremely busy.

### 6) Config authoring rules

Start from `references/config-template.yaml`.

Good defaults to lean toward:

- `workspace: runs/<repo-slug>`
- `scrape.fetch-timeline: true`
- `dashboard.window_days: 60`
- `dashboard.contributor_window_days: 60`
- `dashboard.contributor_max_authors: 0`
- `analysis.max_clusters: 10`

For analysis backend:

- Prefer `deterministic` if the user has not configured a fast-agent/provider model key yet.
- Use `hybrid` only if the surrounding model setup is already in place or the user explicitly wants it.

When you generate config, explain each non-obvious field in one line.

### 7) Boilerplate cleanup and suppression guidance

Separate three different things clearly:

1. **PR body cleanup** — currently supported via `pull-requests.template_cleanup`
2. **PR cluster suppression** — currently supported via `pull-requests.cluster_suppression_rules`
3. **Issue boilerplate cleanup** — currently mostly advisory/manual unless you verify code support

For PR cleanup:

- Convert template headings into regexes for `section_patterns` or `line_patterns`
- Do **not** strip `Summary` by default just because it is common. In some repos, including OpenClaw, the summary section carries the most PR-specific signal and stripping it can worsen PR-text over-clustering.
- Prefer stripping low-signal checklist/compliance sections like change-type, scope, review/verification, risk, migration, and recovery blocks before removing substantive summary/problem text.
- Keep `mode: merge_defaults` unless the repo needs a total replacement
- Keep `strip_html_comments: true`
- Keep `trim_closing_reference_prefix: true`
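As a starting point for converting template headings into regexes, the sketch below builds one anchored pattern per Markdown heading and skips headings you want to keep. It is an illustration under assumptions: `heading_patterns` is a hypothetical helper, and whether `section_patterns` expects exactly this regex shape should be checked against the config template before use.

```python
import re


def heading_patterns(template_markdown: str, keep=("summary",)) -> list:
    """Build one anchored regex per Markdown heading, skipping headings listed in `keep`."""
    patterns = []
    for line in template_markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.+?)\s*#*\s*$", line)
        if not match:
            continue
        title = match.group(1)
        if title.lower() in keep:
            continue  # high-signal sections stay in the PR body
        patterns.append(rf"^#{{1,6}}\s*{re.escape(title)}\s*$")
    return patterns
```

Review the generated list by hand and drop any pattern whose section carries PR-specific content in the target repo.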

For suppression rules:

- Use repeated low-signal routine categories from recent PR titles
- Examples: release workflow churn, version bumps, routine docs sweeps, mechanical post-release PRs
- Prefer precise title patterns over broad catchalls
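The difference between a precise pattern and a broad catchall is easy to test before writing a rule. In this sketch the patterns and the `matched` helper are hypothetical examples, not rules taken from any real repo or from slop-farmer's defaults.

```python
import re

# Hypothetical patterns: anchored so they match only the routine PRs.
PRECISE = [r"^\[release\]\s", r"^bump version to \d+\.\d+\.\d+$"]
# A broad catchall that also hits ordinary PRs mentioning the word.
BROAD = [r"release"]


def matched(titles, patterns):
    """Return the titles matched by at least one pattern (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [t for t in titles if any(c.search(t) for c in compiled)]
```

Run both lists against recent titles: a title such as "Fix crash when release notes are empty" matches `BROAD` but not `PRECISE`, which is the kind of false positive a suppression rule should avoid.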

For issues:

- Inspect issue templates and recent issue titles
- Call out repetitive prefixes or canned text
- If issue forms (`.yml`) or `config.yml` exist, summarize them because they help identify the effective issue intake flow
- Be explicit that this is for manual review or future implementation if there is no config hook yet

### 7.5) Optional fast-agent synthesis branch

If `fast-agent` is installed and the relevant model secret is present, you can upgrade the heuristic pass into a model-assisted synthesis pass.

Recommended flow:

1. Run the deterministic extractor first:

```bash
uv run skills/farmer-setup/scripts/suggest_setup_config.py owner/name --prepare-fast-agent
```

2. If the model requires `HF_TOKEN` and it is present, run:

```bash
uv run skills/farmer-setup/scripts/suggest_setup_config.py owner/name --use-fast-agent --fast-agent-model kimi25
```

Notes:

- The script uses `fast-agent check models --for-model <model> --json` to discover required secret env var names.
- The synthesis flow now writes a JSON Schema artifact and runs `fast-agent go --no-env --model kimi25 --prompt-file ... --json-schema ... --quiet`.
- This is cleaner and more reliable than prompt-only “please emit JSON” guidance because fast-agent validates the output shape before returning it.
- `--shell` is optional and usually unnecessary here because the script has already collected the evidence.
- Prefer the deterministic branch as the source of truth and use the model branch for regex/rule synthesis.

### 8) Hugging Face setup guidance

For local interactive usage:

```bash
python -m pip install -U huggingface_hub
hf auth login
hf auth whoami
```

For CI/headless/Space-style usage:

```bash
export HF_TOKEN=...
hf auth whoami
```

If the user wants both dataset publishing and dashboard deployment, remind them they need:

- a dataset repo id like `USER/repo-pr`
- a Space id like `USER/repo-dashboard`
- for non-static Spaces, any secrets/env vars, which the HF CLI can provision at repo creation time

### 7.6) Tuning pass: use deterministic analysis to catch over-clustering

After you draft repo-specific PR cleanup rules, do a tuning pass before treating the config as production-ready.

Recommended loop:

1. Scrape a recent bootstrap snapshot.
2. Run deterministic analysis on that snapshot.
3. Inspect the largest meta-bug clusters.
4. If you see a very large **PR-only** cluster dominated by `soft_similarity`, check whether it is:
   - missing a corresponding `pr-scope` code cluster
   - weakly or misleadingly anchored on an issue target
   - full of generic PR-template sections rather than shared code or issue linkage
5. Tighten `pull-requests.template_cleanup` and rerun deterministic analysis before switching back to hybrid for publishing.

Useful commands:

```bash
uv run slop-farmer --config configs/REPO.yaml scrape
uv run slop-farmer --config configs/REPO.yaml analyze --ranking-backend deterministic
uv run slop-farmer --config configs/REPO.yaml pr-scope
```

Heuristic for likely PR-template over-clustering:

- a huge PR-only meta-bug cluster from deterministic analysis
- `evidence_types` dominated by `soft_similarity`
- no comparable `pr-scope` cluster
- many PRs sharing compliance/checklist sections but not code areas

If that happens:

- keep the real problem summary text when possible
- do **not** strip `Summary` blindly
- strip lower-signal sections first:
  - change type / scope
  - regression / verification
  - review conversations
  - compatibility / migration
  - failure recovery
  - risks / mitigations
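The over-clustering heuristic above can be expressed as a small predicate for triaging analysis output. This is a sketch under loud assumptions: the cluster dict layout (`members` with a `kind` field, a flat `evidence_types` list), the thresholds, and the `shared_file` evidence name in the usage note are all hypothetical; only `evidence_types` and `soft_similarity` come from this document.

```python
from collections import Counter


def looks_template_driven(
    cluster: dict,
    min_size: int = 20,          # hypothetical threshold for "huge"
    soft_share: float = 0.8,     # hypothetical threshold for "dominated by"
    has_pr_scope_match: bool = False,
) -> bool:
    """Flag a large PR-only cluster whose evidence is mostly soft_similarity."""
    members = cluster.get("members", [])
    if len(members) < min_size:
        return False
    if any(member.get("kind") != "pr" for member in members):
        return False  # not PR-only
    evidence = Counter(cluster.get("evidence_types", []))
    total = sum(evidence.values())
    if total == 0:
        return False
    mostly_soft = evidence["soft_similarity"] / total >= soft_share
    return mostly_soft and not has_pr_scope_match
```

For example, a 25-PR cluster with nine `soft_similarity` entries and one `shared_file` entry is flagged, unless a comparable `pr-scope` cluster exists.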

Verified useful HF Space commands:

```bash
hf repos create USER/my-space --type space --space-sdk static --exist-ok
hf repos create USER/my-space --type space --space-sdk docker --secrets HF_TOKEN --secrets GITHUB_TOKEN
hf repos create USER/my-space --type space --space-sdk docker --secrets-file .space.secrets --env-file .space.env
hf spaces info USER/my-space --format json
hf upload USER/my-space ./dist . --repo-type space
```

Useful API-Space volume/bootstrap commands:

```bash
python - <<'PY'
from huggingface_hub import HfApi
api = HfApi()
api.create_bucket("USER/my-api-space-data", exist_ok=True)
PY

hf spaces volumes set USER/my-api-space -v hf://buckets/USER/my-api-space-data:/data
hf spaces volumes ls USER/my-api-space
```

Important notes:

- HF buckets are **not** created with `hf repo create`; use `HfApi.create_bucket(...)`.
- For writable bucket mounts, use `hf://buckets/USER/bucket-name:/mount/path`.
- The dataset repo remains the canonical published state; the mounted bucket is only a mutable operational cache.

### 9) Reliable deployment architecture

Recommend this as the default robust setup:

- **Runner:** local cron, GitHub Actions, or another scheduler
- **Source of truth:** Hugging Face dataset repo published by `refresh-dataset` plus `publish-analysis-artifacts`
- **Presentation:** static Hugging Face Space deployed by `deploy-dashboard`
- **Optional read API:** Docker Space backed by the same dataset repo, with an optional writable HF bucket at `/data`

Explain why this is better than “run everything inside a Space”:

- Spaces are best as published app artifacts
- dataset repos preserve durable snapshots
- scheduled scraping and dashboard rebuilds are easier to reason about outside the serving layer

If the user specifically insists on running the pipeline inside a Space, be honest:

- it is possible with a Docker/custom Space plus secrets and startup scripting
- HF CLI supports `--secrets`, `--secrets-file`, `--env`, and `--env-file` when creating a Space
- but it is less robust as a recurring data pipeline than external scheduling + dataset + static Space

### 10) Command sequences you should prefer

#### Config-driven local workflow

```bash
uv run slop-farmer --config configs/my-repo.yaml refresh-dataset
uv run slop-farmer --config configs/my-repo.yaml analyze
uv run slop-farmer --config configs/my-repo.yaml publish-analysis-artifacts \
  --analysis-id hybrid-gpt54mini-v3 \
  --canonical
uv run slop-farmer --config configs/my-repo.yaml pr-scope
uv run slop-farmer --config configs/my-repo.yaml new-contributor-report
uv run slop-farmer --config configs/my-repo.yaml deploy-dashboard --refresh-contributors
```

Dashboard default note:

- if `--analysis-input` is omitted, dashboard export now prefers published `analysis/current/`
- snapshot-local analysis is only a fallback when canonical current analysis is absent
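The preference order in the note above can be pictured as a three-step lookup. The sketch below is illustrative only: `pick_analysis_input` is a hypothetical name, and the caller supplies both candidate paths because this document does not specify the on-disk layout.

```python
from pathlib import Path
from typing import Optional


def pick_analysis_input(
    explicit: Optional[Path],
    published_current: Path,
    snapshot_local: Path,
) -> Optional[Path]:
    """Prefer an explicit input, then published current analysis, then snapshot-local."""
    if explicit is not None:
        return explicit
    if published_current.exists():
        return published_current
    return snapshot_local if snapshot_local.exists() else None
```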

## Output requirements

When producing a setup answer, include:

- the exact env var names: `GITHUB_TOKEN` and `HF_TOKEN`
- a ready-to-save YAML config
- at least one recommended smoke-test command
- at least one recommended recurring/deployment command
- an explicit note if issue cleanup guidance is advisory rather than currently configurable

## Avoid

- Do not invent unsupported config keys for issue cleanup.
- Do not say GitHub write access is needed for scraping.
- Do not recommend fully unbounded historical scraping by default on busy repos.
- Do not blur together “publish dataset” and “deploy dashboard”; they are separate steps.
- Do not describe Spaces as the only robust place to run scheduled data collection.
- Do not claim template discovery is always repo-local; owner-level `.github` community health files may be the effective source.