---
name: hf-space-recovery
description: Diagnose and recover failing or stuck Hugging Face Space deployments for OpenEnv environments. Use when deploying envs from `envs/` to the Hub (`openenv` namespace with version suffixes), when Spaces are in `BUILDING`/`APP_STARTING`/`RUNTIME_ERROR`, or when release collections need to be reconciled after targeted redeploys.
---
# HF Space Recovery
Use this skill to recover OpenEnv Hub deployments quickly with minimal blast radius.
## Execute This Workflow
### 1) Confirm release tuple
Use a single release tuple across all commands:
- Namespace: `openenv`
- Version: `vX.Y.Z`
- Space suffix: `-vX-Y-Z`
Default to a version suffix and treat unsuffixed Spaces as legacy.
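The mapping from version to suffix can be sketched in a few lines. This is only an illustration of the convention above; the helper name `release_tuple` is hypothetical and not part of the OpenEnv scripts.

```python
import re


def release_tuple(version: str, namespace: str = "openenv") -> dict:
    """Derive the release tuple from a vX.Y.Z version string.

    Hypothetical helper: it only mirrors the convention described above
    (dots in the version become hyphens in the Space suffix).
    """
    if not re.fullmatch(r"v\d+\.\d+\.\d+", version):
        raise ValueError(f"expected vX.Y.Z, got {version!r}")
    return {
        "namespace": namespace,
        "version": version,
        "suffix": "-" + version.replace(".", "-"),
    }
```

Deriving the suffix once and passing the same tuple to every command helps avoid mixing `vX.Y.Z` and `-vX-Y-Z` values from different releases.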
### 2) Snapshot runtime status
Collect all versioned Spaces and isolate the non-running ones:
```bash
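# List versioned Spaces with their stage and error message (tab-separated).
# The jq program is single-quoted, so it must not contain line-continuation backslashes.
hf spaces ls --author openenv --limit 500 --expand=runtime \
| jq -r '.[] | select(.id|test("-v[0-9]+-[0-9]+-[0-9]+$"))
| [.id, .runtime.stage, (.runtime.raw.errorMessage // "")] | @tsv' \
| sort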
```
Treat `RUNNING` and `SLEEPING` as healthy. Triage everything else.
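The same health rule can be applied in Python to the JSON list once it has been parsed. This is a minimal sketch that assumes each entry has an `id` and a `runtime.stage` field, as the `jq` filter above does; the function name `unhealthy_spaces` is hypothetical.

```python
import re

# Stages this skill treats as healthy.
HEALTHY_STAGES = {"RUNNING", "SLEEPING"}

# Same version-suffix pattern as the jq filter above.
VERSIONED = re.compile(r"-v[0-9]+-[0-9]+-[0-9]+$")


def unhealthy_spaces(spaces: list) -> list:
    """Return sorted (space_id, stage) pairs for versioned Spaces needing triage."""
    rows = []
    for space in spaces:
        stage = (space.get("runtime") or {}).get("stage")
        if VERSIONED.search(space["id"]) and stage not in HEALTHY_STAGES:
            rows.append((space["id"], stage))
    return sorted(rows, key=lambda row: row[0])
```

Unsuffixed Spaces are skipped here because the workflow treats them as legacy.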
### 3) Classify and extract signal
- `RUNTIME_ERROR`: read traceback from `.runtime.raw.errorMessage`.
- `BUILD_ERROR`: read build error text from runtime info, then patch Dockerfile/deps.
- `APP_STARTING` longer than 10 minutes: inspect event stream and metrics before changing code.
```bash
hf spaces info openenv/<space-id> --expand=runtime
curl -sS -m 10 https://huggingface.co/api/spaces/openenv/<space-id>/events | sed -n '1,140p'
curl -sS -m 10 -i https://huggingface.co/api/spaces/openenv/<space-id>/metrics | sed -n '1,120p'
```
Read `references/troubleshooting.md` for symptom-to-fix mappings.
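The classification rules in this step can be summarized as a small lookup. This is a sketch only: the function name `next_action` is hypothetical, and the fallback of waiting and re-checking for any other stage (for example `BUILDING`) is an assumption rather than something the workflow states.

```python
# Threshold from this step: APP_STARTING for longer than 10 minutes needs inspection.
APP_STARTING_LIMIT_MIN = 10


def next_action(stage: str, minutes_in_stage: float = 0.0) -> str:
    """Map a runtime stage to the triage action described in this step."""
    if stage in ("RUNNING", "SLEEPING"):
        return "healthy"
    if stage == "RUNTIME_ERROR":
        return "read traceback from .runtime.raw.errorMessage"
    if stage == "BUILD_ERROR":
        return "read build error text, then patch Dockerfile/deps"
    if stage == "APP_STARTING" and minutes_in_stage > APP_STARTING_LIMIT_MIN:
        return "inspect event stream and metrics before changing code"
    # Assumption: other stages are still in progress.
    return "wait and re-check"
```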
### 4) Apply minimal fix and targeted redeploy
Prefer targeted redeploys over full-fleet pushes:
```bash
scripts/prepare_hf_deployment.sh \
--hf-namespace openenv \
--env <env_name> \
--skip-collection
```
Use the `openenv` CLI as a supplement to, not a replacement for, release triage:
- Validate env layout quickly (`uv run openenv validate ...`) when applicable.
- Keep release deploys on `scripts/prepare_hf_deployment.sh` to preserve suffix/pinning behavior.
### 5) Unstick runtime when code is already good
If the Space remains in `APP_STARTING` with no actionable error, trigger a factory reboot:
```bash
uv run --with huggingface_hub python - <<'PY'
from huggingface_hub import HfApi
api = HfApi()
api.restart_space("openenv/<space-id>", factory_reboot=True)
PY
```
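After a factory reboot it can help to poll the runtime stage for a bounded time before escalating. The sketch below is one possible way to do that. `wait_for_stage` and its parameters are hypothetical, and the timeout and poll interval are illustrative defaults, not values taken from the OpenEnv tooling. The stage source is passed in as a callable so the loop can be exercised without network access.

```python
import time


def wait_for_stage(get_stage, healthy=("RUNNING", "SLEEPING"),
                   timeout_s=600.0, poll_s=15.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_stage() until it returns a healthy stage or the timeout expires.

    get_stage is any zero-argument callable returning the current stage string.
    """
    deadline = clock() + timeout_s
    while True:
        stage = get_stage()
        if stage in healthy:
            return stage
        if clock() >= deadline:
            raise TimeoutError(f"Space still in {stage!r} after {timeout_s}s")
        sleep(poll_s)


# Example wiring (requires network access and a valid token):
#   from huggingface_hub import HfApi
#   api = HfApi()
#   wait_for_stage(lambda: api.get_space_runtime("openenv/<space-id>").stage)
```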
If the Space is still stuck, delete and recreate it as a last resort:
```bash
hf repo delete openenv/<space-id> --repo-type space
scripts/prepare_hf_deployment.sh --hf-namespace openenv --env <env_name> --skip-collection
```
### 6) Verify and close
Verify both the runtime stage and the health endpoint:
```bash
hf spaces info openenv/<space-id> --expand=runtime
curl -sS -m 10 https://<space-subdomain>.hf.space/health
```
Then verify fleet-wide:
```bash
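# Print only versioned Spaces that are neither RUNNING nor SLEEPING; empty output means the fleet is healthy.
# The jq program is single-quoted, so it must not contain line-continuation backslashes.
hf spaces ls --author openenv --limit 500 --expand=runtime \
| jq -r '.[] | select(.id|test("-v[0-9]+-[0-9]+-[0-9]+$"))
| select(.runtime.stage!="RUNNING" and .runtime.stage!="SLEEPING")
| [.id, .runtime.stage] | @tsv' | sort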
```
### 7) Reconcile collection
When targeted deploys are done, update collection membership for the same version:
```bash
python3 scripts/manage_hf_collection.py \
--version vX.Y.Z \
--collection-namespace openenv \
--space-id openenv/<space-id>
```
Add one `--space-id` flag per redeployed Space.
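When several Spaces were redeployed, the argument list can be assembled programmatically. This is a minimal sketch using only the flags shown above; the helper name `collection_command` is hypothetical, and the script's behavior beyond those flags is not assumed.

```python
def collection_command(version: str, space_ids: list,
                       namespace: str = "openenv") -> list:
    """Build the manage_hf_collection.py argv with one --space-id per Space."""
    cmd = [
        "python3", "scripts/manage_hf_collection.py",
        "--version", version,
        "--collection-namespace", namespace,
    ]
    for space_id in space_ids:
        cmd += ["--space-id", space_id]
    return cmd


# To execute from the repository root:
#   import subprocess
#   subprocess.run(collection_command("vX.Y.Z", ["openenv/<space-id>"]), check=True)
```

Passing the command as a list avoids shell quoting issues when Space IDs are substituted in.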