cleanrl
$ npx mdskill add mkurman/zorai/cleanrl
Implement reference deep reinforcement learning algorithms instantly.
- Provides standalone PPO, DQN, SAC, and TD3 code for prototyping.
- Supports Atari, MuJoCo, Procgen, and PettingZoo environments, plus JAX variants.
- Delivers research-friendly features within compact single-file scripts.
- Enables rapid experimentation with multi-agent and continuous actions.
---
name: cleanrl
description: Single-file deep reinforcement learning implementations (CleanRL). High-quality standalone implementations of PPO, DQN, C51, SAC, DDPG, TD3 with research-friendly features. Each algorithm is a self-contained file with ~300-500 lines. Includes Atari, MuJoCo, Procgen, PettingZoo multi-agent, and JAX variants. Use for RL algorithm reference, rapid prototyping, and understanding implementation details.
license: MIT license
tags: [ppo, dqn, sac, rl-reference-implementation, cleanrl]
metadata:
  skill-author: K-Dense Inc.
---
| Algorithm | File | Example Command |
|-----------|------|-----------------|
| **PPO** | `ppo.py` | `python cleanrl/ppo.py --env-id CartPole-v1` |
| **PPO Atari** | `ppo_atari.py` | `python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4` |
| **PPO Continuous** | `ppo_continuous_action.py` | `python cleanrl/ppo_continuous_action.py --env-id HalfCheetah-v4` |
| **PPO Multi-Agent** | `ppo_pettingzoo_ma_atari.py` | `python cleanrl/ppo_pettingzoo_ma_atari.py --env-id pong_v3` |
| **DQN** | `dqn.py` | `python cleanrl/dqn.py --env-id CartPole-v1` |
| **DQN Atari** | `dqn_atari.py` | `python cleanrl/dqn_atari.py --env-id BreakoutNoFrameskip-v4` |
| **C51 Atari** | `c51_atari.py` | `python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4` |
| **SAC Continuous** | `sac_continuous_action.py` | `python cleanrl/sac_continuous_action.py --env-id HalfCheetah-v4` |
| **SAC Atari** | `sac_atari.py` | `python cleanrl/sac_atari.py --env-id BreakoutNoFrameskip-v4` |
| **DDPG** | `ddpg_continuous_action.py` | `python cleanrl/ddpg_continuous_action.py --env-id HalfCheetah-v4` |
| **TD3** | `td3_continuous_action.py` | `python cleanrl/td3_continuous_action.py --env-id HalfCheetah-v4` |
### 3. PPO Training Workflow
```bash
# Minimal PPO on CartPole
python cleanrl/ppo.py \
    --seed 1 \
    --env-id CartPole-v1 \
    --total-timesteps 50000 \
    --track \
    --wandb-project-name my-project

# PPO on Atari (standard config)
python cleanrl/ppo_atari.py \
    --seed 1 \
    --env-id BreakoutNoFrameskip-v4 \
    --total-timesteps 10000000 \
    --track \
    --capture-video

# PPO on MuJoCo continuous control
python cleanrl/ppo_continuous_action.py \
    --seed 1 \
    --env-id HalfCheetah-v4 \
    --total-timesteps 1000000
```
**Key PPO Hyperparameters:**
| Parameter | CartPole/Classic | Atari | MuJoCo |
|-----------|-----------------|-------|--------|
| `--total-timesteps` | 50K | 10M | 1M |
| `--learning-rate` | 2.5e-4 | 2.5e-4 | 3e-4 |
| `--num-envs` | 4 | 8 | 1 |
| `--num-steps` | 128 | 128 | 2048 |
| `--anneal-lr` | True | True | False |
| `--gae-lambda` | 0.95 | 0.95 | 0.95 |
| `--update-epochs` | 4 | 4 | 10 |
| `--norm-adv` | True | True | True |
| `--clip-coef` | 0.2 | 0.1 | 0.2 |
| `--ent-coef` | 0.01 | 0.01 | 0.0 |
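The rollout flags in this table determine how much data each PPO update sees. The sketch below derives those quantities for the three columns and shows one common form of linear learning-rate annealing. The `num_minibatches` values and the helper names `rollout_sizes` and `annealed_lr` are assumptions for illustration and do not come from the table, so check the script you run for its actual defaults.

```python
def rollout_sizes(total_timesteps, num_envs, num_steps, num_minibatches):
    """Derive per-update sizes from the rollout flags."""
    batch_size = num_envs * num_steps              # transitions collected per update
    minibatch_size = batch_size // num_minibatches
    num_iterations = total_timesteps // batch_size  # number of policy updates
    return batch_size, minibatch_size, num_iterations


def annealed_lr(iteration, num_iterations, learning_rate):
    """Linear decay from learning_rate toward 0; iteration is 1-based."""
    frac = 1.0 - (iteration - 1.0) / num_iterations
    return frac * learning_rate


# num_minibatches is an assumed value here, not taken from the table.
configs = {
    "CartPole": dict(total_timesteps=50_000, num_envs=4, num_steps=128, num_minibatches=4),
    "Atari": dict(total_timesteps=10_000_000, num_envs=8, num_steps=128, num_minibatches=4),
    "MuJoCo": dict(total_timesteps=1_000_000, num_envs=1, num_steps=2048, num_minibatches=32),
}

for name, cfg in configs.items():
    batch, minibatch, iterations = rollout_sizes(**cfg)
    print(f"{name}: batch={batch} minibatch={minibatch} iterations={iterations}")
```

With these numbers the MuJoCo column collects a larger batch per update (2048 transitions) than the other two, which is one reason it pairs with more update epochs.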
### 4. DQN Training
```bash
# DQN on Atari
python cleanrl/dqn_atari.py \
    --seed 1 \
    --env-id BreakoutNoFrameskip-v4 \
    --total-timesteps 10000000 \
    --buffer-size 100000 \
    --learning-starts 80000 \
    --target-network-frequency 1000 \
    --batch-size 32 \
    --track
```
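To see how the flags in the command above interact, the sketch below simulates only the scheduling logic: a bounded replay buffer, a warm-up period before learning starts, and periodic target-network syncs. It is a simplified illustration rather than the script's code. The `train_frequency` parameter, the exact gating conditions, and the function name `simulate_dqn_schedule` are assumptions, and the numbers are scaled down so the loop runs instantly.

```python
from collections import deque


def simulate_dqn_schedule(total_timesteps, buffer_size, learning_starts,
                          train_frequency, target_network_frequency):
    """Count gradient updates and target syncs without training anything."""
    replay = deque(maxlen=buffer_size)  # oldest transitions are evicted first
    updates = 0
    target_syncs = 0
    for step in range(total_timesteps):
        replay.append(step)  # placeholder for an (obs, action, reward, ...) tuple
        if step > learning_starts:  # warm-up: only collect data before this point
            if step % train_frequency == 0:
                updates += 1        # one minibatch would be sampled from replay here
            if step % target_network_frequency == 0:
                target_syncs += 1   # target network would copy the online weights
    return {
        "stored": len(replay),
        "oldest": replay[0],
        "updates": updates,
        "target_syncs": target_syncs,
    }


# Scaled-down stand-ins for the flags in the command above.
print(simulate_dqn_schedule(
    total_timesteps=1000,
    buffer_size=100,
    learning_starts=200,
    train_frequency=4,
    target_network_frequency=100,
))
```

In this sketch the buffer holds only the most recent `buffer_size` transitions, so a small `--buffer-size` relative to `--total-timesteps` means old experience is discarded quickly.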
### 5. Multi-Agent RL with PettingZoo
```bash
# PPO on multi-agent Atari Pong
python cleanrl/ppo_pettingzoo_ma_atari.py \
    --seed 1 \
    --env-id pong_v3 \
    --total-timesteps 10000000 \
    --track

# Available MA environments:
# pong_v3, surround_v2, tennis_v3, space_invaders_v2,
# warlords_v3, combat_plane_v2, combat_tank_v2
```
### 6. Logging and Monitoring
```bash
# TensorBoard (event files are written to runs/ under the directory you launch from)
tensorboard --logdir runs

# Weights & Biases (requires wandb login)
python cleanrl/ppo.py --track --wandb-project-name my-project --wandb-entity my-entity

# Video capture (recordings are saved under videos/)
python cleanrl/ppo_atari.py --capture-video --env-id BreakoutNoFrameskip-v4
```
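When several runs share one log directory, it helps to know how each run's folder is named. The sketch below assumes a naming scheme of the form `env_id__exp_name__seed__timestamp` and per-run subfolders under `runs/` and `videos/`. Both the scheme and the helper names `build_run_name` and `log_dirs` are assumptions for illustration, so confirm them against the script you are running.

```python
import time


def build_run_name(env_id, exp_name, seed, timestamp=None):
    """Compose a run identifier (assumed scheme, see the note above)."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{env_id}__{exp_name}__{seed}__{timestamp}"


def log_dirs(run_name):
    """Where this sketch expects TensorBoard events and videos to land."""
    return {
        "tensorboard": f"runs/{run_name}",
        "videos": f"videos/{run_name}",
    }


run_name = build_run_name("CartPole-v1", "ppo", seed=1)
print(run_name)
print(log_dirs(run_name))
```

Including the seed and a timestamp in the name keeps repeated runs of the same configuration from overwriting each other.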
### 7. JAX-Accelerated Variants
JAX compilation, combined with EnvPool in the `envpool` scripts, can make training roughly 5-10x faster:
```bash
# Install JAX support
pip install -r requirements/requirements-jax.txt

# JAX PPO on Atari (ultra-fast; EnvPool uses v5-style IDs such as Breakout-v5)
python cleanrl/ppo_atari_envpool_xla_jax.py \
    --env-id Breakout-v5 \
    --total-timesteps 10000000

# JAX DQN on Atari
python cleanrl/dqn_atari_jax.py \
    --env-id BreakoutNoFrameskip-v4 \
    --total-timesteps 10000000
```
### 8. Docker and Cloud (AWS)
```bash
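# Build Docker image
docker build -t cleanrl .

# Training run to execute in the container or submit as a cloud job
# (the command below is a plain training invocation, not an AWS Batch submission)
python cleanrl/ppo_atari.py \
    --env-id BreakoutNoFrameskip-v4 \
    --total-timesteps 10000000 \
    --track \
    --upload-model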
```
### 9. Algorithm Structure (Reading an Implementation)
Each file follows a consistent structure:
```python
# 1. Imports
# 2. parse_args(): CLI arguments
# 3. make_env(): Environment creation
# 4. Agent class (if needed): Neural network, usually simple MLP/CNN
# 5. main():
#    a. Setup: seeding, device, envs
#    b. Initialize agent, optimizer
#    c. Initialize storage (rollout buffer, replay buffer)
#    d. Training loop:
#       - Collect experience
#       - Compute returns/advantages
#       - Update policy/value/Q-network
#       - Log metrics
#    e. Save model, upload
```
Each file is ~300-500 lines and is meant to be read top-to-bottom.
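As an illustration of the "compute returns/advantages" step in the outline above, here is a minimal pure-Python sketch of Generalized Advantage Estimation (GAE), the estimator controlled by `--gae-lambda`. It is not copied from any script. The function name `compute_gae` and the convention that `dones[t]` marks an observation that starts a new episode are assumptions, and real implementations operate on batched tensors rather than lists.

```python
def compute_gae(rewards, values, dones, next_value, next_done,
                gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one environment's rollout.

    dones[t] == 1.0 means the observation at step t began a new episode,
    so no value is bootstrapped across the boundary between t-1 and t.
    """
    num_steps = len(rewards)
    advantages = [0.0] * num_steps
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        if t == num_steps - 1:
            next_nonterminal = 1.0 - next_done
            bootstrap = next_value
        else:
            next_nonterminal = 1.0 - dones[t + 1]
            bootstrap = values[t + 1]
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * bootstrap * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[0.0, 0.0, 0.0],
    next_value=0.0,
    next_done=0.0,
    gamma=1.0,
    gae_lambda=1.0,
)
print(adv, ret)
```

With `gamma` and `gae_lambda` both set to 1 and a zero value function, the advantages reduce to plain reward-to-go sums, which makes the recursion easy to check by hand.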
### 10. Debugging and Development
```bash
# Minimal test run (fewer steps, more frequent logging)
python cleanrl/ppo.py \
    --env-id CartPole-v1 \
    --total-timesteps 5000 \
    --num-envs 1 \
    --num-steps 32 \
    --track

# Disable wandb (pure TensorBoard)
python cleanrl/ppo.py --env-id CartPole-v1 --total-timesteps 50000

# Check available env IDs
python -c "import gymnasium as gym; print([e for e in gym.envs.registry if 'CartPole' in e])"
```
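For repeated smoke tests across several seeds, it can be convenient to generate the command lines instead of typing them. The sketch below builds argument lists from a dictionary of flags and prints them without executing anything. The helper name `build_command` and the rule that boolean flags are emitted only when true are assumptions for illustration. The flag names are the ones used in the commands above.

```python
import shlex


def build_command(script, flags):
    """Turn a {flag: value} mapping into an argv list for a training script."""
    cmd = ["python", script]
    for name, value in flags.items():
        if value is True:
            cmd.append(f"--{name}")          # bare switch, e.g. --track
        elif value is False or value is None:
            continue                         # omit disabled switches entirely
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


smoke_test = {
    "env-id": "CartPole-v1",
    "total-timesteps": 5000,
    "num-envs": 1,
    "num-steps": 32,
    "track": False,  # keep the smoke test on plain TensorBoard
}

for seed in (1, 2, 3):
    argv = build_command("cleanrl/ppo.py", {"seed": seed, **smoke_test})
    print(shlex.join(argv))  # pass argv to subprocess.run to actually launch it
```

Keeping the result as a list avoids shell-quoting issues if you later hand it to `subprocess.run`.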
## Key Patterns
1. **CleanRL is NOT a library** — don't `import cleanrl`, run the scripts directly
2. **Each file is self-contained** — copy `ppo.py` and modify it for your research
3. **Use `--track` for W&B logging**, omit for plain TensorBoard
4. **`--capture-video` saves agent gameplay** — great for qualitative evaluation
5. **JAX variants are fastest** but require understanding of `jax.lax.scan`
6. **All implementations are benchmarked** — see https://benchmark.cleanrl.dev
## References
- [CleanRL Documentation](https://docs.cleanrl.dev/)
- [Algorithm Benchmarks](https://benchmark.cleanrl.dev/)
- [JMLR Paper](https://www.jmlr.org/papers/volume23/21-1342/21-1342.pdf)
- [CORL (offline RL fork)](https://github.com/corl-team/CORL)
- [LeanRL (optimized PyTorch fork)](https://github.com/pytorch-labs/LeanRL)