rl-post-training
$ npx mdskill add elizaOS/eliza/rl-post-training
SKILL.md (.github/skills/rl-post-training)
---
name: rl-post-training
description: Diagnostic guide for RL-based post-training of language models (GRPO, PPO, REINFORCE, DPO). Use proactively when debugging a training pipeline that shows no improvement, loss anomalies, reward stagnation, NaN gradients, or other unexpected behavior during reinforcement learning fine-tuning. Work through all pipeline stages — reward, advantages, log-probs, loss, generation/decoding — before concluding the diagnosis is complete; stopping after one or two fixes is a common failure mode. Covers log-probability math, advantage estimation, numerical stability, reward processing, and generation/decoding pipeline issues.
---

# RL Post-Training — Concepts & Diagnostic Guide

## Core Concepts

RL post-training optimizes a language model's policy using reward signals. The standard pipeline:

```
prompt → generate completions → score with reward → compute advantages → policy gradient update
```

Each stage has distinct failure modes. When a model "shows no improvement," the bug could be anywhere in this pipeline.

## Diagnostic Methodology

When RL training produces no improvement, work through these stages in order. Each stage depends on the previous one being correct.

### Stage 1: Verify Reward Signal

- Are rewards non-constant? If all rewards are identical, there is no learning signal.
- Do rewards correlate with completion quality? Spot-check decoded completions against their scores.
- Is the reward function being called on the correct text? Check that decoding/stripping preserves the content the reward function needs to evaluate.

### Stage 2: Verify Advantage Computation

- Are advantages non-zero when rewards vary? If they collapse to ~0, the policy gradient vanishes.
- Check the magnitude and dtype of every numerical-stability constant in the advantage path (additive epsilons, clipping bounds). Compare each to what the math requires.
- Check the group size `G`. `G ≤ 2` makes `std` either undefined or extremely noisy.

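
The Stage 2 checks can be sketched on a fixed input. This is a minimal, dependency-free illustration of GRPO-style group normalization, not the pipeline's actual code: the helper name `group_advantages`, the default `eps=1e-6`, and the choice of population standard deviation are all assumptions made for the example.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards to zero mean and unit scale.

    Hypothetical helper. `eps` only guards against division by zero,
    so it should be positive and far smaller than the reward spread.
    """
    if len(rewards) <= 2:
        raise ValueError("group size G must exceed 2 for a usable std")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [0.0, 1.0, 0.0, 1.0]

# Varying rewards with a sane epsilon: magnitudes close to 1.
varied = group_advantages(rewards)

# Constant rewards: every advantage is exactly 0, so there is no signal.
constant = group_advantages([0.5, 0.5, 0.5, 0.5])

# Same varying rewards with an out-of-range epsilon: the advantages
# shrink toward 0 even though the reward signal is fine.
collapsed = group_advantages(rewards, eps=100.0)

for name, adv in [("varied", varied), ("constant", constant), ("collapsed", collapsed)]:
    print(f"{name:10} max|A| = {max(abs(a) for a in adv):.6f}")
```

Comparing the `varied` and `collapsed` rows separates a reward problem from an epsilon problem: both receive identical rewards, and only the constant differs.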
### Stage 3: Verify Log-Probability Computation

- Verify bounds: log-probs of valid tokens must be non-positive.
- Compare your implementation against `F.log_softmax` on a small deterministic input — a numerical match rules out sign errors, wrong gathering axis, and off-by-one subtraction.
- On a near-one-hot input, confirm the dominant token's log-prob is close to 0 (not close to the min).

### Stage 4: Verify Loss Computation

- Is the loss changing across steps? Flat loss suggests zero gradients upstream.
- Log the KL term and the policy-gradient term separately. Either dominating the other is diagnostic.
- Check the fraction of clipped samples. Near-100% clipping means the clip range is starving the signal.

### Stage 5: Verify Generation and Decoding

- Are padding and decoder artefacts stripped from the text the reward function sees?
- Does every completion shape your model can emit survive the decoding path with non-empty output where a human would expect non-empty output? Include degenerate cases (no formatting markers, only a prefix, only a suffix) in the round-trip test.
- Print or log a sample of the actual strings handed to the reward function — mismatches between "what the model generated" and "what the reward saw" are often visible at a glance.

## Fixing, Not Rewriting

A bug in a branched function is a bug in one branch, not a verdict on the whole function. Before editing:

- Enumerate the input shapes the function handles today and the output each branch produces. The branches exist because callers rely on them.
- Identify which input-output pairs violate the intended contract. Those are the only branches you need to change.
- If your diff collapses a multi-branch function to a one-liner, you have almost certainly broken a contract a different caller depends on. Re-read the call sites before committing.

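
One way to carry out the enumeration step is to pin every input shape in a small round-trip table before touching the function. The sketch below is illustrative only: the `<answer>` markers, the `<pad>` token, and the `extract_answer` helper are hypothetical stand-ins for whatever formatting the real pipeline uses.

```python
PAD = "<pad>"
OPEN, CLOSE = "<answer>", "</answer>"


def extract_answer(text):
    """Return the text the reward function should score (hypothetical)."""
    text = text.replace(PAD, "").strip()
    has_open, has_close = OPEN in text, CLOSE in text
    if has_open and has_close:
        return text.split(OPEN, 1)[1].split(CLOSE, 1)[0].strip()
    if has_open:  # only a prefix marker: keep everything after it
        return text.split(OPEN, 1)[1].strip()
    if has_close:  # only a suffix marker: keep everything before it
        return text.split(CLOSE, 1)[0].strip()
    return text  # no markers: pass the cleaned text through unchanged


# One row per input shape the function handles today, with the output
# a human reader would expect for that shape.
CASES = {
    "<answer>42</answer><pad><pad>": "42",
    "<answer>42": "42",
    "42</answer>": "42",
    "42<pad>": "42",
    "<pad><pad>": "",
}

results = {raw: extract_answer(raw) for raw in CASES}
for raw, expected in CASES.items():
    status = "ok" if results[raw] == expected else "MISMATCH"
    print(f"{status:8} {raw!r} -> {results[raw]!r}")
```

With a table like this in place, a fix to one branch shows up as a single changed row, and a rewrite that flattens the branches shows up as several rows flipping to `MISMATCH`.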
The same principle applies to epsilon values, sign conventions, and clipping bounds — if a constant looks wrong, replace it with a correct constant; don't remove the surrounding numerical-stability logic.

## Key Mathematical Invariants

These invariants should always hold. If any is violated, there is a bug.

| Invariant | What it means | How to check |
|-----------|--------------|-------------|
| `log_prob <= 0` | Log of a probability is non-positive | `assert (log_probs <= 1e-6).all()` |
| log-probs match `F.log_softmax` | Manual implementation equals the reference | `torch.allclose(manual, F.log_softmax(...).gather(...))` |
| `sum(softmax(logits)) == 1` | Probabilities sum to 1 | `assert torch.allclose(softmax.sum(-1), ones)` |
| `0 < epsilon << 1` | Additive epsilons exist for numerical stability only | `assert 0 < epsilon < 1` |
| `advantages != 0` when rewards vary | Non-constant rewards yield non-zero advantages | `assert advantages.abs().max() > 0.1` |
| Decoding preserves content | Every generation shape the model emits survives the decode path with the content a human reader would expect | Round-trip representative samples; confirm non-empty where non-empty is expected |

## Common Pitfall Categories

Detailed pitfall catalog with examples is in `references/common-pitfalls.md`. Summary:

1. **Sign errors in log-space** — log-softmax, KL divergence, DPO log-ratio
2. **Numerical stability constants out of range** — additive epsilons, clip bounds, temperature
3. **String processing in decoding** — cleanup that blanks shapes the rules didn't anticipate
4. **Reward / decoding format mismatch** — reward function sees text with the shape it needs removed
5. **Reference model drift** — reference model not frozen, or gradients flowing through it
6. **Gradient flow breakage** — detached tensors, in-place ops, `.item()`/numpy round-trips in the loss path
7. **Configuration misuse** — SFT-scale LR, KL coefficient dominating, clip range too tight, `G` too small

## Available References

| File | Contents | When to load |
|------|----------|-------------|
| `references/common-pitfalls.md` | Catalog of pitfall categories with symptoms and detection strategies | When you have a suspicious area and want to match it against known pattern shapes |
| `references/diagnostic-workflow.md` | Stage-by-stage diagnostic procedures with verification snippets | When you need guidance on which invariant to check in which order |
| `scripts/verify_pipeline.py` | Runnable diagnostic that prints log-prob / advantage / decoding values on small fixed inputs so you can compare them against the invariants above | Before declaring a fix done: run this and inspect the output, paying attention to lines marked `?` |

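
Independent of the files listed above, whose contents are not shown here, the log-probability invariants can be exercised on a fixed input with a few lines of plain Python. This is a sketch under assumptions: `log_softmax` below is a hand-written stable reference, `buggy_log_softmax` is a deliberately broken variant invented for the demonstration, and in a PyTorch pipeline one would compare against `F.log_softmax` instead.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def buggy_log_softmax(logits):
    """Same computation with the subtraction flipped (a sign error)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [log_norm - x for x in logits]


# Near-one-hot input: the first token should carry almost all the mass.
near_one_hot = [20.0, 0.0, 0.0, 0.0]
log_probs = log_softmax(near_one_hot)
buggy = buggy_log_softmax(near_one_hot)

# Invariant: log-probs of valid tokens are non-positive (small tolerance).
non_positive = all(lp <= 1e-6 for lp in log_probs)
# Invariant: exponentiated log-probs sum to 1.
sums_to_one = math.isclose(sum(math.exp(lp) for lp in log_probs), 1.0)
# Invariant: the dominant token's log-prob is close to 0, not the minimum.
dominant_near_zero = abs(log_probs[0]) < 1e-6
# The sign-flipped variant violates the non-positivity bound.
bug_detected = any(lp > 1e-6 for lp in buggy)

print(non_positive, sums_to_one, dominant_near_zero, bug_detected)
```

The flipped variant still returns a value near 0 for the dominant token, so the near-one-hot check alone would miss it; the non-positivity bound on the remaining tokens is what exposes the sign error.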