trl

Name: trl
Author: elizaOS/eliza
$npx mdskill add elizaOS/eliza/trl
TRL is organized around a trainer hierarchy that extends Hugging Face `transformers.Trainer`.
SKILL.md
---
name: trl
description: Reference for the TRL (Transformer Reinforcement Learning) library codebase. Use proactively before reading or editing any file under `trl/` so you have the intended contracts and invariants in mind, not just what the current code says. Covers trainer hierarchy (SFT, DPO, GRPO, KTO), shared utility functions (selective_log_softmax, decode_and_strip_padding, padding helpers), configuration system, model wrappers, and how data flows through any TRL trainer.
---

# TRL Library Reference

## Package Structure

TRL is organized around a trainer hierarchy that extends Hugging Face `transformers.Trainer`.

```
trl/
├── trainer/
│   ├── grpo_trainer.py       # GRPOTrainer
│   ├── grpo_config.py        # GRPOConfig
│   ├── sft_trainer.py        # SFTTrainer (supervised fine-tuning)
│   ├── dpo_trainer.py        # DPOTrainer (direct preference optimization)
│   ├── kto_trainer.py        # KTOTrainer (Kahneman-Tversky optimization)
│   ├── online_dpo_trainer.py # OnlineDPOTrainer
│   ├── utils.py              # Shared utilities (log probs, decoding, padding)
│   └── ...
├── models/
│   └── modeling_value_head.py  # Value head for PPO-style trainers
├── data_utils.py
├── commands/                   # CLI entry points
└── ...
```

## Trainer Hierarchy

All TRL trainers extend `transformers.Trainer`:
```
transformers.Trainer
├── SFTTrainer          # Supervised fine-tuning
├── DPOTrainer          # Direct preference optimization
├── GRPOTrainer         # Group relative policy optimization
├── KTOTrainer          # Kahneman-Tversky optimization
└── OnlineDPOTrainer    # Online DPO
```

Each trainer overrides `compute_loss` with its specific objective, and RL-based trainers (GRPO, OnlineDPO) additionally override `training_step` to add a generation phase before the optimization step.

## Shared Utility Functions (trainer/utils.py)

These utilities are used across multiple trainers. Read the source before modifying; the contracts below are what callers rely on.

### `selective_log_softmax(logits, index)`
Memory-efficient per-token log-probability. Equivalent in value to `F.log_softmax(logits, dim=-1).gather(...)` at the selected token positions, but avoids materializing the full vocab-sized tensor.

**Contract:**
- Input: `logits [B, T, V]`, `index [B, T]`
- Output: `log_probs [B, T]`, each entry a valid log-probability (i.e. non-positive)
- Must agree with `F.log_softmax` to within numerical tolerance on the same inputs

### `decode_and_strip_padding(input_ids, tokenizer)`
Converts a batch of token ID tensors into the cleaned text strings that the reward function will score.

**Contract:**
- Input: `input_ids [B, T]`, tokenizer
- Output: `list[str]` of length `B`
- Strips padding and decoder artefacts
- Handles any reasoning-block conventions the library supports; the exact policy for complete, incomplete, and absent reasoning markers is defined in the implementation

### Other Utilities
- `pad` / `pad_to_length` — Pad tensors to equal or specific lengths
- Various tokenizer helpers for batch processing

## Configuration System

All TRL configs extend `transformers.TrainingArguments`. Each trainer adds its own fields:

| Config | Trainer | Key fields |
|--------|---------|------------|
| `SFTConfig` | SFTTrainer | `max_seq_length`, `packing`, `dataset_text_field` |
| `DPOConfig` | DPOTrainer | `beta`, `loss_type`, `reference_free` |
| `GRPOConfig` | GRPOTrainer | `num_generations`, `beta`, `epsilon`, `reward_functions` |
| `KTOConfig` | KTOTrainer | `beta`, `desirable_weight`, `undesirable_weight` |

## Available References

| File | Contents | When to load |
|------|----------|-------------|
| `references/trl-codebase.md` | Module-by-module guide to TRL source: detailed breakdown of each trainer, model wrappers, data utilities, and CLI commands | When navigating unfamiliar parts of TRL beyond the trainer layer, or when you need details about a specific non-GRPO trainer |