Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 20 additions & 14 deletions docs/about/concepts/key-terminology.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,33 +2,39 @@

# Key Terminology

Essential vocabulary for agent training, RL workflows, and NeMo Gym. This glossary defines terms you'll encounter throughout the tutorials and documentation.
Essential vocabulary for model training, RL workflows, and NeMo Gym. This glossary defines terms you'll encounter throughout the tutorials and documentation.

## Rollout & Data Collection Terms
```{glossary}
Rollout / Trajectory
A complete sequence of agent-environment interactions, from initial prompt through tool usage to final reward score. The complete "story" of one agent attempt.
A complete sequence of model, user, and environment interactions, from initial prompt through tool usage to final reward score. The complete "story" of one attempt at a task.

Rollout Batch
A collection of multiple rollouts generated together, typically for the same task. Used for efficient parallel processing.

Environment
The conditions in which your model operates. Functionally, this typically refers to tools the model has access to.

Task
An input prompt paired with environment setup (tools + verification). What you want agents to learn to do.
An input prompt paired with environment setup (tools + verification). What you want models to learn to do.

Task Instance
A single rollout attempt for a specific task. Multiple instances per task capture different approaches.

Training environment
A set of tasks that share the same environment setup compiled into a single prompt dataset.

Trace
Detailed log of a rollout including metadata for debugging or interpretability.

Data Generation Process
The complete pipeline from input prompt to scored rollout, involving agent orchestration, model inference, tool usage, and verification.
The complete pipeline from input prompt to scored rollout, involving rollout orchestration, model inference, tool usage, and verification.

Rollout Collection
The process of applying your data generation pipeline to input prompts at scale.

Demonstration Data
Training data format for SFT consisting of input prompts paired with successful agent responses. Shows models examples of correct behavior.
Training data format for SFT consisting of input prompts paired with successful rollouts. Shows models examples of correct behavior.

Preference Pairs
Training data format for DPO consisting of the same prompt with two different responses, where one is preferred over the other.
Expand All @@ -47,30 +53,30 @@ Orchestration
Coordination logic that manages when to call models, which tools to use, and how to sequence multi-step operations.

Verifier
Component that scores agent outputs, producing reward signals. May also refer colloquially to "training environment with verifiable rewards."
Component that scores rollouts, producing reward signals. The word "verifier" may also refer colloquially to a different definition: "training environment with verifiable rewards."

Service Discovery
Mechanism by which distributed NeMo Gym components find and communicate with each other across machines.

Reward / Reward Signal
Numerical score (typically 0.0-1.0) indicating how well an agent performed on a task.
Numerical score (typically 0.0-1.0) indicating how well a task was accomplished.
```

## Training Approaches

```{glossary}
SFT (Supervised Fine-Tuning)
Training approach using examples of good agent behavior. Shows successful rollouts as training data.
Training approach using examples of good model behavior. Shows successful rollouts as training data.

DPO (Direct Preference Optimization)
Training approach using pairs of rollouts where one is preferred over another. Teaches better vs worse responses.

RL (Reinforcement Learning)
Training approach where agents learn through trial-and-error interaction with environments using reward signals.
Training approach where models learn through trial-and-error interaction with environments using reward signals.

Online vs Offline Training
- **Online**: Agent learns while interacting with environment in real-time (RL)
- **Offline**: Agent learns from pre-collected rollout data (SFT/DPO)
- **Online**: Model learns while interacting with environment in real-time (RL)
- **Offline**: Model learns from pre-collected rollout data (SFT/DPO)
```

## Interaction Patterns
Expand All @@ -80,17 +86,17 @@ Multi-turn
Conversations spanning multiple exchanges where context and state persist across turns.

Multi-step
Complex tasks requiring agents to break problems into sequential steps, often using tools and intermediate reasoning.
Complex tasks requiring models to break problems into sequential steps, often using tools and intermediate reasoning.

Tool Use / Function Calling
Agents invoking external capabilities (APIs, calculators, databases) to accomplish tasks beyond text generation.
Models invoking external capabilities (APIs, calculators, databases) to accomplish tasks beyond text generation.
```

## Technical Infrastructure

```{glossary}
Responses API
OpenAI's standard interface for agent interactions, including function calls and multi-turn conversations. NeMo Gym's native format.
OpenAI's standard interface for rollouts, including function calls and multi-turn conversations. NeMo Gym's native format.

Chat Completions API
OpenAI's simpler interface for basic LLM interactions. NeMo Gym includes middleware to convert formats.
Expand Down
4 changes: 4 additions & 0 deletions docs/about/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,10 @@ Embedding custom training environments directly within training frameworks is co
- Interoperable with existing environments, systems, and RL training frameworks
- Growing collection of training environments and datasets for Reinforcement Learning from Verifiable Reward (RLVR)

:::{tip}
The name "NeMo Gym" comes from historical reinforcement learning literature, where the word "Gym" refers to a collection of RL training environments!
:::

## Core Components

A training environment consists of three server components:
Expand Down