Conversation

@garrett4wade (Collaborator)

Motivation

AReaL is too heavy for AI researchers to use, understand, and develop with, for several reasons. The most important issue is that its code architecture is system-centric rather than AI-centric — the RL algorithm workflow consists of multiple workers that run consecutive model function calls, and neither workers nor model function calls are well-known concepts for AI researchers. As a result, users must first understand these concepts before they can develop workflows and algorithms for their own use cases.

Additionally, for historical reasons, AReaL's code is not clean. Large pieces of code inherited from previous projects are no longer useful but significantly increase the burden on users and developers. Sometimes debugging is difficult even for core developers like myself.

Since the tools for building RL workflows are becoming increasingly mature, implementing a framework that achieves comparable efficiency now requires far fewer lines of code. This is the proper time to revisit the API design and distill the giant codebase into a neat and clean one. The distilled codebase does not need to be ultra-efficient. Instead, we want to deliver 90% of the functionality of the original AReaL while minimizing the lines of code and the burden on potential users. Our aim is to build an RL training framework that is fast to use, fast to read, and fast to execute. Here comes the lite version of AReaL — AReaLite.

AReaLite is the first step in AReaL's refactoring process. It is not only a standalone training library with shallow interfaces, but will also provide the core API definitions to be used by AReaL in the future. AReaL will essentially transform its current worker-based architecture into an AI-centric architecture like AReaLite. AReaL will extend AReaLite's APIs and implementations to support more backends for efficient large-scale training.

Expectations of AReaLite

Highlights

  • Fully asynchronous training with decoupled inference and training.
  • Elastic inference device scaling — users can shut down or launch more inference processes independently during training.
  • Full SFT/RL algorithmic functionality matching AReaL.
  • Arbitrary agentic rollout workflow customization in a single file.
  • Easy navigation to implementation references via Ctrl+click in VSCode.
  • Support for distributed launching with Ray/SLURM/torchrun.

AReaLite's Scope

  • Not bound to Ray.
  • Only supports SGLang and PyTorch FSDP2 with SPMD launching.
  • No customized data structures like SequenceSample. All data are PyTorch tensors.
  • Uses HuggingFace (models, datasets) and PyTorch (FSDP, data structures) as much as possible.

Architecture

Core Components

arealite/
├── api/           # Abstract interfaces and data structures
├── impl/          # Concrete implementations
├── cli/           # Command-line interfaces
├── config/        # Configuration templates
└── tests/         # Standalone test scripts

Data Flow Architecture

AReaLite uses an async producer-consumer pattern:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   LLM Servers   │◄──►│ Rollout Workers  │───►│   Data Buffer   │
│   (SGLang)      │    │  (Async Batch)   │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
        ▲                                               │
        │                                               ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Checkpoints   │◄───│  FSDP Trainer    │◄───│ Training Loop   │
│                 │    │  (Sync Batch)    │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘
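The async producer-consumer pattern above can be sketched with an `asyncio` queue. This is an illustrative toy, not AReaLite code: the names `rollout_worker`, `training_loop`, and the dict-based trajectory are stand-ins for the real rollout workers, data buffer, and trainer.

```python
import asyncio

async def rollout_worker(buffer: asyncio.Queue, prompts: list) -> None:
    # Producer: generate trajectories asynchronously and push them to the buffer.
    for prompt in prompts:
        traj = {"prompt": prompt, "reward": 1.0}  # stand-in for a real LLM rollout
        await buffer.put(traj)
    await buffer.put(None)  # sentinel: no more data

async def training_loop(buffer: asyncio.Queue, batch_size: int) -> list:
    # Consumer: drain the buffer into fixed-size batches for the trainer.
    batches, batch = [], []
    while True:
        traj = await buffer.get()
        if traj is None:
            break
        batch.append(traj)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    return batches

async def main() -> list:
    buffer: asyncio.Queue = asyncio.Queue(maxsize=8)
    producer = asyncio.create_task(
        rollout_worker(buffer, [f"q{i}" for i in range(4)])
    )
    batches = await training_loop(buffer, batch_size=2)
    await producer
    return batches
```

Because the producer and consumer only share the bounded queue, inference processes can be scaled independently of the training loop, which is what makes the elastic-scaling highlight above possible.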

Key Design Principles

1. AI-Centric API Design

Unlike the original AReaL's system-centric approach with workers and model functions, AReaLite uses familiar ML concepts:

  • Agent and Environment (from RL literature)
  • RolloutWorkflow (combines multiple agents and the environment to generate rollout data)
  • Trainer (from HuggingFace/PyTorch, fetches rollout data and updates model parameters)
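A minimal sketch of how these three concepts might fit together. All class and method names here are illustrative assumptions, not the actual AReaLite API:

```python
class Environment:
    """Toy environment: scores the agent's answer against a known target."""

    def __init__(self, question: str, answer: str):
        self.question, self.answer = question, answer

    def reward(self, completion: str) -> float:
        return 1.0 if completion.strip() == self.answer else 0.0


class Agent:
    """Toy agent: wraps a text-in/text-out model call."""

    def __init__(self, model):
        self.model = model  # any callable str -> str

    def act(self, prompt: str) -> str:
        return self.model(prompt)


class RolloutWorkflow:
    """Combines an agent and an environment to produce one training sample."""

    def __init__(self, agent: Agent):
        self.agent = agent

    def run(self, env: Environment) -> dict:
        completion = self.agent.act(env.question)
        return {
            "prompt": env.question,
            "completion": completion,
            "reward": env.reward(completion),
        }
```

For example, `RolloutWorkflow(Agent(lambda p: "4")).run(Environment("2+2?", "4"))` yields a sample whose reward is 1.0; the Trainer would then consume such samples to update model parameters.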

2. Factory Pattern for Extensibility

Each major component uses a factory pattern for easy customization:

  • EngineFactory creates training backends
  • TrainerFactory creates training algorithms
  • RolloutWorkflowFactory creates rollout workflows
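A hedged sketch of the registry-style factory this implies. `EngineFactory`, `EngineConfig`, and `FSDPEngine` below are simplified stand-ins for the real classes:

```python
from dataclasses import dataclass


class FSDPEngine:
    """Placeholder for the FSDP2 training backend."""
    name = "fsdp"


@dataclass
class EngineConfig:
    type: str = "fsdp"  # which registered backend to construct


class EngineFactory:
    # Map config names to backend classes; users extend this to plug in their own.
    _registry = {"fsdp": FSDPEngine}

    @classmethod
    def register(cls, name: str, engine_cls) -> None:
        cls._registry[name] = engine_cls

    @classmethod
    def make(cls, config: EngineConfig):
        try:
            return cls._registry[config.type]()
        except KeyError:
            raise ValueError(f"Unknown engine type: {config.type!r}") from None
```

The point of the pattern is that custom backends, trainers, or workflows only need a `register` call, not edits to framework internals.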

3. Configuration-Driven Architecture

All components are configured through dataclasses defined in cli_args.py, enabling:

  • Type-safe configuration
  • Easy CLI argument generation
  • Clear documentation of available options
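A small sketch of how dataclass-driven configuration can generate CLI flags. The field names and `make_parser` helper are illustrative, not taken from cli_args.py:

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class TrainConfig:
    lr: float = 1e-4      # learning rate
    batch_size: int = 32  # per-step batch size
    model_path: str = "my-model"  # hypothetical model identifier


def make_parser(cfg_cls) -> argparse.ArgumentParser:
    # Generate one typed, defaulted CLI flag per dataclass field.
    parser = argparse.ArgumentParser()
    for f in fields(cfg_cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


# Usage: cfg = TrainConfig(**vars(make_parser(TrainConfig).parse_args()))
```

Because every option lives in one dataclass, the type, default, and documentation of each knob sit in a single place, and parsing back into `TrainConfig` keeps the configuration type-safe end to end.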

Roadmap

  • Finalize API design. (In-progress)
  • Implement standalone SGLang server (impl/sglang_server.py).
  • Implement SGLang client generation (impl/sglang_client.py).
  • Rollout pipeline (tests/test_rollout.py).
  • SGLang rollout interruption.
  • Asynchronous RL system-wide utilities (e.g., RolloutController).
  • Various launching scripts: ray, torchrun, slurm.
  • FSDP2 engine with transformers models. (In-progress)
  • SFT trainer. (In-progress)
  • SGLang update weights. (In-progress)
  • PPO trainer. (In-progress)
  • Add benchmarking against the original AReaL.
  • CI and unittests.
  • Other RL algorithms (DPO, REINFORCE, etc.).
  • Support for multi-modal models.
  • User guide for transitioning from v0.3.0.
  • Advanced agentic workflows (tool use, planning).
  • Examples of training GSM8K, TLDR, and a search agent.
  • Allow external persistent SGLang servers for debugging purposes.

Collaborator

Add a main entry point annotation somewhere?

Collaborator Author

Added a data flow arch graph in the code walk-through example: https://github.com/inclusionAI/AReaL/blob/lite/docs/arealite/gsm8k_grpo.md

```python
if config.type == "my_collector":
    return MyCollector(...)
```

Collaborator

One thing production systems often lack is systematic support for reward models, verifiers, and generative reward models. From the RL point of view these can be encapsulated in the concept of an environment, but it is cumbersome to integrate them. A method for this would be good for those who are using RL in production.

Collaborator Author

Reward models can be implemented as additional TrainEngines, just like the reference model. Generative reward models are encapsulated in the RolloutWorkflow object (aka the previous RolloutCollector); e.g., the agent can call the same LLM with different prompts to judge whether the previously generated answer is correct.
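A toy sketch of that idea, with the same callable playing both generator and judge. The function names and prompt format are hypothetical, not AReaLite API:

```python
def judge_with_same_llm(llm, question: str, answer: str) -> float:
    # Generative reward: re-prompt the same LLM to grade the generated answer.
    verdict = llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply YES if the answer is correct, otherwise NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0


def rollout_with_generative_reward(llm, question: str) -> dict:
    # One workflow step: generate, then judge, then emit a training sample.
    answer = llm(question)
    return {
        "prompt": question,
        "completion": answer,
        "reward": judge_with_same_llm(llm, question, answer),
    }
```

In a real workflow, `llm` would be an async client call to the SGLang server; here any `str -> str` callable works.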

```python
def __str__(self):
    """Returns compact string representation: 'Parallel(mp=X,pp=Y,dp=Z)'."""
    return (
        f"Parallel(mp={self.tensor_parallel_size},"
```
Collaborator

mp -> tp ?

Collaborator Author

No "mp" any more.

```python
c_clip: Optional[float] = field(
    default=None,
    metadata={
        "help": "Dual clipping factor for policy ratio, must > 1.0. None disables dual clipping."
```
Collaborator

Is there any data validation, similar to Pydantic, to check the range of parameters here?

Collaborator Author

Good idea. I'll mark that.
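One lightweight option, without adding a Pydantic dependency, is a `__post_init__` range check on the dataclass. A sketch (the `PPOConfig` wrapper is assumed for illustration; only the `c_clip` field comes from the diff above):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PPOConfig:
    c_clip: Optional[float] = field(
        default=None,
        metadata={
            "help": "Dual clipping factor for policy ratio, must be > 1.0. "
                    "None disables dual clipping."
        },
    )

    def __post_init__(self):
        # Validate ranges at construction time, mirroring Pydantic-style checks.
        if self.c_clip is not None and self.c_clip <= 1.0:
            raise ValueError(f"c_clip must be > 1.0, got {self.c_clip}")
```

This fails fast at config-construction time instead of deep inside the PPO loss computation.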

```python
# Licensed under the Apache License, Version 2.0

import abc
import asyncio
```
Collaborator

I saw there is uvloop as the async event loop. Is it actually used in the project?

Collaborator Author

Changed to uvloop.run wherever asyncio.run was used.

```python
# Cleanup registry
try:
    self.registry.unregister_server(self.server_id)
except Exception as e:
```
Collaborator

The catch seems a bit too wide
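A narrower variant would catch only the failures unregistration can actually raise. The `ServerRegistry` stand-in and the choice of `KeyError`/`ConnectionError` are assumptions about the registry's behavior, purely for illustration:

```python
import logging

logger = logging.getLogger(__name__)


class ServerRegistry:
    """Minimal stand-in for the real server registry."""

    def __init__(self):
        self._servers = {}

    def register_server(self, server_id: str) -> None:
        self._servers[server_id] = True

    def unregister_server(self, server_id: str) -> None:
        del self._servers[server_id]  # raises KeyError if unknown


def cleanup(registry: ServerRegistry, server_id: str) -> bool:
    try:
        registry.unregister_server(server_id)
        return True
    except (KeyError, ConnectionError) as e:  # only the failures we expect
        logger.warning("Failed to unregister %s: %s", server_id, e)
        return False
```

Anything outside the expected set (e.g. a programming bug) then propagates instead of being silently logged.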


```python
return Trajectory(
    prompt=env_option,
    data=dict(rewards=torch.tensor(rewards), **pad_sequences_to_tensors(data)),
```
Collaborator

Should **pad_sequences_to_tensors(data) be moved inside the function, returning data=data_dict? TensorDict is not that bad, though.

Collaborator Author

Uses TensorDict as the basic data structure.

```python
# Run batched rollout by submitting requests to LLM servers
trajs = self.rollout_controller.generate_batch(
    batch_size=len(data),
    env_options=data,
```
Collaborator

This is good, very clear on prepare_batch and generate_batch.

```python
    base_model_path=self.config.actor.path,
)

assert len(mb_stats) == self.config.ppo_n_minibatches
```
Collaborator

Throw in some hints here, such as f"Minibatch count mismatch: got {len(mb_stats)}, but config expects {self.config.ppo_n_minibatches}. Check your configuration."

Collaborator

haha, why is this empty?

Collaborator Author

It's not empty in the new lite branch. :)

futrime and others added 2 commits July 7, 2025 09:36
* ci: add test-arealite

* ci: add checkout before running test-arealite

* ci: add USERNAME

* ci: add test script

* ci: add GitHub mirror

* ci: fix typo

* ci: clone one commit

* ci: fix condition

* ci: set command timeout to 60m

* ci: enable pip cache

* ci: optimize container lifecycle

* ci: split into many stages

* ci(test-arealite): fix typo

* ci: fix wrong env

* ci: fix pytest

* ci: uninstall transformer-engine

* ci: uninstall transformer-engine

* ci: fix model paths

* ci: show stdout/stderr

* ci: fix not clean up

* ci: backup sglang

* ci: remove tmp repo dir when run

* ci: fix docker run exit 1 condition

* ci(test-arealite): limit the concurrency and extend command timeout
@garrett4wade (Collaborator Author)

@tsaoyu Hi Yu, thank you for the thorough review!

We're currently finalizing some internal discussions about the API design, and the final implementation will differ slightly from what's currently proposed. We'd like to defer addressing these specific issues for a few days while we complete those discussions.

Your feedback is valuable and we'll incorporate it into our revised implementation. We'd appreciate having you review again once the final version is ready.

Thanks for your patience!

@garrett4wade (Collaborator Author)

Closed since #154 has been merged.

@garrett4wade deleted the fw/refactor branch December 31, 2025 06:10