Refactor dataset part to use the new dataset and processor interface #1552

@RayenTian

Description

Refactor the SFT and GRPO entry scripts to adopt the new dataset processor interface built on RawDataset and PROCESSOR_REGISTRY, removing ad‑hoc preprocessing and aligning the SFT path with the GRPO data path.

Motivation

We recently introduced a unified processor interface and registration mechanism for datasets in PR !1506. The SFT path still wires processors and preprocessing via partials in run_sft.py, which diverges from the new interface and duplicates logic. Unifying both paths reduces maintenance, enables consistent configuration, and simplifies documentation.

Scope

  1. Replace ad‑hoc processor wiring in run_sft.py with the standard processor interface (TaskDataSpec + PROCESSOR_REGISTRY).
  2. Ensure datasets used by SFT can declare/select their processor consistently via config (e.g., data.processor).
  3. Make the SFT datasets inherit from the RawDataset class so that all datasets share a unified interface.
  4. Keep parity with existing SFT features (optional BOS/EOS, add_generation_prompt, optional image preprocessing for CLEVR, etc.) by moving them into the processor layer where appropriate.
  5. Move the custom OmegaConf ops into a common place. Refer to this comment.
  6. Unify dataset initialization via super().__init__(). Refer to this comment.
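
To make the target shape of items 1 to 3 and 6 concrete, here is a minimal, self-contained sketch of a registry-based processor path. It is an illustration only, not the repository's implementation: the names TaskDataSpec, PROCESSOR_REGISTRY, RawDataset, and the data.processor config key come from this issue, but every field, signature, and helper below (register_processor, sft_processor, ToySFTDataset, the add_bos/add_eos fields, the literal BOS/EOS strings) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class TaskDataSpec:
    # Hypothetical fields; the real TaskDataSpec may look different.
    task_name: str
    add_bos: bool = False
    add_eos: bool = False


Processor = Callable[[Dict[str, Any], TaskDataSpec], Dict[str, Any]]

# Name -> processor mapping, so a config can select a processor by string.
PROCESSOR_REGISTRY: Dict[str, Processor] = {}


def register_processor(name: str) -> Callable[[Processor], Processor]:
    """Hypothetical decorator that adds a processor to the registry."""
    def decorator(fn: Processor) -> Processor:
        if name in PROCESSOR_REGISTRY:
            raise ValueError(f"processor {name!r} is already registered")
        PROCESSOR_REGISTRY[name] = fn
        return fn
    return decorator


@register_processor("sft_processor")
def sft_processor(example: Dict[str, Any], spec: TaskDataSpec) -> Dict[str, Any]:
    # Optional BOS/EOS handling lives in the processor, not in the entry script.
    text = example["prompt"] + example["response"]
    if spec.add_bos:
        text = "<s>" + text
    if spec.add_eos:
        text = text + "</s>"
    return {"text": text, "task_name": spec.task_name}


class RawDataset:
    """Hypothetical base class: holds raw examples plus the selected processor."""

    def __init__(self, examples: List[Dict[str, Any]], task_spec: TaskDataSpec,
                 processor_name: str) -> None:
        self.examples = examples
        self.task_spec = task_spec
        self.processor = PROCESSOR_REGISTRY[processor_name]

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self.processor(self.examples[idx], self.task_spec)


class ToySFTDataset(RawDataset):
    """An SFT dataset that defers all shared setup to super().__init__()."""

    def __init__(self, task_spec: TaskDataSpec,
                 processor_name: str = "sft_processor") -> None:
        examples = [{"prompt": "2+2=", "response": "4"}]
        super().__init__(examples, task_spec, processor_name)


# A plain dict stands in for the OmegaConf config; data.processor picks the processor.
config = {"data": {"processor": "sft_processor", "add_bos": True, "add_eos": True}}
spec = TaskDataSpec(task_name="sft",
                    add_bos=config["data"]["add_bos"],
                    add_eos=config["data"]["add_eos"])
dataset = ToySFTDataset(spec, config["data"]["processor"])
first = dataset[0]
print(first["text"])  # <s>2+2=4</s>
```

With this shape, the entry script only builds the spec and looks up a processor name from the config, so SFT and GRPO can share the same wiring instead of assembling partials per script.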

Labels: enhancement