fully deprecate old data generation system by shanjiaz · Pull Request #433 · vllm-project/speculators

shanjiaz · 2026-04-17T16:12:06Z

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Removed the old data generation system and all references. We should wait for the new examples + docs to land before deprecating the old system.

Description

Blocked by the llm-compressor-testing PR (neuralmagic/llm-compressor-testing#261) that removes the old datagen workflow.

- Removed old data generation system and related scripts/infrastructure.
- Moved preprocessing related tests to `integration` and removed old data generation related tests.
- Removed examples that use the old e2e flow.

Related Issue

Tests

Nightly run with the new system: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/24739445799/job/72374945272

I have filled in:

- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [x] The test plan/results, such as providing test command and pasting the results.
- [ ] (Optional) The necessary documentation update.
- [x] I (a human) have written or reviewed the code in this pr to the best of my ability.

coderabbitai · 2026-04-17T16:12:18Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c917c4a9-6300-4905-9574-a4a15f83127c

Walkthrough

Remove legacy data generation infrastructure including vLLM in-process generators, end-to-end orchestrator wrapper, example training scripts, and related configuration/testing code. Refactor data_generation_offline.py to use async endpoint-based hidden-states generation with safetensor output, replacing the prior synchronous vLLM generator approach. Update documentation and tests accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Examples Removal: `README.md`, `scripts/README.md`, `examples/data_generation_and_training/README.md`, `examples/data_generation_and_training/*.py` | Removed end-to-end training examples (Llama3, Qwen3, GPT-OSS) and all training workflow documentation. Includes three example scripts (`llama3_8b_sharegpt_5k.py`, `gpt_oss_20b_ultrachat_5k.py`, `qwen3_8b_sharegpt_ultrachat.py`) that invoked `run_e2e` orchestrator. |
| Docs Navigation & Generation: `docs/examples/index.md`, `docs/index.md`, `docs/scripts/gen_files.py` | Updated documentation structure: removed "Train" card from examples index, removed extra `data_generation_and_training.md` link from main docs, and switched `train.md` source from `scripts/README.md` to `examples/ONLINE_TRAINING.md`. |
| Legacy Data Generation Pipeline: `scripts/data_generation_offline2.py`, `scripts/gen_and_train.py`, `src/speculators/data_generation/vllm_hidden_states_generator.py`, `src/speculators/data_generation/config_generator.py`, `src/speculators/data_generation/custom_worker.py` | Removed in-process vLLM hidden-states generator (`VllmHiddenStatesGenerator` class with VLLM config/execution logic), orchestration wrapper (`gen_and_train.py` that coordinated data generation→vocab mapping→training steps), and configuration/metadata capture infrastructure (`config_generator.py` with `DataGenerationConfig` dataclass, `custom_worker.py` with tensor interception). |
| Data Generation Refactor: `scripts/data_generation_offline.py` | Major architectural shift: replaced synchronous vLLM in-process hidden-states extraction with async endpoint-based pipeline. Now loads preprocessed dataset, queries vLLM server via `openai.AsyncOpenAI`, writes `hs_.safetensors` per-sample outputs. Adds resume support via `hs_.safetensors` index scanning, async concurrency with semaphores, configurable error handling (`--fail-on-error`, `--max-consecutive-errors`), and optional output validation. Removed CLI args for vLLM config (model path, GPU memory, tensor parallelism) and HuggingFace preprocessing; added endpoint/concurrency/retry/validation options. |
| Test Infrastructure: `tests/datagen/test_config_generator.py`, `tests/datagen/test_vllm_hidden_states.py`, `tests/e2e/smoke/test_offline_training.py`, `tests/e2e/smoke/test_resume_optimizer.py`, `tests/e2e/regression/test_eagle3_offline_acceptance.py`, `tests/e2e/utils.py` | Removed unit tests for deleted `config_generator.py` and `vllm_hidden_states_generator.py` modules (GPU accuracy/consistency regression tests). Updated e2e test references: renamed `run_data_generation_offline2()` to `run_data_generation_offline()` in utils and updated all call sites in smoke/regression tests. |
| Configuration & Lint Rules: `.coderabbit.yaml`, `pyproject.toml` | Removed CodeRabbit review guidance for `gen_and_train.py` pipeline verification. Removed Ruff per-file ignore rule for `scripts/gen_and_train.py` T201 (print statements). |
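
The "Data Generation Refactor" row mentions async concurrency bounded by semaphores. As a rough illustration of that pattern only, the sketch below caps the number of in-flight requests with `asyncio.Semaphore`. The names `fake_request` and `run_bounded` are hypothetical stand-ins and are not taken from `scripts/data_generation_offline.py`; the real script talks to a vLLM server rather than sleeping.

```python
import asyncio


async def fake_request(idx: int) -> int:
    """Hypothetical stand-in for one endpoint call (not the script's client)."""
    await asyncio.sleep(0.01)
    return idx * 2


async def run_bounded(indices, concurrency: int):
    """Run fake_request for every index with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def one(idx: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:  # blocks once `concurrency` holders exist
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(idx)
            finally:
                in_flight -= 1

    # gather preserves input order in its result list
    results = await asyncio.gather(*(one(i) for i in indices))
    return results, peak


results, peak = asyncio.run(run_bounded(range(10), concurrency=3))
```

Here `peak` never exceeds the configured limit, which is the property a concurrency option on the CLI would presumably rely on.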

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Expand E2E testing #388: Directly modifies the e2e testing harness and data-generation helper functions with overlapping changes to test imports and function call rewiring between run_data_generation_offline2 and run_data_generation_offline.
Add e2e smoke tests for the new datagen system #378: Related modifications to vLLM-based data-generation scripts and e2e test infrastructure for offline/online datagen and vLLM launch utilities.

Suggested reviewers

fynnsu
dsikka

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: removing and deprecating the old data generation system, which aligns with the substantial deletions and refactoring across the codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description accurately describes the removal of old data generation system, related scripts, and example references, aligning with the substantial changeset. |

mergify · 2026-04-17T16:12:43Z

The quality checks have failed. Please run `make style` and `make quality` from
the root directory to address the lint failures. You will need to install the
`dev` optional dependencies to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

dsikka

Can we make sure to remove the vllm install in the pyproject.toml file? We'll need to verify with Dan that it still gets installed for our e2e tests.

mergify · 2026-04-19T14:59:01Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require two reviews

Wonderful, this rule succeeded.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

#approved-reviews-by >= 2

mergify · 2026-04-20T17:52:26Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shanjiaz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-04-21T00:09:12Z

The quality checks have failed. Please run `make style` and `make quality` from
the root directory to address the lint failures. You will need to install the
`dev` optional dependencies to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

github-actions · 2026-04-21T00:18:58Z

~~Link Check Results (DOCS)~~

All links are now valid - this issue has been resolved.

Marked as resolved: 2299ca5

github-actions · 2026-04-21T00:19:02Z

~~Link Check Results (REPO)~~

All links are now valid - this issue has been resolved.

Marked as resolved: 2299ca5

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

scripts/data_generation_offline.py (1)
340-354: Minor: busy-wait on `QueueFull` could be replaced with a cancellable `queue.put`.

The `put_nowait` + `await asyncio.sleep(0.1)` pattern adds up to ~100 ms of latency per backpressure hit and still requires an extra loop. A cleaner alternative is `await asyncio.wait([queue.put(...), cancel_event.wait()], return_when=FIRST_COMPLETED)` (cancelling the losing task). Not strictly necessary for correctness, but it removes the polling delay and simplifies the cancellation path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/data_generation_offline.py` around lines 340 - 354, The _feed_queue
function currently busy-waits on queue.put_nowait with a sleep loop; replace
that loop with a cancellable await pattern: create two awaitables —
queue.put({"idx": i, "input_ids": item["input_ids"]}) and cancel_event.wait() —
use asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED) to wait for whichever
completes, then if cancel_event won the race break, otherwise ensure you cancel
the pending cancel_event.wait() task (or the pending put task) to avoid leaks
and proceed; remove the asyncio.sleep polling and keep the outer for-loop and
cancel_event.is_set() checks intact so the behavior of _feed_queue, queue,
cancel_event, to_process and dataset is preserved.
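
A minimal sketch of the cancellable-put pattern the nitpick describes, under stated assumptions: `put_or_cancel` is a hypothetical helper name, and the real `_feed_queue` may differ. Both awaitables are wrapped with `asyncio.create_task`, because passing bare coroutines to `asyncio.wait` is rejected from Python 3.11 onward.

```python
import asyncio


async def put_or_cancel(queue: asyncio.Queue, item, cancel_event: asyncio.Event) -> bool:
    """Enqueue `item` unless `cancel_event` fires first. Returns True if enqueued."""
    put_task = asyncio.create_task(queue.put(item))
    cancel_task = asyncio.create_task(cancel_event.wait())
    done, pending = await asyncio.wait(
        {put_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Wait for the cancelled task to unwind so nothing is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return put_task in done


async def demo():
    queue = asyncio.Queue(maxsize=1)
    cancel_event = asyncio.Event()
    first = await put_or_cancel(queue, "a", cancel_event)  # queue has room
    # The queue is now full, so the next put blocks until cancellation fires.
    asyncio.get_running_loop().call_later(0.01, cancel_event.set)
    second = await put_or_cancel(queue, "b", cancel_event)
    return first, second, queue.qsize()


first, second, size = asyncio.run(demo())
```

In a feeder loop, a `False` return would be the signal to break out instead of polling with `asyncio.sleep`.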

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/data_generation_offline.py`:
- Around line 110-119: The CLI flag --layer-ids (args.layer_ids) is parsed but
never used; either propagate it to the client/server request as the vLLM
target-layer parameter or remove the flag to avoid misleading users. Fix option
A: locate where the request payload is constructed (e.g., the function building
the inference/request JSON or the client request call) and add a field like
"target_layer_ids": args.layer_ids (or map to the server's --target-layer-ids
naming) so the server receives the override; ensure any async request builder or
send_request function accepts and forwards this value. Fix option B: delete the
parser.add_argument("--layer-ids", ...) entry and associated help text to remove
the unused flag. Also update any help text or docs and keep the symbol names
args.layer_ids and --target-layer-ids consistent.
- Around line 146-154: The --max-retries CLI option is ignored because
generate_hidden_states_async has no retry logic and the openai.AsyncOpenAI
client is created with max_retries=0; update the worker path to implement an
explicit retry loop that uses the parsed max_retries value: pass the CLI
max_retries into the worker invocation (and into any call sites of
generate_hidden_states_async), wrap the call to generate_hidden_states_async in
a retry loop that retries up to max_retries on transient failures, and ensure
retries interact with _FailureTracker and the --fail-on-error semantics so final
post-retry outcomes are what _FailureTracker records; alternatively, if you
prefer not to implement retries, remove the parser.add_argument("--max-retries",
...) to avoid misleading users.
- Around line 414-419: The ValueError message built when checking args.model vs
model_id is using an f-string only on the first literal and concatenating
adjacent string literals without spaces, so {model_id} is not interpolated and
words run together; update the ValueError in the model-check block (the
args.model/model_id comparison) to use a single f-string (or format call) that
includes {model_id} and proper spacing/punctuation so the actual model_id value
appears in the error message when raising ValueError.
- Around line 447-452: The summary log currently prints args.output which can be
None; change the logger call in the end of the processing block to use the
actual resolved hidden_states_dir variable (the directory created/returned by
generate_and_save_hidden_states) so the message reads "Saved X new data points
to <hidden_states_dir>" and similarly ensure any related warning/log about
skipped samples references hidden_states_dir when appropriate; locate the
logger.info call that uses args.output and replace it with hidden_states_dir
(and adjust scope if hidden_states_dir is returned/available in that function).
- Around line 317-323: Replace the hard process termination in the worker except
block with cooperative cancellation and exception propagation: on exception in
the worker (the except Exception as e block) call cancel_event.set(), log the
exception with logger.exception, record the exception in the shared
worker-exception container used by _shutdown_workers (or push it to a
thread-safe queue/list that _shutdown_workers checks), and then return/raise an
asyncio.CancelledError so the worker exits cleanly; rely on _shutdown_workers to
detect and re-raise the first non-cancellation exception and let main() (and
asyncio.run) perform the final sys.exit(1) and proper async/context cleanup
(this avoids calling os._exit(1) and ensures AsyncOpenAI context managers and
tqdm/atexit handlers run their teardown).
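
The `--max-retries` and worker-termination comments above can be pictured together with the following sketch. It is an assumption-laden illustration, not the script's code: `TransientError`, `call_with_retries`, and this `worker` signature are hypothetical, and the real worker would record outcomes through `_FailureTracker`. The idea is that retries happen first, and only the final failure is recorded and signalled through `cancel_event`, with no `os._exit(1)`.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


async def call_with_retries(fn, max_retries: int, base_delay: float = 0.01):
    """Await fn(); retry up to max_retries times on TransientError."""
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the final failure
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff


async def worker(queue, cancel_event, errors, results, fn, max_retries):
    """Drain the queue; on a final failure, signal peers instead of exiting."""
    while not cancel_event.is_set():
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        try:
            results.append(await call_with_retries(lambda: fn(item), max_retries))
        except Exception as exc:
            errors.append(exc)   # kept so the caller can re-raise it later
            cancel_event.set()   # ask the other workers to stop cooperatively
            return


async def demo():
    attempts = {}

    async def flaky(item):
        attempts[item] = attempts.get(item, 0) + 1
        if item == 1 and attempts[item] < 3:
            raise TransientError("try again")
        if item == 3:
            raise ValueError("permanent failure")
        return item * 10

    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    cancel_event, errors, results = asyncio.Event(), [], []
    await worker(queue, cancel_event, errors, results, flaky, max_retries=2)
    return results, errors, attempts, queue.qsize()


results, errors, attempts, remaining = asyncio.run(demo())
```

The caller can then inspect `errors` after all workers return and decide whether to exit non-zero, which leaves context managers and exit handlers free to run.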

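The error-message finding for lines 414-419 comes down to how Python joins adjacent string literals: the `f` prefix applies per literal, not to the whole concatenation. The snippet below reproduces that class of bug with made-up message text and a hypothetical `model_id` value; it is not the script's actual message.

```python
model_id = "served-model"  # hypothetical value, for illustration only

# Only the first literal has the f prefix, so the placeholder in the second
# literal stays as literal text, and the two pieces join with no space.
broken = (
    f"Requested model does not match."
    "The endpoint is serving {model_id}"
)

# Put the f prefix on the literal that holds the placeholder and end the
# preceding literal with a space.
fixed = (
    "Requested model does not match. "
    f"The endpoint is serving {model_id}"
)
```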
ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a40e290b-8621-4c71-bb80-3f7188755440

📥 Commits

Reviewing files that changed from the base of the PR and between 960263b and 4f15058.

📒 Files selected for processing (26)

.coderabbit.yaml
README.md
docs/examples/index.md
docs/index.md
docs/scripts/gen_files.py
examples/data_generation_and_training/README.md
examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
pyproject.toml
scripts/README.md
scripts/data_generation_offline.py
scripts/data_generation_offline2.py
scripts/gen_and_train.py
src/speculators/data_generation/config_generator.py
src/speculators/data_generation/custom_worker.py
src/speculators/data_generation/vllm_hidden_states_generator.py
tests/datagen/test_config_generator.py
tests/datagen/test_vllm_hidden_states.py
tests/e2e/regression/test_eagle3_offline_acceptance.py
tests/e2e/smoke/test_offline_training.py
tests/e2e/smoke/test_resume_optimizer.py
tests/e2e/utils.py
tests/integration/datagen/__init__.py
tests/integration/datagen/test_preprocessing.py
tests/integration/datagen/test_regex_patterns.py

💤 Files with no reviewable changes (15)

docs/examples/index.md
pyproject.toml
README.md
examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
src/speculators/data_generation/custom_worker.py
examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
src/speculators/data_generation/vllm_hidden_states_generator.py
scripts/gen_and_train.py
scripts/data_generation_offline2.py
tests/datagen/test_config_generator.py
examples/data_generation_and_training/README.md
tests/datagen/test_vllm_hidden_states.py
scripts/README.md
src/speculators/data_generation/config_generator.py

rahul-tuli

LGTM pending removal of vllm dependency from pyproject.toml

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

## Purpose

Removed the old data generation system and all references. We should wait for the new examples + docs to land before deprecating the old system.

## Description

Blocked by the llm-compressor-testing [PR](neuralmagic/llm-compressor-testing#261) that removes the old datagen workflow.

- Removed old data generation system and related scripts/infrastructure.
- Moved preprocessing related tests to `integration` and removed old data generation related tests.
- Removed examples that use the old e2e flow.

## Related Issue

## Tests

Nightly run with the new system: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/24739445799/job/72374945272

I have filled in:

- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [x] The test plan/results, such as providing test command and pasting the results.
- [ ] (Optional) The necessary documentation update.
- [x] I (a human) have written or reviewed the code in this pr to the best of my ability.

---------

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

mergify Bot added the documentation and quality-failed labels and removed the quality-failed label Apr 17, 2026

dsikka reviewed Apr 18, 2026

shanjiaz added the two-reviews label Apr 19, 2026

mergify Bot added the needs-rebase label Apr 20, 2026

shanjiaz force-pushed the deprecate-old-datagen branch from eeae10a to 8b6f649 Compare April 21, 2026 00:08

mergify Bot added the quality-failed label and removed the needs-rebase and quality-failed labels Apr 21, 2026

shanjiaz marked this pull request as ready for review April 21, 2026 13:18

coderabbitai Bot reviewed Apr 21, 2026

Comment thread scripts/data_generation_offline.py

Comment thread scripts/data_generation_offline.py

Comment thread scripts/data_generation_offline.py

Comment thread scripts/data_generation_offline.py

Comment thread scripts/data_generation_offline.py

rahul-tuli reviewed Apr 21, 2026

Comment thread pyproject.toml Outdated

fynnsu approved these changes Apr 21, 2026

shanjiaz added 7 commits April 21, 2026 13:01

fully deprecate old data generation system

99f1665

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

min diff

fee40d0

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

move datagen tests to integration

3945770

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

min diff

deaab3a

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

fix doc failure

38df3d3

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

fix links

aeb42aa

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

remove vllm dependency

6c7c093

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>

shanjiaz force-pushed the deprecate-old-datagen branch from 960d8a4 to 6c7c093 Compare April 21, 2026 17:01

Merge branch 'main' into deprecate-old-datagen

92f1856

shanjiaz requested review from dsikka and rahul-tuli April 21, 2026 17:33

rahul-tuli approved these changes Apr 21, 2026

shanjiaz added 2 commits April 21, 2026 14:06

Merge branch 'main' into deprecate-old-datagen

2977d6f

Merge branch 'main' into deprecate-old-datagen

2299ca5

shanjiaz enabled auto-merge (squash) April 21, 2026 20:01

shanjiaz merged commit 8fdee2d into main Apr 21, 2026
14 of 15 checks passed

shanjiaz deleted the deprecate-old-datagen branch April 21, 2026 20:04

coderabbitai Bot mentioned this pull request Apr 21, 2026

docs: restructure documentation to align with vLLM format #432

Merged

13 tasks

k-l-lambda mentioned this pull request Jun 8, 2026

Kimi K2.x online cross-node datagen over RDMA (retain in-process hidden-states generator) novitalabs/speculators#1

Open

4 participants