
fully deprecate old data generation system #433

Merged
shanjiaz merged 10 commits into main from deprecate-old-datagen
Apr 21, 2026

@shanjiaz shanjiaz commented Apr 17, 2026

Collaborator

Purpose

Removed the old data generation system and all references. We should wait for the new examples + docs to land before deprecating the old system.

Description

Blocked by the llm-compressor-testing PR (neuralmagic/llm-compressor-testing#261) that removes the old datagen workflow.

  • Removed old data generation system and related scripts/infrastructure.
  • Moved preprocessing related tests to integration and removed old data generation related tests.
  • Removed examples that use the old e2e flow.

Related Issue

Tests

Nightly run with the new system: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/24739445799/job/72374945272

I have filled in:

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan/results, such as providing test command and pasting the results.
  • (Optional) The necessary documentation update.
  • I (a human) have written or reviewed the code in this pr to the best of my ability.

coderabbitai Bot commented Apr 17, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c917c4a9-6300-4905-9574-a4a15f83127c

Walkthrough

Remove legacy data generation infrastructure including vLLM in-process generators, end-to-end orchestrator wrapper, example training scripts, and related configuration/testing code. Refactor data_generation_offline.py to use async endpoint-based hidden-states generation with safetensor output, replacing the prior synchronous vLLM generator approach. Update documentation and tests accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation & Examples Removal**<br>`README.md`, `scripts/README.md`, `examples/data_generation_and_training/README.md`, `examples/data_generation_and_training/*.py` | Removed end-to-end training examples (Llama3, Qwen3, GPT-OSS) and all training workflow documentation. Includes three example scripts (`llama3_8b_sharegpt_5k.py`, `gpt_oss_20b_ultrachat_5k.py`, `qwen3_8b_sharegpt_ultrachat.py`) that invoked `run_e2e` orchestrator. |
| **Docs Navigation & Generation**<br>`docs/examples/index.md`, `docs/index.md`, `docs/scripts/gen_files.py` | Updated documentation structure: removed "Train" card from examples index, removed extra `data_generation_and_training.md` link from main docs, and switched `train.md` source from `scripts/README.md` to `examples/ONLINE_TRAINING.md`. |
| **Legacy Data Generation Pipeline**<br>`scripts/data_generation_offline2.py`, `scripts/gen_and_train.py`, `src/speculators/data_generation/vllm_hidden_states_generator.py`, `src/speculators/data_generation/config_generator.py`, `src/speculators/data_generation/custom_worker.py` | Removed in-process vLLM hidden-states generator (`VllmHiddenStatesGenerator` class with VLLM config/execution logic), orchestration wrapper (`gen_and_train.py` that coordinated data generation→vocab mapping→training steps), and configuration/metadata capture infrastructure (`config_generator.py` with `DataGenerationConfig` dataclass, `custom_worker.py` with tensor interception). |
| **Data Generation Refactor**<br>`scripts/data_generation_offline.py` | Major architectural shift: replaced synchronous vLLM in-process hidden-states extraction with async endpoint-based pipeline. Now loads preprocessed dataset, queries vLLM server via `openai.AsyncOpenAI`, writes `hs_*.safetensors` per-sample outputs. Adds resume support via `hs_*.safetensors` index scanning, async concurrency with semaphores, configurable error handling (`--fail-on-error`, `--max-consecutive-errors`), and optional output validation. Removed CLI args for vLLM config (model path, GPU memory, tensor parallelism) and HuggingFace preprocessing; added endpoint/concurrency/retry/validation options. |
| **Test Infrastructure**<br>`tests/datagen/test_config_generator.py`, `tests/datagen/test_vllm_hidden_states.py`, `tests/e2e/smoke/test_offline_training.py`, `tests/e2e/smoke/test_resume_optimizer.py`, `tests/e2e/regression/test_eagle3_offline_acceptance.py`, `tests/e2e/utils.py` | Removed unit tests for deleted `config_generator.py` and `vllm_hidden_states_generator.py` modules (GPU accuracy/consistency regression tests). Updated e2e test references: renamed `run_data_generation_offline2()` to `run_data_generation_offline()` in utils and updated all call sites in smoke/regression tests. |
| **Configuration & Lint Rules**<br>`.coderabbit.yaml`, `pyproject.toml` | Removed CodeRabbit review guidance for `gen_and_train.py` pipeline verification. Removed Ruff per-file ignore rule for `scripts/gen_and_train.py` T201 (print statements). |
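
The Data Generation Refactor summary above mentions resume support via `hs_*.safetensors` index scanning and semaphore-bounded async concurrency. The following is a minimal sketch of how those two ideas could fit together, not the script's actual code. It assumes the files are named `hs_<integer>.safetensors`, and the names `scan_completed_indices`, `process_pending`, and `fetch` are hypothetical. `fetch` stands in for the real endpoint call, and raw bytes are written in place of real safetensors serialization.

```python
import asyncio
import re
import tempfile
from pathlib import Path

# Assumed naming scheme: hs_<integer index>.safetensors
HS_PATTERN = re.compile(r"^hs_(\d+)\.safetensors$")


def scan_completed_indices(output_dir: Path) -> set[int]:
    """Return the sample indices that already have an output file."""
    done = set()
    for path in output_dir.glob("hs_*.safetensors"):
        match = HS_PATTERN.match(path.name)
        if match:  # ignore files that do not carry an integer index
            done.add(int(match.group(1)))
    return done


async def process_pending(num_samples, output_dir, fetch, concurrency=4):
    """Process only the samples without an output file, at most
    `concurrency` requests in flight at once."""
    completed = scan_completed_indices(output_dir)
    pending = [i for i in range(num_samples) if i not in completed]
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(idx):
        async with semaphore:  # bound the number of concurrent requests
            payload = await fetch(idx)  # hypothetical stand-in for the endpoint call
        (output_dir / f"hs_{idx}.safetensors").write_bytes(payload)

    await asyncio.gather(*(worker(i) for i in pending))
    return pending
```

Rerunning `process_pending` against the same directory skips every index that was written by an earlier run, which is the resume behaviour the summary describes.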

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Expand E2E testing #388: Directly modifies the e2e testing harness and data-generation helper functions with overlapping changes to test imports and function call rewiring between run_data_generation_offline2 and run_data_generation_offline.
  • Add e2e smoke tests for the new datagen system #378: Related modifications to vLLM-based data-generation scripts and e2e test infrastructure for offline/online datagen and vLLM launch utilities.

Suggested reviewers

  • fynnsu
  • dsikka
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: removing and deprecating the old data generation system, which aligns with the substantial deletions and refactoring across the codebase. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description accurately describes the removal of old data generation system, related scripts, and example references, aligning with the substantial changeset. |

mergify Bot commented Apr 17, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@mergify mergify Bot added the documentation and quality-failed labels and removed the quality-failed label Apr 17, 2026

@dsikka dsikka left a comment

Collaborator

Can we make sure to remove the vllm install from the pyproject.toml file? We'll need to verify with Dan that it still gets installed for our e2e tests.

mergify Bot commented Apr 19, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require two reviews

Wonderful, this rule succeeded.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

  • #approved-reviews-by >= 2

mergify Bot commented Apr 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shanjiaz.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 20, 2026
@shanjiaz shanjiaz force-pushed the deprecate-old-datagen branch from eeae10a to 8b6f649 Compare April 21, 2026 00:08

mergify Bot commented Apr 21, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

github-actions Bot commented Apr 21, 2026

Link Check Results (DOCS)

All links are now valid - this issue has been resolved.


Marked as resolved: 2299ca5

github-actions Bot commented Apr 21, 2026

Link Check Results (REPO)

All links are now valid - this issue has been resolved.


Marked as resolved: 2299ca5

@shanjiaz shanjiaz marked this pull request as ready for review April 21, 2026 13:18

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

🧹 Nitpick comments (1)
scripts/data_generation_offline.py (1)

340-354: Minor: busy-wait on QueueFull could be replaced with cancellable queue.put.

The put_nowait + await asyncio.sleep(0.1) pattern adds up to ~100 ms of latency per backpressure hit and still requires an extra loop. A cleaner alternative is await asyncio.wait([queue.put(...), cancel_event.wait()], return_when=FIRST_COMPLETED) (cancelling the losing task). Not strictly necessary for correctness, but it removes the polling delay and simplifies the cancellation path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/data_generation_offline.py` around lines 340 - 354, The _feed_queue
function currently busy-waits on queue.put_nowait with a sleep loop; replace
that loop with a cancellable await pattern: create two awaitables —
queue.put({"idx": i, "input_ids": item["input_ids"]}) and cancel_event.wait() —
use asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED) to wait for whichever
completes, then if cancel_event won the race break, otherwise ensure you cancel
the pending cancel_event.wait() task (or the pending put task) to avoid leaks
and proceed; remove the asyncio.sleep polling and keep the outer for-loop and
cancel_event.is_set() checks intact so the behavior of _feed_queue, queue,
cancel_event, to_process and dataset is preserved.
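
A minimal sketch of the cancellable put suggested in this nitpick, assuming a standard `asyncio.Queue` and `asyncio.Event`. The helper name `put_or_cancel` is hypothetical and does not come from the script. One detail worth noting: `asyncio.wait` rejects bare coroutines on Python 3.11 and later, so both awaitables are wrapped in tasks before waiting.

```python
import asyncio


async def put_or_cancel(queue, item, cancel_event):
    """Enqueue `item`, or give up as soon as `cancel_event` is set.

    Returns True if the item was enqueued, False if cancellation won.
    """
    put_task = asyncio.create_task(queue.put(item))
    cancel_task = asyncio.create_task(cancel_event.wait())

    done, pending = await asyncio.wait(
        {put_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )

    # Cancel whichever task lost the race so nothing is left dangling.
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)

    return put_task in done
```

A feeder loop could call this once per sample and `break` when it returns False, which removes the `asyncio.sleep(0.1)` polling while keeping the same cancellation semantics.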
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/data_generation_offline.py`:
- Around line 110-119: The CLI flag --layer-ids (args.layer_ids) is parsed but
never used; either propagate it to the client/server request as the vLLM
target-layer parameter or remove the flag to avoid misleading users. Fix option
A: locate where the request payload is constructed (e.g., the function building
the inference/request JSON or the client request call) and add a field like
"target_layer_ids": args.layer_ids (or map to the server's --target-layer-ids
naming) so the server receives the override; ensure any async request builder or
send_request function accepts and forwards this value. Fix option B: delete the
parser.add_argument("--layer-ids", ...) entry and associated help text to remove
the unused flag. Also update any help text or docs and keep the symbol names
args.layer_ids and --target-layer-ids consistent.
- Around line 146-154: The --max-retries CLI option is ignored because
generate_hidden_states_async has no retry logic and the openai.AsyncOpenAI
client is created with max_retries=0; update the worker path to implement an
explicit retry loop that uses the parsed max_retries value: pass the CLI
max_retries into the worker invocation (and into any call sites of
generate_hidden_states_async), wrap the call to generate_hidden_states_async in
a retry loop that retries up to max_retries on transient failures, and ensure
retries interact with _FailureTracker and the --fail-on-error semantics so final
post-retry outcomes are what _FailureTracker records; alternatively, if you
prefer not to implement retries, remove the parser.add_argument("--max-retries",
...) to avoid misleading users.
- Around line 414-419: The ValueError message built when checking args.model vs
model_id is using an f-string only on the first literal and concatenating
adjacent string literals without spaces, so {model_id} is not interpolated and
words run together; update the ValueError in the model-check block (the
args.model/model_id comparison) to use a single f-string (or format call) that
includes {model_id} and proper spacing/punctuation so the actual model_id value
appears in the error message when raising ValueError.
- Around line 447-452: The summary log currently prints args.output which can be
None; change the logger call in the end of the processing block to use the
actual resolved hidden_states_dir variable (the directory created/returned by
generate_and_save_hidden_states) so the message reads "Saved X new data points
to <hidden_states_dir>" and similarly ensure any related warning/log about
skipped samples references hidden_states_dir when appropriate; locate the
logger.info call that uses args.output and replace it with hidden_states_dir
(and adjust scope if hidden_states_dir is returned/available in that function).
- Around line 317-323: Replace the hard process termination in the worker except
block with cooperative cancellation and exception propagation: on exception in
the worker (the except Exception as e block) call cancel_event.set(), log the
exception with logger.exception, record the exception in the shared
worker-exception container used by _shutdown_workers (or push it to a
thread-safe queue/list that _shutdown_workers checks), and then return/raise an
asyncio.CancelledError so the worker exits cleanly; rely on _shutdown_workers to
detect and re-raise the first non-cancellation exception and let main() (and
asyncio.run) perform the final sys.exit(1) and proper async/context cleanup
(this avoids calling os._exit(1) and ensures AsyncOpenAI context managers and
tqdm/atexit handlers run their teardown).
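
For the `--max-retries` finding above, an explicit retry loop could look roughly like the sketch below. This is an illustration under stated assumptions, not the script's code: `call_with_retries`, `make_call`, and `base_delay` are hypothetical names, and the exponential backoff is one possible policy. The exception is re-raised only after the last attempt, so a caller such as the failure tracker mentioned in the comment would record a single post-retry outcome.

```python
import asyncio


async def call_with_retries(make_call, max_retries, base_delay=0.5,
                            retry_on=(Exception,)):
    """Await make_call(), retrying up to `max_retries` extra times.

    `make_call` is a zero-argument callable returning a fresh awaitable,
    because a coroutine object cannot be awaited twice.
    """
    attempt = 0
    while True:
        try:
            return await make_call()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted: let the caller record the failure
            # Exponential backoff between attempts (an assumed policy).
            await asyncio.sleep(base_delay * 2**attempt)
            attempt += 1
```

With `max_retries=0` the helper makes exactly one attempt, which matches the current behaviour described in the comment.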

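The `ValueError` finding above comes down to how Python joins adjacent string literals: only the literals that carry the `f` prefix are interpolated. The snippet below reproduces the pitfall with made-up message text and a hypothetical `model_id` value; the real message in the script may be worded differently.

```python
model_id = "served-model"  # hypothetical value for illustration

# Only the first literal is an f-string, so the placeholder in the second
# literal stays literal, and the missing trailing space glues sentences together.
broken = (
    f"Model mismatch: the endpoint serves "
    "'{model_id}'."
    "Check the requested model."
)

# Every literal that contains a placeholder carries the f prefix,
# and the spacing is explicit.
fixed = (
    f"Model mismatch: the endpoint serves '{model_id}'. "
    "Check the requested model."
)

print(broken)
print(fixed)
```

Putting the whole message in a single f-string, as the review comment suggests, avoids both problems at once.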
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a40e290b-8621-4c71-bb80-3f7188755440

📥 Commits

Reviewing files that changed from the base of the PR and between 960263b and 4f15058.

📒 Files selected for processing (26)
  • .coderabbit.yaml
  • README.md
  • docs/examples/index.md
  • docs/index.md
  • docs/scripts/gen_files.py
  • examples/data_generation_and_training/README.md
  • examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
  • examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
  • examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
  • pyproject.toml
  • scripts/README.md
  • scripts/data_generation_offline.py
  • scripts/data_generation_offline2.py
  • scripts/gen_and_train.py
  • src/speculators/data_generation/config_generator.py
  • src/speculators/data_generation/custom_worker.py
  • src/speculators/data_generation/vllm_hidden_states_generator.py
  • tests/datagen/test_config_generator.py
  • tests/datagen/test_vllm_hidden_states.py
  • tests/e2e/regression/test_eagle3_offline_acceptance.py
  • tests/e2e/smoke/test_offline_training.py
  • tests/e2e/smoke/test_resume_optimizer.py
  • tests/e2e/utils.py
  • tests/integration/datagen/__init__.py
  • tests/integration/datagen/test_preprocessing.py
  • tests/integration/datagen/test_regex_patterns.py
💤 Files with no reviewable changes (15)
  • docs/examples/index.md
  • pyproject.toml
  • README.md
  • examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
  • src/speculators/data_generation/custom_worker.py
  • examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
  • examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
  • src/speculators/data_generation/vllm_hidden_states_generator.py
  • scripts/gen_and_train.py
  • scripts/data_generation_offline2.py
  • tests/datagen/test_config_generator.py
  • examples/data_generation_and_training/README.md
  • tests/datagen/test_vllm_hidden_states.py
  • scripts/README.md
  • src/speculators/data_generation/config_generator.py

Comment thread scripts/data_generation_offline.py

@rahul-tuli rahul-tuli left a comment

Collaborator

LGTM pending removal of vllm dependency from pyproject.toml

Comment thread pyproject.toml Outdated
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
@shanjiaz shanjiaz force-pushed the deprecate-old-datagen branch from 960d8a4 to 6c7c093 Compare April 21, 2026 17:01
@shanjiaz shanjiaz requested review from dsikka and rahul-tuli April 21, 2026 17:33
@shanjiaz shanjiaz enabled auto-merge (squash) April 21, 2026 20:01
@shanjiaz shanjiaz merged commit 8fdee2d into main Apr 21, 2026
14 of 15 checks passed
@shanjiaz shanjiaz deleted the deprecate-old-datagen branch April 21, 2026 20:04
shanjiaz added a commit that referenced this pull request Apr 30, 2026
<!-- markdownlint-disable -->

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT
THE BOTTOM) HAVE BEEN CONSIDERED.

## Purpose
Removed the old data generation system and all references. We should
wait the new examples + docs to land before deprecating the old system

<!--- Why your changes are needed -->

## Description
Blocked by the llm-compressor-testing
[PR](neuralmagic/llm-compressor-testing#261)
that removes the old datagen workflow.

- Removed old data generation system and related scripts/infrastructure.
- Moved preprocessing related tests to `integration` and removed old
data generation related tests.
- Removed examples that use the old e2e flow. 

<!--- High-level concise summary of changes -->

## Related Issue

<!--- Link related issue if applicable -->

## Tests
Nightly run with the new system:
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/24739445799/job/72374945272

<!--- Please describe in detail how you tested your changes. -->

I have filled in:

- [x] The purpose of the PR, such as "Fix some issue (link existing
issues this PR will resolve)".
- [x] The test plan/results, such as providing test command and pasting
the results.
- [ ] (Optional) The necessary documentation update.
- [x] I (a human) have written or reviewed the code in this pr to the
best of my ability.

---------

Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/speculators that referenced this pull request May 15, 2026
Labels

documentation Improvements or additions to documentation two-reviews

4 participants