fully deprecate old data generation system #433
Walkthrough
Remove legacy data generation infrastructure including vLLM in-process generators, end-to-end orchestrator wrapper, example training scripts, and related configuration/testing code.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
The quality checks have failed. Please run |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Require two reviews
Wonderful, this rule succeeded.
PRs labelled "two-reviews" must have at least two approving reviews before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
eeae10a to 8b6f649
All links are now valid - this issue has been resolved. Marked as resolved: 2299ca5 |
Actionable comments posted: 5
🧹 Nitpick comments (1)
scripts/data_generation_offline.py (1)
340-354: Minor: busy-wait on `QueueFull` could be replaced with cancellable `queue.put`.
The `put_nowait` + `await asyncio.sleep(0.1)` pattern adds up to ~100 ms of latency per backpressure hit and still requires an extra loop. A cleaner alternative is `await asyncio.wait([queue.put(...), cancel_event.wait()], return_when=FIRST_COMPLETED)` (cancelling the losing task). Not strictly necessary for correctness, but it removes the polling delay and simplifies the cancellation path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/data_generation_offline.py` around lines 340 - 354, The _feed_queue function currently busy-waits on queue.put_nowait with a sleep loop; replace that loop with a cancellable await pattern: create two awaitables — queue.put({"idx": i, "input_ids": item["input_ids"]}) and cancel_event.wait() — use asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED) to wait for whichever completes, then if cancel_event won the race break, otherwise ensure you cancel the pending cancel_event.wait() task (or the pending put task) to avoid leaks and proceed; remove the asyncio.sleep polling and keep the outer for-loop and cancel_event.is_set() checks intact so the behavior of _feed_queue, queue, cancel_event, to_process and dataset is preserved.
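The cancellable-put pattern described in this nitpick could look roughly like the sketch below. It is not the script's actual code: `put_or_cancel` and `feed_queue` are hypothetical helpers, and the queue payload is reduced to a plain item. One detail worth noting is that `asyncio.wait` rejects bare coroutines on Python 3.11 and later, so both awaitables are wrapped in tasks first.

```python
import asyncio


async def put_or_cancel(queue: asyncio.Queue, item, cancel_event: asyncio.Event) -> bool:
    """Enqueue `item` unless `cancel_event` fires first.

    Returns True if the item was enqueued, False if cancellation won the race.
    """
    # asyncio.wait() needs tasks/futures (bare coroutines fail on Python 3.11+).
    put_task = asyncio.ensure_future(queue.put(item))
    cancel_task = asyncio.ensure_future(cancel_event.wait())
    done, pending = await asyncio.wait(
        {put_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel whichever task lost so nothing is left dangling.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return put_task in done


async def feed_queue(queue: asyncio.Queue, items, cancel_event: asyncio.Event) -> int:
    """Feed items into the queue without polling; returns how many were enqueued."""
    fed = 0
    for item in items:
        if cancel_event.is_set():
            break
        if not await put_or_cancel(queue, item, cancel_event):
            break
        fed += 1
    return fed
```

Compared with `put_nowait` plus `asyncio.sleep(0.1)`, the feeder wakes up as soon as either a slot frees up or cancellation is requested, at the cost of creating two short-lived tasks per item.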
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/data_generation_offline.py`:
- Around line 110-119: The CLI flag --layer-ids (args.layer_ids) is parsed but
never used; either propagate it to the client/server request as the vLLM
target-layer parameter or remove the flag to avoid misleading users. Fix option
A: locate where the request payload is constructed (e.g., the function building
the inference/request JSON or the client request call) and add a field like
"target_layer_ids": args.layer_ids (or map to the server's --target-layer-ids
naming) so the server receives the override; ensure any async request builder or
send_request function accepts and forwards this value. Fix option B: delete the
parser.add_argument("--layer-ids", ...) entry and associated help text to remove
the unused flag. Also update any help text or docs and keep the symbol names
args.layer_ids and --target-layer-ids consistent.
- Around line 146-154: The --max-retries CLI option is ignored because
generate_hidden_states_async has no retry logic and the openai.AsyncOpenAI
client is created with max_retries=0; update the worker path to implement an
explicit retry loop that uses the parsed max_retries value: pass the CLI
max_retries into the worker invocation (and into any call sites of
generate_hidden_states_async), wrap the call to generate_hidden_states_async in
a retry loop that retries up to max_retries on transient failures, and ensure
retries interact with _FailureTracker and the --fail-on-error semantics so final
post-retry outcomes are what _FailureTracker records; alternatively, if you
prefer not to implement retries, remove the parser.add_argument("--max-retries",
...) to avoid misleading users.
- Around line 414-419: The ValueError message built when checking args.model vs
model_id is using an f-string only on the first literal and concatenating
adjacent string literals without spaces, so {model_id} is not interpolated and
words run together; update the ValueError in the model-check block (the
args.model/model_id comparison) to use a single f-string (or format call) that
includes {model_id} and proper spacing/punctuation so the actual model_id value
appears in the error message when raising ValueError.
- Around line 447-452: The summary log currently prints args.output which can be
None; change the logger call in the end of the processing block to use the
actual resolved hidden_states_dir variable (the directory created/returned by
generate_and_save_hidden_states) so the message reads "Saved X new data points
to <hidden_states_dir>" and similarly ensure any related warning/log about
skipped samples references hidden_states_dir when appropriate; locate the
logger.info call that uses args.output and replace it with hidden_states_dir
(and adjust scope if hidden_states_dir is returned/available in that function).
- Around line 317-323: Replace the hard process termination in the worker except
block with cooperative cancellation and exception propagation: on exception in
the worker (the except Exception as e block) call cancel_event.set(), log the
exception with logger.exception, record the exception in the shared
worker-exception container used by _shutdown_workers (or push it to a
thread-safe queue/list that _shutdown_workers checks), and then return/raise an
asyncio.CancelledError so the worker exits cleanly; rely on _shutdown_workers to
detect and re-raise the first non-cancellation exception and let main() (and
asyncio.run) perform the final sys.exit(1) and proper async/context cleanup
(this avoids calling os._exit(1) and ensures AsyncOpenAI context managers and
tqdm/atexit handlers run their teardown).
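As a rough illustration of the cooperative shutdown described in the last comment, the sketch below has a failing worker set a shared event and record its exception instead of terminating the process. All names here (`worker`, `run_workers`, `handle`, `errors`) are hypothetical, the queue is pre-filled to keep the example short, and the worker simply returns rather than raising `asyncio.CancelledError`.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def worker(queue: asyncio.Queue, handle, cancel_event: asyncio.Event, errors: list) -> None:
    """Process items until the queue is empty or another worker has failed."""
    while not cancel_event.is_set():
        try:
            item = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        try:
            await handle(item)
        except Exception as exc:
            # Record the failure and ask the other workers to stop,
            # instead of calling os._exit(1) from inside the worker.
            logger.exception("worker failed on item %r", item)
            errors.append(exc)
            cancel_event.set()
            return
        finally:
            queue.task_done()


async def run_workers(items, handle, num_workers: int = 2) -> None:
    """Run workers to completion and re-raise the first recorded exception."""
    queue: asyncio.Queue = asyncio.Queue()
    for item in items:
        queue.put_nowait(item)
    cancel_event = asyncio.Event()
    errors: list = []
    await asyncio.gather(
        *(worker(queue, handle, cancel_event, errors) for _ in range(num_workers))
    )
    if errors:
        raise errors[0]
```

With this shape the caller of `asyncio.run(run_workers(...))` decides the exit code, so context managers and `atexit` handlers still get a chance to run.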
---
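For the `--max-retries` comment above, an explicit retry loop might be sketched as follows. This is an assumption-laden illustration, not the script's code: `TransientError` stands in for whatever failures are considered retryable, `make_request` is a hypothetical zero-argument coroutine function, and the exponential backoff is one possible policy.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, 5xx, ...)."""


async def call_with_retries(make_request, max_retries: int, base_delay: float = 0.5):
    """Await make_request(), retrying up to `max_retries` times on TransientError.

    The last error is re-raised once retries are exhausted, so a caller only
    ever sees (and records) the final, post-retry outcome.
    """
    attempt = 0
    while True:
        try:
            return await make_request()
        except TransientError:
            if attempt >= max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2**attempt)
            attempt += 1
```

A worker would wrap its request in `call_with_retries` and report a failure to its tracker only if the wrapper itself raises.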
Nitpick comments:
In `@scripts/data_generation_offline.py`:
- Around line 340-354: The _feed_queue function currently busy-waits on
queue.put_nowait with a sleep loop; replace that loop with a cancellable await
pattern: create two awaitables — queue.put({"idx": i, "input_ids":
item["input_ids"]}) and cancel_event.wait() — use asyncio.wait(...,
return_when=asyncio.FIRST_COMPLETED) to wait for whichever completes, then if
cancel_event won the race break, otherwise ensure you cancel the pending
cancel_event.wait() task (or the pending put task) to avoid leaks and proceed;
remove the asyncio.sleep polling and keep the outer for-loop and
cancel_event.is_set() checks intact so the behavior of _feed_queue, queue,
cancel_event, to_process and dataset is preserved.
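The f-string issue raised in the comment on lines 414-419 can be reproduced in isolation. The message text and variable values below are made up for illustration; only the Python behaviour is the point: with implicit concatenation of adjacent literals, each literal needs its own `f` prefix, and spacing between fragments has to be written explicitly.

```python
model_id = "served-model"
requested = "other-model"

# Only the first literal is an f-string, and there is no space before "the",
# so the placeholder stays literal and the words run together.
broken = (
    f"Requested model {requested} does not match"
    "the served model {model_id}."
)

# Prefix every fragment that contains a placeholder and add the missing space.
fixed = (
    f"Requested model {requested} does not match "
    f"the served model {model_id}."
)

print(broken)
print(fixed)
```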
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a40e290b-8621-4c71-bb80-3f7188755440
📒 Files selected for processing (26)
- .coderabbit.yaml
- README.md
- docs/examples/index.md
- docs/index.md
- docs/scripts/gen_files.py
- examples/data_generation_and_training/README.md
- examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
- examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
- examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
- pyproject.toml
- scripts/README.md
- scripts/data_generation_offline.py
- scripts/data_generation_offline2.py
- scripts/gen_and_train.py
- src/speculators/data_generation/config_generator.py
- src/speculators/data_generation/custom_worker.py
- src/speculators/data_generation/vllm_hidden_states_generator.py
- tests/datagen/test_config_generator.py
- tests/datagen/test_vllm_hidden_states.py
- tests/e2e/regression/test_eagle3_offline_acceptance.py
- tests/e2e/smoke/test_offline_training.py
- tests/e2e/smoke/test_resume_optimizer.py
- tests/e2e/utils.py
- tests/integration/datagen/__init__.py
- tests/integration/datagen/test_preprocessing.py
- tests/integration/datagen/test_regex_patterns.py
💤 Files with no reviewable changes (15)
- docs/examples/index.md
- pyproject.toml
- README.md
- examples/data_generation_and_training/llama3_8b_sharegpt_5k.py
- src/speculators/data_generation/custom_worker.py
- examples/data_generation_and_training/qwen3_8b_sharegpt_ultrachat.py
- examples/data_generation_and_training/gpt_oss_20b_ultrachat_5k.py
- src/speculators/data_generation/vllm_hidden_states_generator.py
- scripts/gen_and_train.py
- scripts/data_generation_offline2.py
- tests/datagen/test_config_generator.py
- examples/data_generation_and_training/README.md
- tests/datagen/test_vllm_hidden_states.py
- scripts/README.md
- src/speculators/data_generation/config_generator.py
rahul-tuli left a comment
LGTM pending removal of vllm dependency from pyproject.toml
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
960d8a4 to 6c7c093
## Purpose
Removed the old data generation system and all references. We should wait for the new examples + docs to land before deprecating the old system.

## Description
Blocked by the llm-compressor-testing [PR](neuralmagic/llm-compressor-testing#261) that removes the old datagen workflow.
- Removed old data generation system and related scripts/infrastructure.
- Moved preprocessing-related tests to `integration` and removed old data generation related tests.
- Removed examples that use the old e2e flow.

## Related Issue

## Tests
Nightly run with the new system: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/24739445799/job/72374945272

I have filled in:
- [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [x] The test plan/results, such as providing test command and pasting the results.
- [ ] (Optional) The necessary documentation update.
- [x] I (a human) have written or reviewed the code in this PR to the best of my ability.

---------
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>