Add vLLM-based runtime statistics for subblock latency measurement #1358
Merged
43 commits
816ddfa enabling runtime optimization (grzegorz-k-karch)
7aa5fe7 Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
3041dc2 done ruff formatting and docstrings (grzegorz-k-karch)
a363750 distributed timeout is configurable (grzegorz-k-karch)
8739fa0 Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
53a2caf added example config for attn pruning and runtime constraint (grzegorz-k-karch)
dfb905c renamed configs (grzegorz-k-karch)
e165171 working on readme (grzegorz-k-karch)
d47b69c working on refactoring (grzegorz-k-karch)
12ed46b working on fix (grzegorz-k-karch)
ab925b9 runtime accuracy improved (grzegorz-k-karch)
58f17e4 using vllm api instead of subprocess (grzegorz-k-karch)
e868303 working on review feedback (grzegorz-k-karch)
f7be643 removed unused batch_size; cleaned up config loading (grzegorz-k-karch)
8423676 Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
49235d1 cleanup based on pre-commit (grzegorz-k-karch)
781d44d added docstrings (grzegorz-k-karch)
a1901c7 updated readme (grzegorz-k-karch)
0b75502 further changes based on review (grzegorz-k-karch)
7e2f995 further changes based on review (grzegorz-k-karch)
2ca5306 removed synth_dataset_num_requests (grzegorz-k-karch)
ca21748 removed duplicate model saving (grzegorz-k-karch)
26ceb36 added test (grzegorz-k-karch)
4c5b133 suppressing bandit warnings B404 and B603; precedence found in repo (grzegorz-k-karch)
398808a removed gpu utilization param (grzegorz-k-karch)
e468f62 wip (grzegorz-k-karch)
34dbe52 removed redundant configs; guards for vllm results (grzegorz-k-karch)
24fa2d5 following annotation suggestion (grzegorz-k-karch)
4b824f1 updated readme (grzegorz-k-karch)
ae25ec7 moved stats utils from nas to puzzletron (grzegorz-k-karch)
c14cad0 Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
f34d3a3 responding to reviews (grzegorz-k-karch)
3332149 reenabled some vars (grzegorz-k-karch)
88e16d7 added support for batch_sizes (grzegorz-k-karch)
3f69e55 further fixes (grzegorz-k-karch)
7e48dbd Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
36f4685 using 5s latency target in the example (grzegorz-k-karch)
b1b810f added vllm adapter (grzegorz-k-karch)
354dd8d Merge branch 'main' into gkarch/runtime_opt (grzegorz-k-karch)
f49fbc9 disabled vllm tests that depends on anymodel (grzegorz-k-karch)
cebc4cd Merge branch 'main' into gkarch/runtime_opt (kevalmorabia97)
d6e1c6b Fix CI failures (kevalmorabia97)
105c736 Merge branch 'main' into gkarch/runtime_opt (kevalmorabia97)
103 changes: 103 additions & 0 deletions
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/Llama-3_1-8B.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| defaults: | ||
| - ../llama-3_1-8B_pruneffn_memory/pruning/ffn_pruning@pruning | ||
| - ../llama-3_1-8B_pruneffn_memory/validate_solutions_defaults@scoring | ||
| - ../llama-3_1-8B_pruneffn_memory/validate_solutions_defaults@realize_model | ||
| - bypass: | ||
| - override hydra/hydra_logging: disabled | ||
| - _self_ | ||
| | ||
| puzzle_dir: ??? | ||
| descriptor: llama | ||
| teacher_dir: ${puzzle_dir}/ckpts/teacher/ | ||
| replacement_library_path: ${puzzle_dir}/replacement_library.json | ||
| dataset_path: ??? # path to Nemotron-Post-Training-Dataset-v2 | ||
| | ||
| skip_realize_model: false | ||
| | ||
| build_replacement_library: | ||
| add_ffn_no_ops: true | ||
| add_attention_no_ops: true | ||
| | ||
| calc_subblock_stats: | ||
| batch_sizes: [1, 4] | ||
| prefill_seq_len: 1024 | ||
| generation_seq_len: 1024 | ||
| num_active_tokens_override: # Optional override for sequence lengths | ||
| prefill_queue_size: 0 | ||
| allocate_prefill_query: false | ||
| merge_with_existing_stats: false | ||
| subblock_stats_filename: "subblock_stats.json" | ||
| moe_stats_filename: "moe_stats.json" | ||
| | ||
| scoring: | ||
| descriptor: ${descriptor} | ||
| solutions_to_validate: | ||
| skip_existing_solutions: true | ||
| | ||
| replacement_library_path: ${replacement_library_path} | ||
| solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json} | ||
| teacher_dir: ${to_path:${teacher_dir}} | ||
| output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation | ||
| | ||
| eval_samples: 128 | ||
| micro_batch_size: 1 | ||
| seed: 42 | ||
| shuffle_seed: 444 | ||
| dataset_path: ${dataset_path} | ||
| | ||
| mip: | ||
| single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}} | ||
| subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}} | ||
| output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions} | ||
| gathered_metrics_path: | ||
| puzzle_profile: | ||
| | ||
| # puzzle_profile: | ||
| objective: metrics.cosine_embedding_loss_hidden_states | ||
| bigger_is_better: false | ||
| | ||
| subblock_stats_args: | ||
| - batch_size: 1 | ||
| weights_dtype: torch.bfloat16 | ||
| | ||
| report_additional_costs: | ||
| - stats.memory_mib | ||
| - stats.num_params | ||
| - stats.num_kv_heads | ||
| - stats.has_attention | ||
| - stats.has_ffn | ||
| - stats.kv_cache_memory_mib | ||
| - stats.attention_memory_mib | ||
| - stats.ffn_memory_mib | ||
| - stats.ffn_num_params | ||
| - stats.attention_num_params | ||
| | ||
| human_constraints: | ||
| target_latency_seconds: 5 | ||
| | ||
| mip_constraints: | ||
| metric_overrides: | ||
| max_seconds_per_solution: 60 | ||
| | ||
| realize_model: | ||
| descriptor: ${descriptor} | ||
| teacher_dir: ${to_path:${teacher_dir}} | ||
| tokenizer_name: ${to_path:${teacher_dir}} | ||
| replacement_library_path: ${replacement_library_path} | ||
| save_models: true | ||
| solutions_path: # Filled dynamically | ||
| | ||
| # Validate params | ||
| skip_validation: false # Set `skip_validation` to false to validate the model solution | ||
| eval_samples: 128 | ||
| micro_batch_size: 1 | ||
| seed: 42 | ||
| shuffle_seed: 444 | ||
| dataset_path: ${dataset_path} | ||
| | ||
| nccl_timeout_minutes: ${timedelta_minutes:120} | ||
| | ||
| # This section redirects Hydra outputs | ||
| hydra: | ||
| run: | ||
| dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S} |
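
The `hydra.run.dir` entry above relies on Hydra's `now` resolver, which formats the current time with a strftime pattern. As a rough illustration of the directory layout this produces, here is a minimal standard-library sketch. The helper name `hydra_run_dir` is hypothetical and exists only for this example; in practice Hydra resolves the path itself, and the sample `puzzle_dir` is taken from the second config file in this PR.

```python
from datetime import datetime
from pathlib import PurePosixPath


def hydra_run_dir(puzzle_dir: str, now: datetime) -> str:
    """Mimic ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S}.

    Hypothetical helper for illustration only; not part of Hydra or this PR.
    """
    date_part = now.strftime("%Y-%m-%d")  # same pattern as ${now:%Y-%m-%d}
    time_part = now.strftime("%H-%M-%S")  # same pattern as ${now:%H-%M-%S}
    return str(PurePosixPath(puzzle_dir) / "hydra_logs" / date_part / time_part)


# A fixed timestamp keeps the example deterministic.
example = hydra_run_dir("/workspace/puzzle_dir", datetime(2025, 1, 2, 3, 4, 5))
print(example)  # /workspace/puzzle_dir/hydra_logs/2025-01-02/03-04-05
```

Each run therefore writes its Hydra logs into a fresh date/time subdirectory under `puzzle_dir`.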
22 changes: 22 additions & 0 deletions
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneffn_runtime.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| defaults: | ||
| - Llama-3_1-8B | ||
| - _self_ | ||
| | ||
| # Input Hugging Face model to compress | ||
| input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct | ||
| | ||
| # Dataset path for pruning and NAS scoring | ||
| dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2 | ||
| | ||
| # Working directory for puzzletron outputs | ||
| puzzle_dir: /workspace/puzzle_dir | ||
| | ||
| calc_subblock_stats: | ||
| runtime_stats: | ||
| enabled: true | ||
| num_warmup_iters: 2 | ||
| num_iters: 10 | ||
| | ||
| # FFN intermediate sizes to search over (heterogeneous architecture) | ||
| pruning: | ||
| intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336 |
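
The `runtime_stats` block above sets `num_warmup_iters: 2` and `num_iters: 10`, which suggests a warmup-then-measure loop. The sketch below shows that general pattern in plain Python, assuming the two parameters mean "untimed warmup calls" and "timed calls". The function `measure_latency` and the toy workload are hypothetical; this is not the PR's vLLM-based implementation, which is not shown on this page.

```python
import statistics
import time
from typing import Callable


def measure_latency(
    fn: Callable[[], object],
    num_warmup_iters: int = 2,
    num_iters: int = 10,
) -> dict:
    """Time `fn` after discarding warmup calls (hypothetical helper)."""
    # Warmup calls absorb one-time costs (caches, lazy initialization)
    # and are deliberately not timed.
    for _ in range(num_warmup_iters):
        fn()

    samples = []
    for _ in range(num_iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "num_samples": len(samples),
    }


# Toy workload standing in for a subblock forward pass.
calls = []
stats = measure_latency(lambda: calls.append(sum(range(10_000))))
print(stats["num_samples"], len(calls))
```

With the defaults, the workload runs 12 times in total, but only the last 10 calls contribute to the reported statistics.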