Support bucket_internal for MPT #1137
Conversation
imangohari1
left a comment
Looks good so far.
I think there should be a change.
I tested this with mpt-30b as well (below) and it seems to be running fine.

With --bucket_internal:
python run_generation.py --model_name_or_path mosaicml/mpt-30b --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 128 --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --batch_size 32 --bucket_size 128 --bucket_internal
Result: 715.6375115391292 tokens/second

Without --bucket_internal:
python run_generation.py --model_name_or_path mosaicml/mpt-30b --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens 128 --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --batch_size 32 --bucket_size 128
Result: 471.0239768753105 tokens/second
    input_ids = torch.index_select(input_ids, 1, token_idx - 1)
    # Converting back to tuples as it should be, so there's no type mismatch when calling graph
    past_key_values = tuple([tuple(kv) for kv in past_key_values])
elif bucket_internal and token_idx is not None:
I've compared this to llama/qwen2 (links below) and I think this line should be:

- elif bucket_internal and token_idx is not None:
+ elif (reuse_cache or bucket_internal) and token_idx is not None:
https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/qwen2/modeling_qwen2.py#L890C14-L890C72
https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/models/llama/modeling_llama.py#L1117C9-L1117C73
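For readers without the linked files open, here is a minimal standalone sketch of the pattern llama/qwen2 use in prepare_inputs_for_generation. It is a paraphrase: the function name and signature below are hypothetical, not the optimum-habana API and not the MPT diff in this PR.

```python
import torch


def prepare_first_pass_inputs(input_ids, attention_mask, token_idx,
                              reuse_cache=False, bucket_internal=False):
    # Hypothetical standalone version of the branch under discussion: when the
    # KV cache is pre-allocated (reuse_cache) or padded per bucket
    # (bucket_internal), the padded prompt is only meaningful up to token_idx,
    # so the first forward pass can run on the sliced tensors.
    if (reuse_cache or bucket_internal) and token_idx is not None:
        input_ids = input_ids[:, :token_idx]
        attention_mask = attention_mask[:, :token_idx]
    return input_ids, attention_mask


# Example: a prompt padded to 256 positions, with 128 real tokens.
ids = torch.randint(0, 100, (4, 256))
mask = torch.ones_like(ids)
ids, mask = prepare_first_pass_inputs(ids, mask, token_idx=128, bucket_internal=True)
assert ids.shape == (4, 128)
```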
@imangohari1, thanks for the review!
There's no reuse_cache support for MPT yet. That's why I didn't add it to the condition.
Historical context: reuse_cache came first, and then this change: #1028.
PR #1028 removes the need for reuse_cache, so for new model optimizations I think it is fine to only make changes in line with PR #1028 and leave out reuse_cache-related changes.
Only in older models, where we already had reuse_cache code, do we accommodate both.
imangohari1
left a comment
LGTM!
@regisss
Could you take a final look here?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Adds support for --bucket_internal for the MPT model.
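As a rough illustration of what internal bucketing buys us (a conceptual sketch only; the helper name and the [batch, heads, seq, dim] cache layout below are assumptions, not the optimum-habana implementation): during decode, attention only needs to see the KV cache up to the next bucket boundary, so the number of distinct shapes the HPU graphs encounter is bounded by max_length / bucket_size instead of growing with every generated token.

```python
import math
import torch


def slice_kv_to_bucket(key_cache: torch.Tensor, value_cache: torch.Tensor,
                       token_idx: int, bucket_size: int):
    """Hypothetical helper (not an optimum-habana API): restrict attention to the
    current bucket of a pre-allocated KV cache laid out as [batch, heads, seq, dim]."""
    # Round the current decode position up to the nearest bucket boundary,
    # so the sliced cache takes one of only max_len // bucket_size shapes.
    bucket_end = math.ceil(token_idx / bucket_size) * bucket_size
    return key_cache[..., :bucket_end, :], value_cache[..., :bucket_end, :]


# Example: a cache pre-allocated for 512 tokens, decoding at position 130 with
# bucket_size=128, only exposes the first 256 positions to attention.
k = torch.zeros(1, 8, 512, 64)
v = torch.zeros(1, 8, 512, 64)
k_b, v_b = slice_kv_to_bucket(k, v, token_idx=130, bucket_size=128)
assert k_b.shape[2] == 256
```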
Measurements: (throughput comparison table with and without --bucket_internal)

Command line used:
python run_generation.py --model_name_or_path mosaicml/mpt-7b --use_hpu_graphs --use_kv_cache --max_input_tokens 128 --max_new_tokens <num> --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --batch_size=<num> --bucket_size 128 --bucket_internal

Before submitting