Added Support for Vanilla and Quantized ChatGLM3 Models to Model Builder #921
base: main
Conversation
@kunal-vaishnavi PTAL
@microsoft-github-policy-service agree company="AMD"
@@ -3,6 +3,7 @@
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------
# Modifications Copyright(C) 2024 Advanced Micro Devices, Inc. All rights reserved
@snnn, could you please help check the copyright?
src/python/py/models/builder.py
Outdated
self.context_length = config.max_position_embeddings
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
    self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
Do we need to cover the case where config doesn't have seq_length?
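For example, a defensive variant (a sketch, not part of this PR) could fail with a clear message when neither attribute is present:

# Hypothetical guard; all currently supported configs define one of the two attributes.
if not hasattr(config, "max_position_embeddings") and not hasattr(config, "seq_length"):
    raise ValueError("Config must define either max_position_embeddings or seq_length.")
self.context_length = getattr(config, "max_position_embeddings", None) or config.seq_length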
I believe all the models previously supported by the builder include max_position_embeddings in their configuration. However, since chatglm3 does not have this parameter, we introduced seq_length to ensure the configuration remains compatible.
It appears that chatglm3 has the seq_length parameter here. Since most models use config.max_position_embeddings instead of config.seq_length, can we rewrite this to the following?
self.context_length = config.seq_length if hasattr(config, "seq_length") else config.max_position_embeddings
I agree. This is a better way to write this config. Incorporated in the refactor_context_length_config commit.
Thanks!
src/python/py/models/builder.py
Outdated
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
    self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
    self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
nit: replace "config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length" with self.context_length
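That is, the line would become something like:

self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else self.context_length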
Great suggestion. Incorporated in the refactor_config commit.
src/python/py/models/builder.py
Outdated
self.window_size = config.sliding_window if hasattr(config, "sliding_window") else -1 # default is -1 in GroupQueryAttention kernel
self.intermediate_size = config.intermediate_size
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
Suggested change:
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
self.intermediate_size = config.ffn_hidden_size if hasattr(config, "ffn_hidden_size") else config.intermediate_size
src/python/py/models/builder.py
Outdated
self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
Can we merge this line with the previous one to define self.num_kv_heads once?
We can do something like this.
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.multi_query_group_num if hasattr(config, "multi_query_group_num") else config.num_attention_heads
src/python/py/models/builder.py
Outdated
self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
This does not appear to be used. Can it be removed?
Yes, we can.
src/python/py/models/builder.py
Outdated
self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
self.multi_query_attention = config.multi_query_attention if hasattr(config, "multi_query_attention") else False
Multi-query attention can be identified with self.num_kv_heads == 1. Since it is only used once within the ChatGLMModel class and it is a small check, I don't think this is needed.
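Where the check is needed, an inline one-liner (illustrative) would suffice:

is_multi_query = self.num_kv_heads == 1 # MQA: all attention heads share a single KV head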
Since ChatGLMModel currently uses GQA with self.num_kv_heads = 2 here, I think we can use the following check to choose GQA or MHA. Please let me know if we should try something else.
self.attention_attrs["op_type"] = "GroupQueryAttention" if self.num_attn_heads > self.num_kv_heads else self.attention_attrs["op_type"]
There is a later section to handle the selection of the attention op type.
onnxruntime-genai/src/python/py/models/builder.py
Lines 229 to 247 in f768673
valid_gqa_configurations = [
    ("cpu", TensorProto.FLOAT),
    ("cuda", TensorProto.FLOAT16),
    ("dml", TensorProto.FLOAT16),
]
if (self.ep, self.io_dtype) in valid_gqa_configurations:
    # Change model settings for GroupQueryAttention
    self.attention_attrs["op_type"] = "GroupQueryAttention"
    print("GroupQueryAttention (GQA) is used in this model.")
    # DML doesn't support packed Q/K/V for GQA yet
    self.attention_attrs["use_packed_matmul"] = self.ep != "dml"
    # GQA + Rot.Emb. does not require `position ids` as input
    if self.ep != "dml":
        self.attention_attrs["use_rotemb_in_attn"] = True
        self.input_names.remove("position_ids")
self.past_present_share_buffer = self.attention_attrs["op_type"] == "GroupQueryAttention"
This section already handles choosing between GQA and MHA as the attention op. The set of valid GQA configurations covers most precision + execution provider combinations. Since GQA is a superset of MHA, we can use GQA whether self.num_attn_heads != self.num_kv_heads or self.num_attn_heads == self.num_kv_heads. Thus, I don't think we need to worry about selecting the attention op for ChatGLM.
src/python/py/models/builder.py
Outdated
@@ -1770,11 +1795,15 @@ def make_model(self, input_path):
        self.make_layer(self.layer_id, module)
        self.layer_id += 1

    elif module.__class__.__name__.endswith("GLMBlock") and self.layer_id < self.num_layers:
Can we merge this condition with the previous elif block and remove this case?
elif (module.__class__.__name__.endswith("DecoderLayer") or module.__class__.__name__.endswith("GLMBlock")) and self.layer_id < self.num_layers:
Great suggestion! Yes, we can.
elif self.layer_id == self.num_layers and self.has_final_norm(module, model):
    # SkipLayerNorm after last decoder layer (MatMul --> SkipLayerNorm)
    print("Reading final norm")
    self.make_layernorm(self.layer_id, module, skip=True, simple=self.layernorm_attrs["simple"], location="final_norm")

The extra blank line between the conditions helps make the different cases more readable. Can this line deletion be reverted?
src/python/py/models/builder.py
Outdated
def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
Can the changes in ChatGLMModel.has_final_norm be upstreamed to Model.has_final_norm? Then, this method doesn't have to be re-defined.
Should we change Model.has_final_norm like this?
def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
    if hasattr(model, "transformer"):
        hf_final_layernorm = hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
    else:
        hf_final_layernorm = hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
    # GGUF names
    gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
    return hf_norm or hf_final_layernorm or gguf_final_norm
I think you can just create another variable for ChatGLM's logic and add that variable to the return logic.
def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
    hf_final_layernorm = hasattr(model, "model") and hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
    hf_transformer_final_layernorm = hasattr(model, "transformer") and hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
    # GGUF names
    gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
    return hf_norm or hf_final_layernorm or hf_transformer_final_layernorm or gguf_final_norm
src/python/py/models/builder.py
Outdated
class ChatGLMModel(Model):
    def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
        super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
        # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
Suggested change (remove this line):
# self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
src/python/py/models/builder.py
Outdated
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
    super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
    # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
    self.layernorm_attrs["simple"] = True # RMSNorm in ChatGLM
The default is True so I think we can remove this.
onnxruntime-genai/src/python/py/models/builder.py
Lines 151 to 152 in f768673
self.layernorm_attrs = {
    "simple": True, # Use SimplifiedLayerNorm/SkipSimplifiedLayerNorm vs. LayerNorm/SkipLayerNorm
src/python/py/models/builder.py
Outdated
self.rotemb_attrs["partial_rotary_factor"] = 0.5 # Line 755 of modeling_chatglm.py check self.rotary_pos_emb declaration
self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
The defaults are the same as below so I think we can remove this.
onnxruntime-genai/src/python/py/models/builder.py
Lines 250 to 252 in f768673
self.mlp_attrs = {
    "use_proj": True, # Use projection style for MLP (GateProj/UpProj/DownProj)
    "use_fc": False, # Use fully-connected style for MLP (FC1/FC2)
src/python/py/models/builder.py
Outdated
else:
    raise NotImplementedError(f"The {hf_name} model is not currently supported.")
    raise NotImplementedError(
        f"The {hf_name} model is not currently supported. Got {config}"
Why add the config to the error?
Removed the config from the error. I think it was used for debugging internally.
self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
The combination of these attention_attrs settings will cause the FP32 CUDA model to be created incorrectly. For the FP32 CUDA model, the GroupQueryAttention op does not have an implementation. The fallback is to use the MultiHeadAttention op, which does not fuse rotary embeddings within it and does not handle the extra repeat_kv operation needed when num_attention_heads != num_key_value_heads. Therefore, the FP32 CUDA model would need the following settings.
use_rotemb_in_attn = False
use_packed_matmul = False
op_type = "MultiHeadAttention"
The subgraph for FP32 CUDA should look something like this.
# MultiHeadAttention example:
#
#                  root_input
#                 /    |     \
#         Q_MatMul  K_MatMul  V_MatMul   4D causal mask   past_key   past_value
#            |         |         |             |              |          |
#         Q_Rotary  K_Rotary     |             +------------+-----------+
#            |         |         |                          |
#            |      Repeat_K  Repeat_V                      |
#             \        |        /                           |
#              MultiHeadAttention---------------------------+
#                      |
#                   O_MatMul
#                      |
#                    O_Add
This relates to #880. For our use case, we would like GroupQueryAttention, and our custom EP would have an implementation for it. I'm OK with changing the default behavior to adhere to what is available in CPU/CUDA, as long as there is a way to override and generate the model with the selected ops.
@@ -1432,14 +1433,21 @@ def make_mlp(self, layer_id, mlp, root_input):
        raise NotImplementedError(f"The MLP layer type is not set.")

    def make_mlp_unpacked(self, layer_id, mlp, root_input):
        packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(
            mlp, "dense_h_to_4h", None
Can these definitions be made in one line?
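For example:

packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(mlp, "dense_h_to_4h", None)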
self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
The self.attention_attrs["use_rotemb_in_attn"] = True and self.attention_attrs["use_packed_matmul"] = True lines can be deleted since these will be automatically set when the attention op is set.
@@ -1469,8 +1477,11 @@ def make_mlp_proj(self, layer_id, mlp, root_input):
        self.make_mul(mul_name, mul_inputs, dtype=self.io_dtype, shape=["batch_size", "sequence_length", self.intermediate_size])

        # Make output MatMul node
        down_proj = getattr(mlp, "down_proj", None) or getattr(
            mlp, "dense_4h_to_h", None
Can this definition be made in one line?
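For example:

down_proj = getattr(mlp, "down_proj", None) or getattr(mlp, "dense_4h_to_h", None)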
@@ -94,31 +95,42 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
    layer_id = 0
    for weight_file in os.listdir(input_path):
        if weight_file.endswith(".safetensors"):
            module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
            module = self.layers.setdefault(
                layer_id, QuantizedDecoderLayer(layer_id, bits, group_size)
Can this definition be made in one line?
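For example, reverting to the original one-liner:

module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))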
module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
module = self.layers.setdefault(
    layer_id,
    QuantizedDecoderLayer(layer_id, bits, group_size),
Can this definition be made in one line?
    q_dim = q_size // (32 // bits) if quant_type in {"awq", "gptq"} else q_size
    kv_dim = kv_size // (32 // bits) if quant_type in {"awq", "gptq"} else kv_size
    module.self_attn.q_proj.qzeros = tensor[:, : q_dim]
    module.self_attn.k_proj.qzeros = tensor[:, q_dim : q_dim + kv_dim]
    module.self_attn.v_proj.qzeros = tensor[:, q_dim + kv_dim :]
elif bool(re.match(r"^model.layers\.\d+\.self_attn.qkv_proj\.g_idx$", name)):
elif bool(re.match(r"^model.layers\.\d+\.(self_attn.qkv_proj|self_attention.query_key_value)\.g_idx$", name)):
    # model.layers.layer_id.self_attn.qkv_proj.g_ix
Suggested change:
# model.layers.layer_id.self_attn.qkv_proj.g_ix
# model.layers.layer_id.self_attn.qkv_proj.g_idx
    # model.layers.layer_id.mlp.gate_up_proj.g_idx
    # model.layers.layer_id.mlp.dense_h_to_4h.g_idx
    module.mlp.gate_proj.g_idx = tensor
    module.mlp.up_proj.g_idx = tensor
else:
Should we add a condition for gate_up_proj.bias / dense_h_to_4h.bias?
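Perhaps something along these lines (a sketch; whether the packed bias splits at the intermediate_size argument from the signature above, like the packed weight does, is an assumption):

elif bool(re.match(r"^model.layers\.\d+\.(mlp.gate_up_proj|mlp.dense_h_to_4h)\.bias$", name)):
    # model.layers.layer_id.mlp.gate_up_proj.bias
    # model.layers.layer_id.mlp.dense_h_to_4h.bias
    module.mlp.gate_proj.bias = tensor[: intermediate_size] # assumption: gate half comes first
    module.mlp.up_proj.bias = tensor[intermediate_size :]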
We have integrated support for both vanilla and quantized versions of the ChatGLM3 models into the Model Builder, and we have performed parity checks for both model types to ensure consistency and reliability across configurations.
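For reference, a typical builder invocation used for such a check might look like the following (model id, output path, and precision are illustrative):

python3 -m onnxruntime_genai.models.builder -m THUDM/chatglm3-6b -o ./chatglm3_onnx -p fp16 -e cuda -c ./hf_cache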