Added Support for Vanilla and Quantized ChatGLM3 Models to Model Builder #921

Open
wants to merge 17 commits into main

Conversation

amd-sudo-sh

We have integrated support for both vanilla and quantized ChatGLM3 models into the model builder. We have also run parity checks for both model types to confirm that the exported models produce consistent outputs across configurations.
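
For reference, a typical model builder invocation for the vanilla FP16 CUDA variant might look like the following (the model id, output path, and precision are illustrative, not part of this PR):

python3 -m onnxruntime_genai.models.builder -m THUDM/chatglm3-6b -o ./chatglm3-onnx -p fp16 -e cuda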

@BowenBao
Contributor

@kunal-vaishnavi PTAL

@amd-sudo-sh
Author

@microsoft-github-policy-service agree company="AMD"

@@ -3,6 +3,7 @@
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------
# Modifications Copyright(C) 2024 Advanced Micro Devices, Inc. All rights reserved
Member

@snnn, could you please help check the copyright?

self.context_length = config.max_position_embeddings
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
Member

Do we need to cover the case where the config doesn't have seq_length?

Author

I believe all the models previously supported by the builder include max_position_embeddings in their configuration. Since chatglm3 does not have this parameter, we added a fallback to its seq_length so the configuration remains compatible.

Contributor

It appears that chatglm3 has the seq_length parameter here. Since most models use config.max_position_embeddings instead of config.seq_length, can we rewrite this to the following?

self.context_length = config.seq_length if hasattr(config, "seq_length") else config.max_position_embeddings

Author

I agree, this is a cleaner way to write the config. Incorporated in the refactor_context_length_config commit.
Thanks!

self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
Member

nit: replace "config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length" with self.context_length
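
For reference, applying this nit the line would read roughly:

self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else self.context_length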

Author

Great suggestion. Incorporated in the refactor_config commit.

self.window_size = config.sliding_window if hasattr(config, "sliding_window") else -1 # default is -1 in GroupQueryAttention kernel
self.intermediate_size = config.intermediate_size
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
Contributor

Suggested change
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
self.intermediate_size = config.ffn_hidden_size if hasattr(config, "ffn_hidden_size") else config.intermediate_size

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
Contributor

Can we merge this line with the previous one to define self.num_kv_heads once?

Author

We can do something like this.

self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.multi_query_group_num if hasattr(config, "multi_query_group_num") else config.num_attention_heads
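
An equivalent sketch using getattr, which may read a bit more cleanly (illustrative only; it relies on the head counts never being zero, since or skips falsy values):

self.num_kv_heads = getattr(config, "num_key_value_heads", None) or getattr(config, "multi_query_group_num", None) or config.num_attention_heads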

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
Contributor

This does not appear to be used. Can it be removed?

Author

Yes, we can.

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
self.multi_query_attention = config.multi_query_attention if hasattr(config, "multi_query_attention") else False
Contributor

Multi-query attention can be identified with self.num_kv_heads == 1. Since it is only used once within the ChatGLMModel class and it is a small check, I don't think this is needed.

Author

Since ChatGLMModel currently uses GQA with self.num_kv_heads = 2 here, I think we can use the following check to choose between GQA and MHA. Please let me know if we should try something else.

self.attention_attrs["op_type"] = "GroupQueryAttention" if self.num_attn_heads > self.num_kv_heads else self.attention_attrs["op_type"]

Contributor

There is a later section to handle the selection of the attention op type.

valid_gqa_configurations = [
    ("cpu", TensorProto.FLOAT),
    ("cuda", TensorProto.FLOAT16),
    ("dml", TensorProto.FLOAT16),
]
if (self.ep, self.io_dtype) in valid_gqa_configurations:
    # Change model settings for GroupQueryAttention
    self.attention_attrs["op_type"] = "GroupQueryAttention"
    print("GroupQueryAttention (GQA) is used in this model.")
    # DML doesn't support packed Q/K/V for GQA yet
    self.attention_attrs["use_packed_matmul"] = self.ep != "dml"
    # GQA + Rot.Emb. does not require `position ids` as input
    if self.ep != "dml":
        self.attention_attrs["use_rotemb_in_attn"] = True
        self.input_names.remove("position_ids")
self.past_present_share_buffer = self.attention_attrs["op_type"] == "GroupQueryAttention"

This section already handles choosing between GQA and MHA as the attention op. The set of valid GQA configurations covers most precision + execution provider combinations. Since GQA is a superset of MHA, GQA can be used both when self.num_attn_heads != self.num_kv_heads and when self.num_attn_heads == self.num_kv_heads. Thus, I don't think we need to worry about selecting the attention op for ChatGLM.

@@ -1770,11 +1795,15 @@ def make_model(self, input_path):
self.make_layer(self.layer_id, module)
self.layer_id += 1

elif module.__class__.__name__.endswith("GLMBlock") and self.layer_id < self.num_layers:
Contributor

Can we merge this condition with the previous elif block and remove this case?

elif (module.__class__.__name__.endswith("DecoderLayer") or module.__class__.__name__.endswith("GLMBlock")) and self.layer_id < self.num_layers:

Author

Great suggestion! Yes we can.

elif self.layer_id == self.num_layers and self.has_final_norm(module, model):
    # SkipLayerNorm after last decoder layer (MatMul --> SkipLayerNorm)
    print("Reading final norm")
    self.make_layernorm(self.layer_id, module, skip=True, simple=self.layernorm_attrs["simple"], location="final_norm")

Contributor

The extra blank line between the conditions helps make the different cases more readable. Can this line deletion be reverted?


def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
Contributor

Can the changes in ChatGLMModel.has_final_norm be upstreamed to Model.has_final_norm? Then, this method doesn't have to be re-defined.

Author

Should we change Model.has_final_norm like this?

def has_final_norm(self, module, model):
        # Hugging Face names
        hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
        
        if(hasattr(model, "transformer")):
            hf_final_layernorm = hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
        else:
            hf_final_layernorm = hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
        
        # GGUF names
        gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
        return hf_norm or hf_final_layernorm or gguf_final_norm

Contributor

I think you can just create another variable for ChatGLM's logic and add that variable to the return logic.

def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
    hf_final_layernorm = hasattr(model, "model") and hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
    hf_transformer_final_layernorm = hasattr(model, "transformer") and hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
    # GGUF names
    gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
    return hf_norm or hf_final_layernorm or hf_transformer_final_layernorm or gguf_final_norm

class ChatGLMModel(Model):
    def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
        super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
        # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
Contributor

Suggested change
# self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor

def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
    super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
    # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
    self.layernorm_attrs["simple"] = True # RMSNorm in ChatGLM
Contributor

The default is True so I think we can remove this.

self.layernorm_attrs = {
    "simple": True, # Use SimplifiedLayerNorm/SkipSimplifiedLayerNorm vs. LayerNorm/SkipLayerNorm

self.rotemb_attrs["partial_rotary_factor"] = 0.5 # Line 755 of modeling_chatglm.py check self.rotary_pos_emb declaration
self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
Contributor

The defaults are the same as below so I think we can remove this.

self.mlp_attrs = {
    "use_proj": True, # Use projection style for MLP (GateProj/UpProj/DownProj)
    "use_fc": False, # Use fully-connected style for MLP (FC1/FC2)

else:
raise NotImplementedError(f"The {hf_name} model is not currently supported.")
raise NotImplementedError(
f"The {hf_name} model is not currently supported. Got {config}"
Contributor

Why add the config to the error?

Author

Removed the config from the error. I think it was used for debugging internally.

self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
Contributor

The combination of these attention_attrs settings will cause the FP32 CUDA model to be created incorrectly. For the FP32 CUDA model, the GroupQueryAttention op does not have an implementation. The fallback is to use the MultiHeadAttention op, which does not fuse rotary embeddings within it and does not handle the extra repeat_kv operation needed when num_attention_heads != num_key_value_heads. Therefore, the FP32 CUDA model would need the following settings.

  • use_rotemb_in_attn = False
  • use_packed_matmul = False
  • op_type = "MultiHeadAttention"

The subgraph for FP32 CUDA should look something like this.

# MultiHeadAttention example:
#
#               root_input
#              /     |     \
#       Q_MatMul  K_MatMul  V_MatMul  4D causal mask  past_key  past_value
#           |        |         |            |            |           |
#       Q_Rotary  K_Rotary     |            +------------+-----------+
#           |        |         |                         |
#           |     Repeat_K  Repeat_V                     |
#           \        |        /                          |
#            MultiHeadAttention--------------------------+
#                    |
#                 O_MatMul
#                    |
#                  O_Add
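
A minimal sketch of how those settings might be applied for the FP32 CUDA case, assuming the builder's existing self.ep and self.io_dtype fields (illustrative only, not the exact change requested):

if self.ep == "cuda" and self.io_dtype == TensorProto.FLOAT:
    # GroupQueryAttention has no FP32 CUDA kernel, so fall back to MultiHeadAttention
    self.attention_attrs["op_type"] = "MultiHeadAttention"
    # MHA does not fuse rotary embeddings or packed Q/K/V, so disable both
    self.attention_attrs["use_rotemb_in_attn"] = False
    self.attention_attrs["use_packed_matmul"] = False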

Contributor

This relates to #880. For our use case, we would like GroupQueryAttention, and our custom EP would have an implementation for it. I'm OK with changing the default behavior to adhere to what is available on CPU/CUDA, as long as there is a way to override and generate the model with selected ops.

@@ -1432,14 +1433,21 @@ def make_mlp(self, layer_id, mlp, root_input):
raise NotImplementedError(f"The MLP layer type is not set.")

def make_mlp_unpacked(self, layer_id, mlp, root_input):
packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(
mlp, "dense_h_to_4h", None
Contributor

Can these definitions be made in one line?
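
For example, the same expression on a single line:

packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(mlp, "dense_h_to_4h", None)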

self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
Contributor

The self.attention_attrs["use_rotemb_in_attn"] = True and self.attention_attrs["use_packed_matmul"] = True lines can be deleted since these will be automatically set when the attention op is set.

@@ -1469,8 +1477,11 @@ def make_mlp_proj(self, layer_id, mlp, root_input):
self.make_mul(mul_name, mul_inputs, dtype=self.io_dtype, shape=["batch_size", "sequence_length", self.intermediate_size])

# Make output MatMul node
down_proj = getattr(mlp, "down_proj", None) or getattr(
mlp, "dense_4h_to_h", None
Contributor

Can this definition be made in one line?
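
For example:

down_proj = getattr(mlp, "down_proj", None) or getattr(mlp, "dense_4h_to_h", None)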

@@ -94,31 +95,42 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
layer_id = 0
for weight_file in os.listdir(input_path):
if weight_file.endswith(".safetensors"):
module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
module = self.layers.setdefault(
layer_id, QuantizedDecoderLayer(layer_id, bits, group_size)
Contributor

Can this definition be made in one line?

module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
module = self.layers.setdefault(
layer_id,
QuantizedDecoderLayer(layer_id, bits, group_size),
Contributor

Can this definition be made in one line?

q_dim = q_size // (32 // bits) if quant_type in {"awq", "gptq"} else q_size
kv_dim = kv_size // (32 // bits) if quant_type in {"awq", "gptq"} else kv_size
module.self_attn.q_proj.qzeros = tensor[:, : q_dim]
module.self_attn.k_proj.qzeros = tensor[:, q_dim : q_dim + kv_dim]
module.self_attn.v_proj.qzeros = tensor[:, q_dim + kv_dim :]
elif bool(re.match(r"^model.layers\.\d+\.self_attn.qkv_proj\.g_idx$", name)):
elif bool(re.match(r"^model.layers\.\d+\.(self_attn.qkv_proj|self_attention.query_key_value)\.g_idx$", name)):
# model.layers.layer_id.self_attn.qkv_proj.g_ix
Contributor

Suggested change
# model.layers.layer_id.self_attn.qkv_proj.g_ix
# model.layers.layer_id.self_attn.qkv_proj.g_idx

# model.layers.layer_id.mlp.gate_up_proj.g_idx
# model.layers.layer_id.mlp.dense_h_to_4h.g_idx
module.mlp.gate_proj.g_idx = tensor
module.mlp.up_proj.g_idx = tensor
else:
Contributor

Should we add a condition for gate_up_proj.bias/dense_h_to_4h.bias?
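
A hypothetical sketch of such a branch, following the pattern of the surrounding elif cases (the split point intermediate_size is an assumed variable, and the half-and-half [gate | up] layout of the packed bias is an assumption, not something shown in this diff):

elif bool(re.match(r"^model.layers\.\d+\.(mlp.gate_up_proj|mlp.dense_h_to_4h)\.bias$", name)):
    # model.layers.layer_id.mlp.gate_up_proj.bias
    # model.layers.layer_id.mlp.dense_h_to_4h.bias
    module.mlp.gate_proj.bias = tensor[: intermediate_size]  # assumed: packed bias is [gate | up]
    module.mlp.up_proj.bias = tensor[intermediate_size :]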
