Added Support for Vanilla and Quantized ChatGLM3 Models to Model Builder #921

Open
wants to merge 17 commits into main

Conversation

amd-sudo-sh

We have integrated support for both vanilla and quantized ChatGLM3 models into the model builder. We have also run parity checks for both model types to confirm that the exported models produce consistent outputs across configurations.
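
For reference, a typical model builder invocation for the vanilla FP16 CUDA variant might look like the following (the model id, output path, and precision are illustrative, not part of this PR):

python3 -m onnxruntime_genai.models.builder -m THUDM/chatglm3-6b -o ./chatglm3-onnx -p fp16 -e cuda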

@BowenBao
Contributor

@kunal-vaishnavi PTAL

@amd-sudo-sh
Author

@microsoft-github-policy-service agree company="AMD"

@@ -3,6 +3,7 @@
# Licensed under the MIT License. See License.txt in the project root for
# license information.
# --------------------------------------------------------------------------
# Modifications Copyright(C) 2024 Advanced Micro Devices, Inc. All rights reserved
Member

@snnn, could you please help check the copyright?

self.context_length = config.max_position_embeddings
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
Member

Do we need to cover the case where the config doesn't have seq_length?

Author

I believe all the models previously supported by the builder include max_position_embeddings in their configuration. Since chatglm3 does not have this parameter, we added a fallback to its seq_length so the configuration remains compatible.

Contributor

It appears that chatglm3 has the seq_length parameter here. Since most models use config.max_position_embeddings instead of config.seq_length, can we rewrite this to the following?

self.context_length = config.seq_length if hasattr(config, "seq_length") else config.max_position_embeddings

Author

I agree, this is a cleaner way to write the config. Incorporated in the refactor_context_length_config commit.
Thanks!

self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings
def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
self.context_length = config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length
Member

nit: replace "config.max_position_embeddings if hasattr(config, "max_position_embeddings") else config.seq_length" with self.context_length
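
For reference, applying this nit the line would read roughly:

self.original_context_length = config.original_max_position_embeddings if hasattr(config, "original_max_position_embeddings") else config.rope_scaling["original_max_position_embeddings"] if hasattr(config, "rope_scaling") and hasattr(config.rope_scaling, "original_max_position_embeddings") else self.context_length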

Author

Great suggestion. Incorporated in the refactor_config commit.

self.window_size = config.sliding_window if hasattr(config, "sliding_window") else -1 # default is -1 in GroupQueryAttention kernel
self.intermediate_size = config.intermediate_size
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
Contributor

Suggested change
self.intermediate_size = config.intermediate_size if hasattr(config, "intermediate_size") else config.ffn_hidden_size
self.intermediate_size = config.ffn_hidden_size if hasattr(config, "ffn_hidden_size") else config.intermediate_size

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
Contributor

Can we merge this line with the previous one to define self.num_kv_heads once?

Author

We can do something like this.

self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.multi_query_group_num if hasattr(config, "multi_query_group_num") else config.num_attention_heads
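
An equivalent sketch using getattr, which may read a bit more cleanly (illustrative only; it relies on the head counts never being zero, since or skips falsy values):

self.num_kv_heads = getattr(config, "num_key_value_heads", None) or getattr(config, "multi_query_group_num", None) or config.num_attention_heads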

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
Contributor

This does not appear to be used. Can it be removed?

Author

Yes, we can.

self.hidden_size = config.hidden_size
self.num_kv_heads = config.num_key_value_heads if hasattr(config, "num_key_value_heads") else config.num_attention_heads
self.num_kv_heads = config.multi_query_group_num if hasattr(config, "multi_query_group_num") else self.num_kv_heads
self.kv_channels = config.kv_channels if hasattr(config, "kv_channels") else self.num_kv_heads
self.multi_query_attention = config.multi_query_attention if hasattr(config, "multi_query_attention") else False
Contributor

Multi-query attention can be identified with self.num_kv_heads == 1. Since it is only used once within the ChatGLMModel class and it is a small check, I don't think this is needed.

Author

Since ChatGLMModel currently uses GQA with self.num_kv_heads = 2 here, I think we can use the following check to choose between GQA and MHA. Please let me know if we should try something else.

self.attention_attrs["op_type"] = "GroupQueryAttention" if self.num_attn_heads > self.num_kv_heads else self.attention_attrs["op_type"]

Contributor

There is a later section to handle the selection of the attention op type.

valid_gqa_configurations = [
    ("cpu", TensorProto.FLOAT),
    ("cuda", TensorProto.FLOAT16),
    ("dml", TensorProto.FLOAT16),
]
if (self.ep, self.io_dtype) in valid_gqa_configurations:
    # Change model settings for GroupQueryAttention
    self.attention_attrs["op_type"] = "GroupQueryAttention"
    print("GroupQueryAttention (GQA) is used in this model.")
    # DML doesn't support packed Q/K/V for GQA yet
    self.attention_attrs["use_packed_matmul"] = self.ep != "dml"
    # GQA + Rot.Emb. does not require `position ids` as input
    if self.ep != "dml":
        self.attention_attrs["use_rotemb_in_attn"] = True
        self.input_names.remove("position_ids")
self.past_present_share_buffer = self.attention_attrs["op_type"] == "GroupQueryAttention"

This section already handles choosing between GQA and MHA as the attention op. The set of valid GQA configurations covers most precision + execution provider combinations. Since GQA is a superset of MHA, GQA can be used both when self.num_attn_heads != self.num_kv_heads and when self.num_attn_heads == self.num_kv_heads. Thus, I don't think we need to worry about selecting the attention op for ChatGLM.

@@ -1770,11 +1795,15 @@ def make_model(self, input_path):
self.make_layer(self.layer_id, module)
self.layer_id += 1

elif module.__class__.__name__.endswith("GLMBlock") and self.layer_id < self.num_layers:
Contributor

Can we merge this condition with the previous elif block and remove this case?

elif (module.__class__.__name__.endswith("DecoderLayer") or module.__class__.__name__.endswith("GLMBlock")) and self.layer_id < self.num_layers:

Author

Great suggestion! Yes we can.

elif self.layer_id == self.num_layers and self.has_final_norm(module, model):
    # SkipLayerNorm after last decoder layer (MatMul --> SkipLayerNorm)
    print("Reading final norm")
    self.make_layernorm(self.layer_id, module, skip=True, simple=self.layernorm_attrs["simple"], location="final_norm")

Contributor

The extra blank line between the conditions helps make the different cases more readable. Can this line deletion be reverted?


def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
Contributor

Can the changes in ChatGLMModel.has_final_norm be upstreamed to Model.has_final_norm? Then, this method doesn't have to be re-defined.

Author

Should we change Model.has_final_norm like this?

def has_final_norm(self, module, model):
        # Hugging Face names
        hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
        
        if(hasattr(model, "transformer")):
            hf_final_layernorm = hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
        else:
            hf_final_layernorm = hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
        
        # GGUF names
        gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
        return hf_norm or hf_final_layernorm or gguf_final_norm

Contributor

I think you can just create another variable for ChatGLM's logic and add that variable to the return logic.

def has_final_norm(self, module, model):
    # Hugging Face names
    hf_norm = hasattr(model, "model") and hasattr(model.model, "norm") and module == model.model.norm
    hf_final_layernorm = hasattr(model, "model") and hasattr(model.model, "final_layernorm") and module == model.model.final_layernorm
    hf_transformer_final_layernorm = hasattr(model, "transformer") and hasattr(model.transformer, "encoder") and hasattr(model.transformer.encoder, "final_layernorm") and module == model.transformer.encoder.final_layernorm
    # GGUF names
    gguf_final_norm = hasattr(model, "final_norm") and module == model.final_norm
    return hf_norm or hf_final_layernorm or hf_transformer_final_layernorm or gguf_final_norm

class ChatGLMModel(Model):
    def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
        super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
        # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
Contributor

Suggested change
# self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor

def __init__(self, config, io_dtype, onnx_dtype, ep, cache_dir, extra_options):
    super().__init__(config, io_dtype, onnx_dtype, ep, cache_dir, extra_options)
    # self.input_shapes["position_ids"] = [1] # Note: This is optional and only needed if you want position_ids to be an int instead of a 2D tensor
    self.layernorm_attrs["simple"] = True # RMSNorm in ChatGLM
Contributor

The default is True so I think we can remove this.

self.layernorm_attrs = {
    "simple": True, # Use SimplifiedLayerNorm/SkipSimplifiedLayerNorm vs. LayerNorm/SkipLayerNorm

self.rotemb_attrs["partial_rotary_factor"] = 0.5 # Line 755 of modeling_chatglm.py check self.rotary_pos_emb declaration
self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
Contributor

The defaults are the same as below so I think we can remove this.

self.mlp_attrs = {
    "use_proj": True, # Use projection style for MLP (GateProj/UpProj/DownProj)
    "use_fc": False, # Use fully-connected style for MLP (FC1/FC2)

else:
raise NotImplementedError(f"The {hf_name} model is not currently supported.")
raise NotImplementedError(
f"The {hf_name} model is not currently supported. Got {config}"
Contributor

Why add the config to the error?

Author

Removed the config from the error. I think it was used for debugging internally.

self.rotemb_attrs["interleaved"] = 1
self.mlp_attrs["use_proj"], self.mlp_attrs["use_fc"] = True, False
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
Contributor

The combination of these attention_attrs settings will cause the FP32 CUDA model to be created incorrectly. For the FP32 CUDA model, the GroupQueryAttention op does not have an implementation. The fallback is to use the MultiHeadAttention op, which does not fuse rotary embeddings within it and does not handle the extra repeat_kv operation needed when num_attention_heads != num_key_value_heads. Therefore, the FP32 CUDA model would need the following settings.

  • use_rotemb_in_attn = False
  • use_packed_matmul = False
  • op_type = "MultiHeadAttention"

The subgraph for FP32 CUDA should look something like this.

# MultiHeadAttention example:
#
#               root_input
#              /     |     \
#       Q_MatMul  K_MatMul  V_MatMul  4D causal mask  past_key  past_value
#           |        |         |            |            |           |
#       Q_Rotary  K_Rotary     |            +------------+-----------+
#           |        |         |                         |
#           |     Repeat_K  Repeat_V                     |
#           \        |        /                          |
#            MultiHeadAttention--------------------------+
#                    |
#                 O_MatMul
#                    |
#                  O_Add
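
A minimal sketch of how those settings might be applied for the FP32 CUDA case, assuming the builder's existing self.ep and self.io_dtype fields (illustrative only, not the exact change requested):

if self.ep == "cuda" and self.io_dtype == TensorProto.FLOAT:
    # GroupQueryAttention has no FP32 CUDA kernel, so fall back to MultiHeadAttention
    self.attention_attrs["op_type"] = "MultiHeadAttention"
    # MHA does not fuse rotary embeddings or packed Q/K/V, so disable both
    self.attention_attrs["use_rotemb_in_attn"] = False
    self.attention_attrs["use_packed_matmul"] = False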

Contributor

This relates to #880. For our use case, we would like GroupQueryAttention, and our custom EP would have an implementation for it. I'm OK with changing the default behavior to adhere to what is available on CPU/CUDA, as long as there is a way to override and generate the model with selected ops.

@@ -1432,14 +1433,21 @@ def make_mlp(self, layer_id, mlp, root_input):
raise NotImplementedError(f"The MLP layer type is not set.")

def make_mlp_unpacked(self, layer_id, mlp, root_input):
packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(
mlp, "dense_h_to_4h", None
Contributor

Can these definitions be made in one line?
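
For example, the same expression on a single line:

packed_proj = getattr(mlp, "gate_up_proj", None) or getattr(mlp, "dense_h_to_4h", None)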

self.rotemb_attrs["rotary_embedding_dim"] = int(self.head_size * self.rotemb_attrs["partial_rotary_factor"])
self.rotemb_attrs["interleaved"] = 1
self.attention_attrs["use_rotemb_in_attn"] = True
self.attention_attrs["use_packed_matmul"] = True
Contributor

The self.attention_attrs["use_rotemb_in_attn"] = True and self.attention_attrs["use_packed_matmul"] = True lines can be deleted since these will be automatically set when the attention op is set.

@@ -1469,8 +1477,11 @@ def make_mlp_proj(self, layer_id, mlp, root_input):
self.make_mul(mul_name, mul_inputs, dtype=self.io_dtype, shape=["batch_size", "sequence_length", self.intermediate_size])

# Make output MatMul node
down_proj = getattr(mlp, "down_proj", None) or getattr(
mlp, "dense_4h_to_h", None
Contributor

Can this definition be made in one line?
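
For example:

down_proj = getattr(mlp, "down_proj", None) or getattr(mlp, "dense_4h_to_h", None)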

@@ -94,31 +95,42 @@ def __init__(self, quant_type, input_path, bits, group_size, q_size, kv_size, in
layer_id = 0
for weight_file in os.listdir(input_path):
if weight_file.endswith(".safetensors"):
module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
module = self.layers.setdefault(
layer_id, QuantizedDecoderLayer(layer_id, bits, group_size)
Contributor

Can this definition be made in one line?

module = self.layers.setdefault(layer_id, QuantizedDecoderLayer(layer_id, bits, group_size))
module = self.layers.setdefault(
layer_id,
QuantizedDecoderLayer(layer_id, bits, group_size),
Contributor

Can this definition be made in one line?

q_dim = q_size // (32 // bits) if quant_type in {"awq", "gptq"} else q_size
kv_dim = kv_size // (32 // bits) if quant_type in {"awq", "gptq"} else kv_size
module.self_attn.q_proj.qzeros = tensor[:, : q_dim]
module.self_attn.k_proj.qzeros = tensor[:, q_dim : q_dim + kv_dim]
module.self_attn.v_proj.qzeros = tensor[:, q_dim + kv_dim :]
elif bool(re.match(r"^model.layers\.\d+\.self_attn.qkv_proj\.g_idx$", name)):
elif bool(re.match(r"^model.layers\.\d+\.(self_attn.qkv_proj|self_attention.query_key_value)\.g_idx$", name)):
# model.layers.layer_id.self_attn.qkv_proj.g_ix
Contributor

Suggested change
# model.layers.layer_id.self_attn.qkv_proj.g_ix
# model.layers.layer_id.self_attn.qkv_proj.g_idx

# model.layers.layer_id.mlp.gate_up_proj.g_idx
# model.layers.layer_id.mlp.dense_h_to_4h.g_idx
module.mlp.gate_proj.g_idx = tensor
module.mlp.up_proj.g_idx = tensor
else:
Contributor

Should we add a condition for gate_up_proj.bias/dense_h_to_4h.bias?
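
A hypothetical sketch of such a branch, following the pattern of the surrounding elif cases (the split point intermediate_size is an assumed variable, and the half-and-half [gate | up] layout of the packed bias is an assumption, not something shown in this diff):

elif bool(re.match(r"^model.layers\.\d+\.(mlp.gate_up_proj|mlp.dense_h_to_4h)\.bias$", name)):
    # model.layers.layer_id.mlp.gate_up_proj.bias
    # model.layers.layer_id.mlp.dense_h_to_4h.bias
    module.mlp.gate_proj.bias = tensor[: intermediate_size]  # assumed: packed bias is [gate | up]
    module.mlp.up_proj.bias = tensor[intermediate_size :]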
