
fix gemma-2-27b text generation pytest #1828

Closed
skaulintel wants to merge 1 commit into transformers_4_49 from skaulintel/gemma2_pytest_fix

Conversation

@skaulintel
Contributor

Fixes the following pytest:

python -m pytest tests/test_text_generation_example.py tests/test_encoder_decoder.py -v -s -k "gemma-2-27b and test_text_generation_bf16_1x" --token=****

Without it, the test fails with the following AssertionError:

E           AssertionError: assert False
E            +  where False = <built-in function eq>('DeepSpeed is a machine learning framework that allows you to train deep learning models at any scale, from a single GPU to thousands of GPUs. It is a system that allows you to train models in a distributed environment.\n\nDeepSpeed is a deep learning training system that allows you to train models in a distributed environment. It is a system that allows you to train models in a distributed environment.\n\nThe DeepSpeed system is a deep learning training system that is designed to help you train deep learning models in a distributed environment.\n\nThe Deep', 'DeepSpeed is a machine learning framework that enables you to train models with trillions of parameters and beyond, using model parallelism to partition large models over multiple GPUs.\n\nThe following is a brief introduction to the DeepSpeed model parallel training.\n\n<h2>1. Introduction</h2>\n\nThe DeepSpeed model parallel training is a simple and effective way to train large models. It is a framework that enables you to train models with trillions of parameters and beyond.\n\nDeepSpeed is a distributed deep learning optimization toolkit that makes it easy and efficient')

conftest.py:74: AssertionError
========================================================================================== short test summary info ===========================================================================================
FAILED tests/test_text_generation_example.py::test_text_generation_bf16_1x[google/gemma-2-27b-1-False-True] - AssertionError: assert False

@regisss
Collaborator

regisss commented Mar 7, 2025

I don't think there is an issue with Gemma2. The reason why I added the code block

if self.config.final_logit_softcapping is not None:
    ...

is because it has been in Transformers since Gemma2 was added. I'm not sure why it was not included here in #1280 and #1504 (any idea @billishyahao @Luca-Calabria ?).

final_logit_softcapping is actually specified in the configuration of the model, so this piece of code is indeed used.
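For context, a simplified scalar sketch of what that softcapping block computes in upstream Transformers: divide the logit by the cap, squash with tanh, multiply back. The real code operates on torch tensors and reads the cap from the model config; the cap value of 30.0 below is illustrative, not pulled from this PR.

```python
import math

def soft_cap(logit: float, cap: float = 30.0) -> float:
    """Scalar sketch of Gemma2 final logit softcapping.

    Squashes an unbounded logit into the interval [-cap, cap] while
    keeping it monotonic, so the relative ordering of logits is
    preserved but extreme values are compressed.
    """
    return math.tanh(logit / cap) * cap
```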

Moreover, the output of the model with this change still makes sense:

DeepSpeed is a machine learning framework that allows you to train deep learning models at any scale, from a single GPU to thousands of GPUs. It is a system that allows you to train models in a distributed environment.\n\nDeepSpeed is a deep learning training system that allows you to train models in a distributed environment. It is a system that allows you to train models in a distributed environment.\n\nThe DeepSpeed system is a deep learning training system that is designed to help you train deep learning models in a distributed environment.\n\nThe Deep

I think what we should do here is rather to update the baseline here:

"output": "DeepSpeed is a machine learning framework that enables you to train models with trillions of parameters and beyond, using model parallelism to partition large models over multiple GPUs.\n\nThe following is a brief introduction to the DeepSpeed model parallel training.\n\n<h2>1. Introduction</h2>\n\nThe DeepSpeed model parallel training is a simple and effective way to train large models. It is a framework that enables you to train models with trillions of parameters and beyond.\n\nDeepSpeed is a distributed deep learning optimization toolkit that makes it easy and efficient",

@uartie
Contributor

uartie commented Mar 7, 2025

> I think what we should do here is rather to update the baseline here: [...]

You can use rebase to update the baseline:

python -m pytest --rebase tests/test_text_generation_example.py::test_text_generation_bf16_1x[google/gemma-2-27b-1-False-True]

@skaulintel
Contributor Author

> Moreover, the output of the model with this change still makes sense: [...]
>
> I think what we should do here is rather to update the baseline here: [...]

It makes sense, but there is a lot of repetition. The output before this change seemed a little better.

@regisss
Collaborator

regisss commented Mar 7, 2025

This happens with greedy search, especially with models that have not been instruction fine-tuned. I'll take a look to see how to get more realistic results by tweaking a few generation parameters.
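For readers wondering what "tweaking a few generation parameters" could look like: one common knob is a repetition penalty. Below is a minimal standalone sketch of the rule that Transformers' RepetitionPenaltyLogitsProcessor applies; it is illustrative only, not the code under discussion in this PR.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    # Tokens that already appear in the generated sequence get their
    # logit divided by `penalty` when positive, or multiplied when
    # negative, making them less likely to be picked again.
    out = list(logits)
    for tok in set(generated_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

With greedy search this shifts the argmax away from recently emitted tokens, which is exactly the repetition pattern visible in the outputs above.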

@Luca-Calabria
Contributor

> I don't think there is an issue with Gemma2. [...] I'm not sure why it was not included here in #1280 and #1504 (any idea @billishyahao @Luca-Calabria ?). [...]

I don't have a clear answer as to why it was not part of the Gemma2 enabling PRs, but if this block was part of Transformers and was not integrated into Gemma2 for Gaudi, then it is something to add.
The baseline should be updated to match the new output.

@regisss
Collaborator

regisss commented Mar 10, 2025

@skaulintel It seems casting the logits to float when they are extracted from the forward pass of the model solves it: 02c4aa0#diff-c7b7c0b91ade41a0c87f1ad1f6784e4d51fb88c6a65f350042aca052b7ca1558R960

This used to be done in previous versions of Transformers. They have since removed it, but it seems to slightly affect a few models on Gaudi, so I reverted this change in the commit posted above. Closing this PR.
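Why a float cast changes anything: bf16 keeps only 7 mantissa bits, so two logits that are distinct in float32 can collapse to the same bf16 value and flip a greedy argmax. A toy illustration, simulating bf16 by truncating float32 bits (real bf16 casts may round-to-nearest rather than truncate, but the collapse effect is the same):

```python
import struct

def to_bf16(x: float) -> float:
    # Simulate bfloat16 by keeping only the top 16 bits of a float32
    # (1 sign + 8 exponent + 7 mantissa bits).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Two nearly tied logits: distinct in float32, identical in bf16, so
# greedy decoding can pick a different token depending on the dtype
# in which the logits are compared.
logits = [10.1234, 10.1239]
argmax_fp32 = max(range(len(logits)), key=lambda i: logits[i])
argmax_bf16 = max(range(len(logits)), key=lambda i: to_bf16(logits[i]))
```

Casting the extracted logits to float before sampling/argmax avoids this collapse, which is consistent with the small output differences observed across commits.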

@regisss regisss closed this Mar 10, 2025
@skaulintel
Contributor Author

> It seems casting the logits to float when they are extracted from the forward pass of the model solves it [...] Closing this PR.

So do we need to update the corresponding unit test?

@regisss
Collaborator

regisss commented Mar 10, 2025

> So do we need to update the corresponding unit test?

Nope, since it generates the exact same output as before when using the cast to float

@skaulintel
Contributor Author

skaulintel commented Mar 10, 2025

> > So do we need to update the corresponding unit test?
>
> Nope, since it generates the exact same output as before when using the cast to float

That doesn't seem to be the case for me. I collected some data on Gaudi3:

transformers_4_49 commit 6edca72:

'DeepSpeed is a machine learning framework that enables you to train large models on a single GPU. It is a framework that is used to train large models on a single GPU.\n\nThe main idea is to use a large amount of memory to fit the model on a single GPU.\n\nThe main idea of \u200b\u200bthe algorithm is to use the gradient of the loss function to update the model parameters.\n\nThe main idea of \u200b\u200bthe algorithm is to use the gradient of the loss function to update the model parameters.\n\nThe main idea of'

transformers_4_49 commit 11140b2:

'DeepSpeed is a machine learning framework that is designed to help you train your models faster and more efficiently. It is a collection of multi-GPU training techniques that can be used together or separately to improve the performance of your model.\n\nDeepSpeed is a system that allows you to train your models faster and more efficiently.\n\n<h2>What is DeepSpeed?</h2>\n\nDeepSpeed is a deep learning optimization toolkit that makes it easier to enable and customize deep learning optimization. It offers 1-2.5x speed increase compared to other'

Reference, which I think we should update:

"DeepSpeed is a machine learning framework that enables you to train large models on a single GPU. It is a framework that is used to train large models on a single GPU.\n\nThe main idea is to use a large amount of memory to fit the model on a single GPU.\n\nThe main idea is to use a large amount of memory to fit the model on a single GPU.\n\nThe main idea is to use a large amount of memory to fit the model on a single GPU.\n\nDeepSpeed is a framework that allows you"

@regisss
Collaborator

regisss commented Mar 11, 2025

> That doesn't seem to be the case for me. I collected some data on Gaudi3: [...]

I thought I had added the change for Mixtral too, but that was not the case; #1839 should solve it.

edit: ah wait, this is Gemma2, let me see

edit2: okay, I only tested on Gaudi2, which is why I didn't hit the same issue. I just pushed 96c8a32 to correct the Gaudi3 baseline. Let me know if that works for you.

@skaulintel
Contributor Author

python -m pytest tests/test_text_generation_example.py tests/test_encoder_decoder.py -v -s -k "gemma-2-27b and test_text_generation_bf16_1x" --token=

Yes, it works for me now. Thanks!

@regisss regisss deleted the skaulintel/gemma2_pytest_fix branch March 11, 2025 18:09