
make mistral3 pass on xpu#37882

Merged
ydshieh merged 9 commits intohuggingface:mainfrom
yao-matrix:mistral3-xpu
May 9, 2025

Conversation

@yao-matrix
Contributor

@ydshieh , pls help review, thx

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft April 30, 2025 08:03
@Rocketknight1
Member

cc @IlyasMoutawwakil

Collaborator

@ydshieh ydshieh left a comment


Thanks.

Looks like none of those integration tests can run on T4; they all hit GPU OOM.

Let me first see whether we have another workaround.

@ydshieh
Collaborator

ydshieh commented Apr 30, 2025

Would you be up for trying

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

and run the integration tests? I am using CPU offload, so 3 tests can run on A10.

I can't make test_mistral3_integration_batched_generate_multi_image work: it OOMs without 4-bit, and hits other errors with cpu-offload + 4-bit.

FAILED tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

On T4, it hangs forever ...
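
For reference, a minimal sketch of how 4-bit loading is typically wired up in transformers (assuming bitsandbytes is installed; this is illustrative, not necessarily the exact setup used here):

    import torch
    from transformers import BitsAndBytesConfig, Mistral3ForConditionalGeneration

    # Illustrative 4-bit config; bitsandbytes quantizes fp16/bf16/fp32 weights,
    # so handing it tensors that are already uint8 appears to be what triggers the
    # "Blockwise quantization only supports 16/32-bit floats" ValueError above.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

    model = Mistral3ForConditionalGeneration.from_pretrained(
        "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
        quantization_config=bnb_config,
        device_map="auto",
    )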

@ydshieh
Collaborator

ydshieh commented Apr 30, 2025

Thanks.

Looks like none of those integration tests can run on T4; they all hit GPU OOM.

Let me first see whether we have another workaround.

Hmm, I am able to avoid OOM for test_mistral3_integration_batched_generate_multi_image by using smaller images.

But before I move forward, it would be nice if you could check whether this cpu_offload works on XPU 🙏

@yao-matrix
Contributor Author

Would you be up for trying

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

and run the integration tests? I am using CPU offload, so 3 tests can run on A10.

I can't make test_mistral3_integration_batched_generate_multi_image work: it OOMs without 4-bit, and hits other errors with cpu-offload + 4-bit.

FAILED tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

On T4, it hangs forever ...

@ydshieh , sorry for the late response; I'm just back in the office after a 5-day holiday. Yes, I can try your offload changes, but it seems I cannot fetch the mistral3-xpu-cpu-offload branch when I run your command. Could you help check it? Thanks.

@yao-matrix yao-matrix marked this pull request as ready for review May 6, 2025 05:25
@ydshieh
Collaborator

ydshieh commented May 6, 2025

Sorry, it should be

git fetch https://github.com/huggingface/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

You don't need to update the expected values; just check whether XPU works well with CPU offloading.
Once I change the input images to new, smaller ones, we can check the expected values again 🙏

@ydshieh
Collaborator

ydshieh commented May 6, 2025

BTW, it seems this CPU offload produces different outputs in single-device vs. multi-device environments.

I have to set execution_device="cuda:0" to get the same outputs in both cases.

Anyway, let's see if it can at least run on XPU without error, then we can adjust the outputs.
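
For illustration, a minimal sketch of pinning the offload execution device (the checkpoint and dtype follow the test's setUp; this is a sketch of the idea, not the exact change):

    import accelerate
    import torch
    from transformers import Mistral3ForConditionalGeneration

    model = Mistral3ForConditionalGeneration.from_pretrained(
        "mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16
    )
    # Pin the execution device so single-GPU and multi-GPU machines run the
    # forward pass on the same device and therefore produce the same outputs.
    accelerate.cpu_offload(model, execution_device="cuda:0")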

@yao-matrix
Contributor Author

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image

  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.

That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.

If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold (see the sketch below).
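
For illustration, a minimal sketch of such a gate built on unittest.skipUnless (the helper and the 40 GB threshold are hypothetical; diffusers' actual require_big_accelerator may be implemented differently):

    import unittest

    import torch

    def _accelerator_vram_gb() -> float:
        # Hypothetical helper: total accelerator memory in GB, 0 if none is found.
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).total_memory / 1024**3
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            props = torch.xpu.get_device_properties(0)
            return getattr(props, "total_memory", 0) / 1024**3
        return 0.0

    def require_big_accelerator(test_case, min_gb: float = 40.0):
        # Skip the test unless the accelerator exposes at least `min_gb` of VRAM.
        return unittest.skipUnless(
            _accelerator_vram_gb() >= min_gb, f"requires an accelerator with >= {min_gb} GB VRAM"
        )(test_case)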

@ydshieh
Collaborator

ydshieh commented May 7, 2025

You mean 33.6G instead of 33.6T, right? Yes, it is (CPU) memory hungry, but not necessarily slower than running on GPU (let me double-check).

BTW, how much CPU RAM is available on your XPU machine? In my case, I need to ask our infra to provide a single T4 with 64G.
Also, how much device RAM does your XPU accelerator have? On A10 24GB, I also get one or two of them failing (OOM).

@yao-matrix
Contributor Author

yao-matrix commented May 7, 2025

You mean 33.6G instead of 33.6T, right? Yes, it is (CPU) memory hungry, but not necessarily slower than running on GPU (let me double-check).

BTW, how much CPU RAM is available on your XPU machine? In my case, I need to ask our infra to provide a single T4 with 64G. Also, how much device RAM does your XPU accelerator have? On A10 24GB, I also get one or two of them failing (OOM).

Actually, it's 33.6 TB shown in the VIRT column, so when it runs into the 3rd case it often hangs. I validated in my A100 environment and it's OK there, so it's my environment issue; you can ignore it.

I am using the Ponte Vecchio 1150, which has 64GB of VRAM, so I suppose that's enough.


@faaany
Contributor

faaany commented May 8, 2025

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image
  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.

That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.

If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold.

In my environment, I found the slowness mainly comes from downloading the model. Once the model is downloaded, the tests pass pretty quickly.

@yao-matrix
Contributor Author

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image
  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.
That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.
If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold.

In my environment, I found the slowness mainly comes from downloading the model. Once the model is downloaded, the tests pass pretty quickly.

So it's my environment issue. Thanks @faaany for testing.

@ydshieh
Collaborator

ydshieh commented May 8, 2025

OK, thank you both very much. Let's use CPU offloading and smaller images so that T4 and A10 can run (most of) them. I will push some commits back to this PR.

yao-matrix and others added 4 commits May 8, 2025 19:13
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
@ydshieh
Collaborator

ydshieh commented May 8, 2025

Hi @yao-matrix, please try the following (after deleting the local mistral3-xpu branch if it is still on your system):

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu:mistral3-xpu && git checkout mistral3-xpu

This runs on both T4 16G and A10 24G without any GPU OOM and matches the expected values.

If you want to keep tests running directly on XPU without using CPU offload, you can tweak

    def setUp(self):
        cleanup(torch_device, gc_collect=True)
        self.model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
        self.model = Mistral3ForConditionalGeneration.from_pretrained(
            self.model_checkpoint, torch_dtype=torch.bfloat16
        )
        accelerate.cpu_offload(self.model, execution_device=torch_device)

with an if / else (a sketch follows below). Once you are happy with the results, ping me and I will merge.
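
For example, a minimal sketch of that if / else, assuming the XPU runner has enough device memory to skip offloading (illustrative only, not the exact code to be merged):

    def setUp(self):
        cleanup(torch_device, gc_collect=True)
        self.model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
        self.model = Mistral3ForConditionalGeneration.from_pretrained(
            self.model_checkpoint, torch_dtype=torch.bfloat16
        )
        if torch_device == "xpu":
            # Large-memory XPU: keep the model fully on the device.
            self.model.to(torch_device)
        else:
            # T4 16GB / A10 24GB: offload weights to CPU to avoid GPU OOM.
            accelerate.cpu_offload(self.model, execution_device=torch_device)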

@yao-matrix
Contributor Author

@ydshieh , cool, OK from my side. Please feel free to merge.

@ydshieh
Collaborator

ydshieh commented May 9, 2025

I think XPU will still have 2 cases failing due to ground-truth mismatches, as you mentioned (but you said those are fine).
I will merge as is; if XPU needs an update of the expected values, we can do it in another PR.

Thank you for the patience.

@ydshieh ydshieh enabled auto-merge (squash) May 9, 2025 06:29
@ydshieh ydshieh merged commit 1dfad4b into huggingface:main May 9, 2025
14 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yao-matrix yao-matrix deleted the mistral3-xpu branch May 9, 2025 07:06
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* enabled mistral3 test cases on XPU

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* calibrate A100 expectation

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* update

* update

* update

* update

* update

* update

---------

Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
