
make mistral3 pass on xpu#37882

Merged
ydshieh merged 9 commits intohuggingface:mainfrom
yao-matrix:mistral3-xpu
May 9, 2025

Conversation

@yao-matrix
Contributor

@ydshieh , pls help review, thx

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft April 30, 2025 08:03
@Rocketknight1
Member

cc @IlyasMoutawwakil

Collaborator

@ydshieh ydshieh left a comment


Thanks.

Looks like none of those integration tests can run on T4; they all hit GPU OOM.

Let me first see whether we have another workaround.

@ydshieh
Collaborator

ydshieh commented Apr 30, 2025

Would you be up for trying

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

and run the integration tests? I am using CPU offload, so 3 tests can run on A10.

I can't make test_mistral3_integration_batched_generate_multi_image work: it OOMs without 4-bit, and hits other errors with cpu-offload + 4-bit.

FAILED tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

On T4, it hangs forever ...
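
For reference, a minimal sketch of how 4-bit loading is typically wired up in transformers (assuming bitsandbytes is installed; this is illustrative, not necessarily the exact setup used here):

    import torch
    from transformers import BitsAndBytesConfig, Mistral3ForConditionalGeneration

    # Illustrative 4-bit config; bitsandbytes quantizes fp16/bf16/fp32 weights,
    # so handing it tensors that are already uint8 appears to be what triggers the
    # "Blockwise quantization only supports 16/32-bit floats" ValueError above.
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

    model = Mistral3ForConditionalGeneration.from_pretrained(
        "mistralai/Mistral-Small-3.1-24B-Instruct-2503",
        quantization_config=bnb_config,
        device_map="auto",
    )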

@ydshieh
Collaborator

ydshieh commented Apr 30, 2025

Thanks.

Looks like none of those integration tests can run on T4; they all hit GPU OOM.

Let me first see whether we have another workaround.

Hmm, I am able to avoid OOM for test_mistral3_integration_batched_generate_multi_image by using smaller images.

But before I move forward, it would be nice if you could check whether this cpu_offload works on XPU 🙏

@yao-matrix
Contributor Author

Would you be up for trying

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

and run the integration tests? I am using CPU offload, so 3 tests can run on A10.

I can't make test_mistral3_integration_batched_generate_multi_image work: it OOMs without 4-bit, and hits other errors with cpu-offload + 4-bit.

FAILED tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image - ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

On T4, it hangs forever ...

@ydshieh , sorry for the late response; I'm just back in the office after a 5-day holiday. Yes, I can try your offload changes, but it seems I cannot fetch the mistral3-xpu-cpu-offload branch when I run your command. Could you help check it? Thanks.

@yao-matrix yao-matrix marked this pull request as ready for review May 6, 2025 05:25
@ydshieh
Collaborator

ydshieh commented May 6, 2025

Sorry, it should be

git fetch https://github.com/huggingface/transformers.git mistral3-xpu-cpu-offload:mistral3-xpu-cpu-offload && git checkout mistral3-xpu-cpu-offload

You don't need to update the expected values; just check whether XPU works well with CPU offloading.
Once I change the input images to new, smaller ones, we can check the expected values again 🙏

@ydshieh
Collaborator

ydshieh commented May 6, 2025

BTW, it seems this CPU offload produces different outputs in single-device vs. multi-device environments.

I have to set execution_device="cuda:0" to get the same outputs in both cases.

Anyway, let's see if it can at least run on XPU without error, then we can adjust the outputs.
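
For illustration, a minimal sketch of pinning the offload execution device (the checkpoint and dtype follow the test's setUp; this is a sketch of the idea, not the exact change):

    import accelerate
    import torch
    from transformers import Mistral3ForConditionalGeneration

    model = Mistral3ForConditionalGeneration.from_pretrained(
        "mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16
    )
    # Pin the execution device so single-GPU and multi-GPU machines run the
    # forward pass on the same device and therefore produce the same outputs.
    accelerate.cpu_offload(model, execution_device="cuda:0")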

@yao-matrix
Contributor Author

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image

  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.

That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.

If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold (see the sketch below).
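
For illustration, a minimal sketch of such a gate built on unittest.skipUnless (the helper and the 40 GB threshold are hypothetical; diffusers' actual require_big_accelerator may be implemented differently):

    import unittest

    import torch

    def _accelerator_vram_gb() -> float:
        # Hypothetical helper: total accelerator memory in GB, 0 if none is found.
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).total_memory / 1024**3
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            props = torch.xpu.get_device_properties(0)
            return getattr(props, "total_memory", 0) / 1024**3
        return 0.0

    def require_big_accelerator(test_case, min_gb: float = 40.0):
        # Skip the test unless the accelerator exposes at least `min_gb` of VRAM.
        return unittest.skipUnless(
            _accelerator_vram_gb() >= min_gb, f"requires an accelerator with >= {min_gb} GB VRAM"
        )(test_case)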

@ydshieh
Collaborator

ydshieh commented May 7, 2025

You mean 33.6G instead of 33.6T, right? Yes, it is (CPU) memory hungry, but not necessarily slower than running on GPU (let me double-check).

BTW, how much CPU RAM is available on your XPU machine? In my case, I need to ask our infra to provide a single T4 with 64G.
Also, how much device RAM does your XPU accelerator have? On A10 24GB, I also get one or two of them failing (OOM).

@yao-matrix
Contributor Author

yao-matrix commented May 7, 2025

You mean 33.6G instead of 33.6T, right? Yes, it is (CPU) memory hungry, but not necessarily slower than running on GPU (let me double-check).

BTW, how much CPU RAM is available on your XPU machine? In my case, I need to ask our infra to provide a single T4 with 64G. Also, how much device RAM does your XPU accelerator have? On A10 24GB, I also get one or two of them failing (OOM).

Actually, it's 33.6 TB shown in the VIRT column, so when it runs into the 3rd case it often hangs. I validated in my A100 environment and it's OK there, so it's my environment issue; you can ignore it.

I am using the Ponte Vecchio 1150, which has 64GB of VRAM, so I suppose that's enough.


@faaany
Contributor

faaany commented May 8, 2025

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image
  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.

That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.

If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold.

In my environment, I found the slowness mainly comes from downloading the model. Once the model is downloaded, the tests pass pretty quickly.

@yao-matrix
Contributor Author

@ydshieh , I tested the 4 cases with cpu_offload() on XPU:

  • 2 cases failed due to ground-truth mismatches, which is fine
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_batched_generate_multi_image
  • 2 cases passed
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate
    tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest::test_mistral3_integration_generate_text_only

So it works on XPU.
That said, I found the tests become pretty slow after enabling CPU offload, and if we run all 4 cases in one pytest command (using pytest -rA tests/models/mistral3/test_modeling_mistral3.py::Mistral3IntegrationTest), the process tends to hang; its VIRT memory grows to tens of TB (33.6 TB in one of my experiments). Running each case separately is fine. I don't know whether you observe something similar in your environment.
If you observe the same, I don't think it's suitable to use cpu_offload, since it risks breaking the CI. Maybe we can instead consider the require_big_accelerator decorator used by diffusers, so these cases only run on accelerators whose VRAM is larger than a given threshold.

In my environment, I found the slowness mainly comes from downloading the model. Once the model is downloaded, the tests pass pretty quickly.

So it's my environment issue. Thanks @faaany for testing.

@ydshieh
Collaborator

ydshieh commented May 8, 2025

OK, thank you both very much. Let's use CPU offloading and smaller images so that T4 and A10 can run (most of) them. I will push some commits back to this PR.

yao-matrix and others added 4 commits May 8, 2025 19:13
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
@ydshieh
Collaborator

ydshieh commented May 8, 2025

Hi @yao-matrix, please try the following (after deleting the local mistral3-xpu branch if it is still on your system):

git fetch https://github.com/yao-matrix/transformers.git mistral3-xpu:mistral3-xpu && git checkout mistral3-xpu

This runs on both T4 16G and A10 24G without any GPU OOM and matches the expected values.

If you want to keep tests running directly on XPU without using CPU offload, you can tweak

    def setUp(self):
        cleanup(torch_device, gc_collect=True)
        self.model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
        self.model = Mistral3ForConditionalGeneration.from_pretrained(
            self.model_checkpoint, torch_dtype=torch.bfloat16
        )
        accelerate.cpu_offload(self.model, execution_device=torch_device)

with an if / else (a sketch follows below). Once you are happy with the results, ping me and I will merge.
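
For example, a minimal sketch of that if / else, assuming the XPU runner has enough device memory to skip offloading (illustrative only, not the exact code to be merged):

    def setUp(self):
        cleanup(torch_device, gc_collect=True)
        self.model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
        self.model = Mistral3ForConditionalGeneration.from_pretrained(
            self.model_checkpoint, torch_dtype=torch.bfloat16
        )
        if torch_device == "xpu":
            # Large-memory XPU: keep the model fully on the device.
            self.model.to(torch_device)
        else:
            # T4 16GB / A10 24GB: offload weights to CPU to avoid GPU OOM.
            accelerate.cpu_offload(self.model, execution_device=torch_device)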

@yao-matrix
Contributor Author

@ydshieh , cool, OK from my side. Please feel free to merge.

@ydshieh
Collaborator

ydshieh commented May 9, 2025

I think XPU will still have 2 cases failing due to ground-truth mismatches, as you mentioned (but you said those are fine).
I will merge as is; if XPU needs an update of the expected values, we can do it in another PR.

Thank you for the patience.

@ydshieh ydshieh enabled auto-merge (squash) May 9, 2025 06:29
@ydshieh ydshieh merged commit 1dfad4b into huggingface:main May 9, 2025
14 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yao-matrix yao-matrix deleted the mistral3-xpu branch May 9, 2025 07:06
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* enabled mistral3 test cases on XPU

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* calibrate A100 expectation

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* update

* update

* update

* update

* update

* update

---------

Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
