Use MPS backend on Apple Silicon devices if it's available. (Updated) #113

Open · wants to merge 2 commits into main from use_mps_on_apple_silicon
Conversation

@niw (Contributor) commented on Oct 16, 2024

This is a slightly updated version of #108. Since #108 was accidentally merged and then reverted, and I can no longer update it with new changes, this pull request was newly created.

MacBook users who want to try it can follow the steps below. It needs plenty of memory, probably about 32GB.

  1. Clone this repository (jy0205/Pyramid-Flow) and apply the changes from this pull request.
  2. Install Python 3.10, for example with Homebrew: brew install [email protected]
  3. Create a virtualenv and install the dependencies: python3.10 -m venv .venv && .venv/bin/pip3 install -r requirements.txt
  4. Install the extra dependency gradio for app.py: .venv/bin/pip3 install gradio
  5. Run .venv/bin/python3 app.py, then open http://127.0.0.1:7860/.

To generate a video, try the minimum settings first. It takes a long time on a MacBook anyway (about 10 minutes for a 3-second video, for example; still remarkable, though!)

  • Model resolution: 384p
  • Duration: 2 or 3

Problems

Inference runs faster using the MPS backend on Apple Silicon devices, but it is not enabled by default and requires some modification to the code, which currently only checks for CUDA availability.

Solution

Use the MPS backend if it's available. A rough sketch of the selection logic follows the list below.

  • Use a compatible dtype.
  • Enable CPU offloading only if the CUDA backend is used (it is disabled for the MPS backend, where it is unnecessary because of unified memory).
  • Use the latest pytorch, which seems to be required to address a VAE issue.
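
For illustration, the selection is roughly like this (a simplified sketch, not the exact diff in this pull request; the variable names are mine):

    import torch

    # Prefer CUDA, then MPS, then fall back to CPU.
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"

    # Pick a dtype the backend handles well; bfloat16 support on MPS
    # depends on the PyTorch/macOS version, so float32 is the safe default.
    model_dtype = torch.float32 if device == "mps" else torch.bfloat16

    # CPU offloading only makes sense on CUDA; on MPS, unified memory
    # makes it unnecessary.
    cpu_offloading = device == "cuda"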

NOTE: This patch does not take training into account at all; it is only for inference. I tried to make it work as well as CUDA, but it requires, for example, a dependency update that may not be preferred, so I don't expect this pull request to be mergeable into main for now. I'm posting it anyway because I think it's worth having for those who want to try inference easily on, say, their MacBook Pro.

@rahulbudhrani01 commented on Oct 20, 2024

Hey, I tried applying your patch locally, but I keep getting this error when trying to generate:
[ERROR] Error during image-to-video generation: Conv3D is not supported on MPS

Can you please help resolve this? I'm running it on an M1 Mac with 32GB.
Torch version -> 2.1.2
Torchvision version -> 0.16.2
torch.has_mps -> true

@niw (Contributor, Author) commented on Oct 20, 2024

@rahulbudhrani01 You need to update torch and torchvision to a nightly build; even the latest released version cannot work properly, especially the VAE encoder.
See this change to requirements.txt.

@YAY-3M-TA3 commented

Hello, I've been working on getting this to work on the MPS backend too. My system is an M2 with 24GB RAM. I set up Python 3.10.13 with nightly PyTorch (2.6.0).

I'm using the 384 model.

I also made modifications throughout the code base, such as changing "CUDA" device references to "MPS", switching float64 to bfloat16 where it errored (rope), and adding an extra with torch.autocast("mps", dtype=torch.bfloat16): around the text-encoder block in the pipeline (otherwise prompt_embeds would be NaN). With that, I can generate a 1-frame video:
https://github.com/user-attachments/assets/210ed9d3-24f2-474e-b2bc-5004ecb632e9
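
For reference, the text-encoder change described above amounts to something like this (an illustrative sketch; text_encoder and input_ids stand in for the pipeline's actual names):

    # Run only the text encoder under autocast; without this,
    # prompt_embeds came back as NaN on this setup.
    with torch.autocast("mps", dtype=torch.bfloat16):
        prompt_embeds = text_encoder(input_ids)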

But if I try to generate more than 1 frame, I get this...

32fecb56-f0d6-4fba-9022-f794da28b1df_text_to_video_sample.mp4

I'm wondering if you might have an idea what the problem is. This looks like what happens when you try to create a Pyramid-Flow video with a resolution other than 640x384.

@niw (Contributor, Author) commented on Oct 21, 2024

@YAY-3M-TA3 See this pull request for more details, but here is what you need to do, inline.

switching float64 to bfloat16 where it errored (rope)
adding an extra with torch.autocast("mps", dtype=torch.bfloat16): around the text-encoder block in the pipeline (otherwise prompt_embeds would be NaN)

MPS doesn't support bfloat16, I believe, and torch autocast may not be working on MPS either, so you probably just need to use float32 and disable autocast.
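
In code, that suggestion is roughly this (a sketch under my assumption above; device, text_encoder, and input_ids are illustrative names):

    # Fall back to float32 and drop the autocast context entirely on MPS.
    dtype = torch.float32 if device == "mps" else torch.bfloat16
    prompt_embeds = text_encoder(input_ids)  # no autocast wrapper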

If I try to generate more than 1 frame, I get this...
I set up Python 3.10.13 with nightly PyTorch (2.6.0).

In my case, I also saw this colorful output produced in the VAE encoding step, and it was caused by an old PyTorch. When I used the latest nightly PyTorch, 2.6.0.dev, it worked. I haven't really traced which part of VAE encoding causes the issue, though.

So... I recommend verifying which version of PyTorch is really being used, and also trying float32 instead.

@YAY-3M-TA3 commented on Oct 21, 2024

@YAY-3M-TA3 See this pull request for more details, but here is what you need to do, inline.

Yeah, I grabbed your code and ran it. The model initializes to float32, but on my small 24GB M2 it OOMs (it looks like it needs at least 27GB).

However, if I force model_dtype = "bf16" in app.py, then it can run and output that 1 frame. (I believe torch 2.6.0 does support bfloat16 on MPS.)

However, if I try to do more than 1 frame, I still get that garbled RGB video.

@niw (Contributor, Author) commented on Oct 21, 2024

@YAY-3M-TA3 You're correct! I just checked recent PyTorch changes and, yes, the nightly supports bfloat16 on Sonoma and later, and autocast as well.

I am testing bfloat16 now, and it seems to be working in my test script (not app.py) with a limited duration; memory pressure stays low, about 20 to 30GB total at max, which is good!
If it looks okay, I'll update app.py and this pull request.

c14de13e-e835-4813-8883-b68bf7a6e008.mp4

A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky

@YAY-3M-TA3 commented on Oct 22, 2024

OK, I tried your changes (adding them to your app.py) and got an error. (Did you get this at all?)

File "/Pyramid-Flow-MPS/video_vae/modeling_causal_conv.py", line 137, in forward
    x = self.conv(x)

  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 725, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 720, in _conv_forward

    return F.conv3d(
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

So, I made this change in video_vae/modeling_causal_conv.py to solve it (caching the x dtype, casting to the conv weights' dtype, then casting back):

        # Cache the input dtype, cast to match the conv weights,
        # run the conv, then cast the result back.
        x_dtype = x.dtype
        x = x.to(self.conv.weight.dtype)
        x = self.conv(x)
        x = x.to(x_dtype)

While this did fix the error, I still get that weird video for videos longer than 1 frame.

@niw (Contributor, Author) commented on Oct 22, 2024

@YAY-3M-TA3 I didn't see such an error while I was testing. What if you use https://github.com/niw/Pyramid-Flow/blob/add_simple_cli_command/generate.py, which is a simplified version of the script I use for testing this implementation?
I don't think it will address your issue, though, because it's mostly the same logic as app.py (just simplified for repeated runs).
Also, I wonder which macOS/MacBook you are using? I'm on 15.0.1 (24A348) with an M1 Max SoC. It's very unlikely, but I know MPS (or CoreML) sometimes behaves differently per macOS version and/or SoC.

@YAY-3M-TA3 commented on Oct 22, 2024

@YAY-3M-TA3 I didn't see such an error while I was testing. What if you use https://github.com/niw/Pyramid-Flow/blob/add_simple_cli_command/generate.py

Yeah, I also have my own test script, based on generate.py, that simply renders a video with hardcoded values. I get the same video issue.

Also I wonder which macOS/MacBook are you using?

Here are my specs:
MacBook Pro 2023, M2, 24GB RAM
OS: Sonoma 14.7

Python: 3.10.13

Here is a cherry-picked list of modules from this conda env:

accelerate                0.30.0
diffusers                 0.30.3

ffmpy                     0.4.0
gradio                    5.0.2
gradio_client             1.4.0
huggingface-hub           0.25.2
imageio                   2.33.1
imageio-ffmpeg            0.5.1

numpy                     1.24.4

opencv-python-headless    4.10.0.84
safetensors               0.4.5
tokenizers                0.15.2

torch                     2.6.0.dev20241011
torchaudio                2.5.0.dev20241011
torchmetrics              1.4.3
torchvision               0.20.0.dev20241011
transformers              4.39.3

With this setup, I have been able to run things like Flux dev with Q8 GGUFs (both with mflux and ComfyUI).

I'm on 15.0.1 (24A348) with an M1 Max SoC. It's very unlikely, but I know MPS (or CoreML) sometimes behaves differently per macOS version and/or SoC.

Haha, I've been reluctant to upgrade my OS to 15 because I heard it was broken with torch... (I also follow this torch MPS thread: pytorch MPS issue.)

@feifeiobama told me that the causal VAE they are using can only be conditioned for 1-, 9-, and 17-frame video generation. I tried each of these frame values, and also 8 and 16. All of these frame counts result in this color-warped video.

(I also set tile_sample_min_size=64 to try to reduce memory. I noticed that your test video had no tiling artifacts... what tile_sample_min_size are you using, 256? Or is your save_memory = false?)

5478316a-95ca-44c1-b3b0-47b61448745a_text_to_video_sample.mp4

@niw (Contributor, Author) commented on Oct 22, 2024

Interesting... I may want to try macOS 14 and see if I can repro (I need to find someone nearby who has such a machine; a VM is likely not an option for GPU work). It does sound related.

I know the M2 SoC has some unexpected behavior with specific math graphs on CoreML, but that should be unrelated. I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

@niw (Contributor, Author) commented on Oct 22, 2024

Also, I noticed I am using a slightly newer version of the nightly build, but even when I downgraded to 2.6.0.dev20241011, I couldn't repro the problem.

torch 2.6.0.dev20241015
torchvision 0.20.0.dev20241015

@YAY-3M-TA3 commented

I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

On your small test video, how many frames did you use? 8? Are you able to create a video with only 2 frames? (I can at least use a 2-frame video as a confirmed positive case, which will make the VAE tracing faster...)

@niw (Contributor, Author) commented on Oct 22, 2024

@YAY-3M-TA3 I am using duration=2 for testing.

And... with help from @kagemiku, we've identified that macOS 14 causes the issue, while macOS 15 seems okay.

The next step is understanding why, but at least I think that is the cause.

@YAY-3M-TA3 commented

I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

Looking in modeling_causal_vae.py, in def tiled_decode, down at line 517:

dec = torch.cat(result_rows, dim=3)

For a 1-frame video, the decode tensor values look like this before image processing:

tensor([[[[[-0.2451, -0.3242, -0.3574,  ..., -0.9727, -0.9727, -0.8555],
           [-0.3359, -0.4043, -0.3984,  ..., -1.0000, -1.0156, -0.9492],  ...

However, on a 2-frame video, the decode tensor values look like this:

tensor([[[[[-8.6892e+24, -1.9796e+25, -3.0676e+25,  ...,  2.9014e+26,
             2.4179e+26,  4.4730e+25],
           [-1.8814e+25, -4.2010e+25, -6.3771e+25,  ...,  8.6076e+26,
             6.8183e+26,  2.1156e+26], ...

I am assuming the value ranges should look more like those in the 1-frame tensor...
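
(A quick sanity check while tracing, as a hedged sketch assuming dec is the decoded tensor at that point: healthy values sit roughly in [-1, 1], while the broken ones blow up to around 1e25 or more.)

    # Print the decoded tensor's value range before image post-processing.
    print(dec.min().item(), dec.max().item())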

@niw (Contributor, Author) commented on Oct 23, 2024

Okay, I've identified the problem. It's kind of a PyTorch bug, or a mismatch between PyTorch's expectations and MPS behavior, and this mismatch only happens prior to macOS 15, because PyTorch 2.5.0 on macOS 15 uses native strides.

I've updated this pull request with the fix, but I can't test on macOS 14.
@YAY-3M-TA3 if you have time, try this new change!
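
For the record, the shape of the fix is to force a contiguous memory layout before the affected activation (a minimal sketch, not the exact diff in this pull request):

    # SiLU on MPS can produce random values, 0.0, or NaN for
    # non-contiguous inputs prior to macOS 15, so make the tensors
    # contiguous first (a no-op when they already are).
    sample = sample.contiguous()
    hidden_states = hidden_states.contiguous()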

@YAY-3M-TA3 commented on Oct 23, 2024
@YAY-3M-TA3 if you have time, try this new change!

Yes, I'm very happy to help you! You have done a fantastic job! So far I've just tested 2 frames, and it worked!

baf388f1-c4b6-4134-8bac-aac67129df2b_text_to_video_sample.mp4

I'm now going to do 9 frames and 16 frames.

FYI: It also still works with torch nightly (2.6.0)...

I will update you in a couple of hours as I finish the other video renders!

@YAY-3M-TA3 commented

@niw

OK! All confirmed: it's working on Sonoma 14.7 with torch nightly (2.6.0).
Mac M2 with 24GB
Python 3.10.13

Pyramid 384 model
9 frames (~45 minutes to render)

cdf94cae-f409-459d-b112-75c12479e3eb_text_to_video_sample-9frames.mp4

16 frames (~85 minutes to render)

9efcd4bd-477c-4aaf-9169-8caaec57aaf5_text_to_video_sample-16frames.mp4

Considering that no modern video diffusion model worked on Macs until now, and that we can now even render a 16-frame 640x384 video with less than 24GB, this is quite a milestone!

I never would have guessed that
sample = sample.contiguous() and
hidden_states = hidden_states.contiguous() would be the key to the solution. Great work!

@niw (Contributor, Author) commented on Oct 23, 2024

@YAY-3M-TA3 Thanks for the confirmation! I'm glad it solved the problem.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Oct 30, 2024
Similar to #123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, prior to macOS 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
Pull Request resolved: #139006
Approved by: https://github.com/malfet
@niw force-pushed the use_mps_on_apple_silicon branch 2 times, most recently from 158494c to 70e5e1a, on October 31, 2024 at 07:41
@feifeiobama (Collaborator) commented

Thanks for supporting miniFLUX @niw

@niw (Contributor, Author) commented on Oct 31, 2024

I'm still testing, but it seems to be working as expected. I think the patch works well on Apple Macs, so if you can verify that it doesn't break in a CUDA environment (and if you are comfortable with the change, of course!), feel free to merge it into main.

@niw (Contributor, Author) commented on Oct 31, 2024

The output seems better than the SD3 ones! It's really impressive that such videos can be generated quickly and locally on a laptop.

a old man standing in front of a wall clock at train station with crowded peopel.

a289615a-348f-4683-9add-fdb933e93cb8_text_to_video_sample.mp4

@ahaubenstock commented

Thank you all for the help. Do you have any idea why the memory would be fine during generation, and then, as soon as it hits 100% (all frames generated), suddenly spike to about 4x what it was during generation?

@feifeiobama (Collaborator) commented

Thank you all for the help. Do you have any idea why the memory would be fine during generation, and then, as soon as it hits 100% (all frames generated), suddenly spike to about 4x what it was during generation?

This is likely due to VAE decoding, see #5 (comment).

@ahaubenstock commented

@feifeiobama Thanks for the reply. I tried reducing the tiling to a very low value of 32, and I am still getting the following out-of-memory error:

MPS backend out of memory (MPS allocated: 15.33 GB, other allocations: 45.87 GB, max allowed: 61.20 GB). Tried to allocate 256 bytes on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

I am generating only 3 frames with both guidances at 1, and this is on an M3 Max with 48GB RAM, Sonoma 14.6.1. It just seems like the memory usage should be much, much lower. I must be misunderstanding something.
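
(For reference, the escape hatch the error message suggests can be applied like the sketch below. Note that it disables the allocator's upper limit entirely, so use it with care.)

    import os
    # Must be set before the MPS allocator initializes,
    # i.e. before torch is imported.
    os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
    import torch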

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Similar to pytorch#123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, prior to macOS 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
Pull Request resolved: pytorch#139006
Approved by: https://github.com/malfet
@cocktailpeanut (Contributor) commented

Hi, any plans to merge this in so Mac users can also use it? Or is there a reason it's not merged yet?

@cocktailpeanut (Contributor) commented

@niw I've tested this and confirm it's working. Could you merge the upstream changes into your branch (at least until this gets merged in), especially the new stuff I added (the seed feature + dynamic duration setting)? #189

@niw (Contributor, Author) commented on Nov 18, 2024

@cocktailpeanut Nice change! Let me do that.

- Use pytorch 2.5.0 instead of nightly.

- FIX: activation error on MPS

  MPS can't run the SiLU activation correctly and produces randomly
  broken results if the tensor memory format is not contiguous.

  This does not happen on macOS 15 and later, which uses native
  strides, but macOS 14 is affected.
@PakanAngel commented on Nov 18, 2024

@niw I have cloned use_mps_on_apple_silicon, but after generation reaches 100% I get this error about FFmpeg not being installed (it's looking for an FFmpeg exe):

[INFO] Text-to-video generation completed.
[ERROR] Error exporting video: No ffmpeg exe could be found. Install ffmpeg on your system, or set the IMAGEIO_FFMPEG_EXE environment variable.

Any help with this issue would be appreciated.

@niw (Contributor, Author) commented on Nov 18, 2024

@PakanAngel You likely need to install ffmpeg, for example with Homebrew:

$ brew install ffmpeg

@niw (Contributor, Author) commented on Nov 19, 2024

Hi, any plans to merge this in so Mac users can also use it? Or is there a reason it's not merged yet?

@cocktailpeanut I just found that pytorch>2.4.1, which is used for this branch, doesn't work with CUDA and fails to generate latents. It may not be good to merge this change into main at this moment.

@cocktailpeanut (Contributor) commented

@niw Couldn't we create a separate requirements_mac.txt file?

TBH, the current requirements.txt file doesn't even work for CUDA anyway (it installs CPU torch instead of CUDA torch if you just run pip install -r requirements.txt). But even if we ignore that, if this is the only thing blocking this branch from being merged in, I think this is a good solution.

See liveportrait as an example: https://github.com/KwaiVGI/LivePortrait/blob/main/requirements_macOS.txt

@niw (Contributor, Author) commented on Nov 19, 2024

@niw Couldn't we create a separate requirements_mac.txt file?
TBH, the current requirements.txt file doesn't even work for CUDA anyway (it installs CPU torch instead of CUDA torch if you just run pip install -r requirements.txt).

Oh, hmm. At least it worked for me (on amd64/Linux), though. Also, I addressed the PyTorch 2.5.1 changes in a previous commit, so this requirements file should work on both CUDA and MPS.
But as you mentioned, it's probably better to have a separate setup path for Mac, which would also be inference-only.
