Use MPS backend on Apple Silicon devices if it's available. (Updated) #113

Open · wants to merge 2 commits into main from use_mps_on_apple_silicon
Conversation

@niw (Contributor) commented on Oct 16, 2024

This is a slightly updated version of #108. Since #108 was accidentally merged and then reverted, and I can no longer update it with new changes, this pull request was newly created.

MacBook users who want to try it can follow the steps below. It needs plenty of memory, probably about 32GB.

  1. Clone this repository (jy0205/Pyramid-Flow) and apply the changes from this pull request.
  2. Install Python 3.10, for example with Homebrew: brew install [email protected]
  3. Create a virtualenv and install the dependencies: python3.10 -m venv .venv && .venv/bin/pip3 install -r requirements.txt
  4. Install the extra dependency gradio for app.py: .venv/bin/pip3 install gradio
  5. Run .venv/bin/python3 app.py, then open http://127.0.0.1:7860/.

To generate a video, try the minimum settings first. It takes a long time on a MacBook anyway (about 10 minutes for a 3-second video, for example; still remarkable, though!)

  • Model resolution: 384p
  • Duration: 2 or 3

Problems

Inference runs faster using the MPS backend on Apple Silicon devices, but it is not enabled by default and requires some modification to the code, which currently only checks for CUDA availability.

Solution

Use the MPS backend if it's available. A rough sketch of the selection logic follows the list below.

  • Use a compatible dtype.
  • Enable CPU offloading only if the CUDA backend is used (it is disabled for the MPS backend, where it is unnecessary because of unified memory).
  • Use the latest pytorch, which seems to be required to address a VAE issue.
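
For illustration, the selection is roughly like this (a simplified sketch, not the exact diff in this pull request; the variable names are mine):

    import torch

    # Prefer CUDA, then MPS, then fall back to CPU.
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"

    # Pick a dtype the backend handles well; bfloat16 support on MPS
    # depends on the PyTorch/macOS version, so float32 is the safe default.
    model_dtype = torch.float32 if device == "mps" else torch.bfloat16

    # CPU offloading only makes sense on CUDA; on MPS, unified memory
    # makes it unnecessary.
    cpu_offloading = device == "cuda"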

NOTE: This patch does not take training into account at all; it is only for inference. I tried to make it work as well as CUDA, but it requires, for example, a dependency update that may not be preferred, so I don't expect this pull request to be mergeable into main for now. I'm posting it anyway because I think it's worth having for those who want to try inference easily on, say, their MacBook Pro.

@rahulbudhrani01 commented on Oct 20, 2024

Hey, I tried applying your patch locally, but I keep getting this error when trying to generate:
[ERROR] Error during image-to-video generation: Conv3D is not supported on MPS

Can you please help resolve this? I'm running it on an M1 Mac with 32GB.
Torch version -> 2.1.2
Torchvision version -> 0.16.2
torch.has_mps -> true

@niw (Contributor, Author) commented on Oct 20, 2024

@rahulbudhrani01 You need to update torch and torchvision to a nightly build; even the latest released version cannot work properly, especially the VAE encoder.
See this change to requirements.txt.

@YAY-3M-TA3 commented

Hello, I've been working on getting this to work on the MPS backend too. My system is an M2 with 24GB RAM. I set up Python 3.10.13 with nightly PyTorch (2.6.0).

I'm using the 384 model.

I also made modifications throughout the code base, such as changing "CUDA" device references to "MPS", switching float64 to bfloat16 where it errored (rope), and adding an extra with torch.autocast("mps", dtype=torch.bfloat16): around the text-encoder block in the pipeline (otherwise prompt_embeds would be NaN). With that, I can generate a 1-frame video:
https://github.com/user-attachments/assets/210ed9d3-24f2-474e-b2bc-5004ecb632e9
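
For reference, the text-encoder change described above amounts to something like this (an illustrative sketch; text_encoder and input_ids stand in for the pipeline's actual names):

    # Run only the text encoder under autocast; without this,
    # prompt_embeds came back as NaN on this setup.
    with torch.autocast("mps", dtype=torch.bfloat16):
        prompt_embeds = text_encoder(input_ids)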

But if I try to generate more than 1 frame, I get this...

32fecb56-f0d6-4fba-9022-f794da28b1df_text_to_video_sample.mp4

I'm wondering if you might have an idea what the problem is. This looks like what happens when you try to create a Pyramid-Flow video with a resolution other than 640x384.

@niw (Contributor, Author) commented on Oct 21, 2024

@YAY-3M-TA3 See this pull request for more details, but here is what you need to do, inline.

switching float64 to bfloat16 where it errored (rope)
adding an extra with torch.autocast("mps", dtype=torch.bfloat16): around the text-encoder block in the pipeline (otherwise prompt_embeds would be NaN)

MPS doesn't support bfloat16, I believe, and torch autocast may not be working on MPS either, so you probably just need to use float32 and disable autocast.
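
In code, that suggestion is roughly this (a sketch under my assumption above; device, text_encoder, and input_ids are illustrative names):

    # Fall back to float32 and drop the autocast context entirely on MPS.
    dtype = torch.float32 if device == "mps" else torch.bfloat16
    prompt_embeds = text_encoder(input_ids)  # no autocast wrapper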

If I try to generate more than 1 frame, I get this...
I set up Python 3.10.13 with nightly PyTorch (2.6.0).

In my case, I also saw this colorful output produced in the VAE encoding step, and it was caused by an old PyTorch. When I used the latest nightly PyTorch, 2.6.0.dev, it worked. I haven't really traced which part of VAE encoding causes the issue, though.

So... I recommend verifying which version of PyTorch is really being used, and also trying float32 instead.

@YAY-3M-TA3 commented on Oct 21, 2024

@YAY-3M-TA3 See this pull request for more details, but here is what you need to do, inline.

Yeah, I grabbed your code and ran it. The model initializes to float32, but on my small 24GB M2 it OOMs (it looks like it needs at least 27GB).

However, if I force model_dtype = "bf16" in app.py, then it can run and output that 1 frame. (I believe torch 2.6.0 does support bfloat16 on MPS.)

However, if I try to do more than 1 frame, I still get that garbled RGB video.

@niw (Contributor, Author) commented on Oct 21, 2024

@YAY-3M-TA3 You're correct! I just checked recent PyTorch changes and, yes, the nightly supports bfloat16 on Sonoma and later, and autocast as well.

I am testing bfloat16 now, and it seems to be working in my test script (not app.py) with a limited duration; memory pressure stays low, about 20 to 30GB total at max, which is good!
If it looks okay, I'll update app.py and this pull request.

c14de13e-e835-4813-8883-b68bf7a6e008.mp4

A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky

@YAY-3M-TA3 commented on Oct 22, 2024

OK, I tried your changes (adding them to your app.py) and got an error. (Did you get this at all?)

File "/Pyramid-Flow-MPS/video_vae/modeling_causal_conv.py", line 137, in forward
    x = self.conv(x)

  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 725, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/miniforge3/envs/Pyramid-Flow-MPS/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 720, in _conv_forward

    return F.conv3d(
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

So, I made this change in video_vae/modeling_causal_conv.py to solve it (caching the x dtype, casting to the conv weights' dtype, then casting back):

        # Cache the input dtype, cast to match the conv weights,
        # run the conv, then cast the result back.
        x_dtype = x.dtype
        x = x.to(self.conv.weight.dtype)
        x = self.conv(x)
        x = x.to(x_dtype)

While this did fix the error, I still get that weird video for videos longer than 1 frame.

@niw (Contributor, Author) commented on Oct 22, 2024

@YAY-3M-TA3 I didn't see such an error while I was testing. What if you use https://github.com/niw/Pyramid-Flow/blob/add_simple_cli_command/generate.py, which is a simplified version of the script I use for testing this implementation?
I don't think it will address your issue, though, because it's mostly the same logic as app.py (just simplified for repeated runs).
Also, I wonder which macOS/MacBook you are using? I'm on 15.0.1 (24A348) with an M1 Max SoC. It's very unlikely, but I know MPS (or CoreML) sometimes behaves differently per macOS version and/or SoC.

@YAY-3M-TA3 commented on Oct 22, 2024

@YAY-3M-TA3 I didn't see such an error while I was testing. What if you use https://github.com/niw/Pyramid-Flow/blob/add_simple_cli_command/generate.py

Yeah, I also have my own test script, based on generate.py, that simply renders a video with hardcoded values. I get the same video issue.

Also I wonder which macOS/MacBook are you using?

Here are my specs:
MacBook Pro 2023, M2, 24GB RAM
OS: Sonoma 14.7

Python: 3.10.13

Here is a cherry-picked list of modules from this conda env:

accelerate                0.30.0
diffusers                 0.30.3

ffmpy                     0.4.0
gradio                    5.0.2
gradio_client             1.4.0
huggingface-hub           0.25.2
imageio                   2.33.1
imageio-ffmpeg            0.5.1

numpy                     1.24.4

opencv-python-headless    4.10.0.84
safetensors               0.4.5
tokenizers                0.15.2

torch                     2.6.0.dev20241011
torchaudio                2.5.0.dev20241011
torchmetrics              1.4.3
torchvision               0.20.0.dev20241011
transformers              4.39.3

With this setup, I have been able to run things like Flux dev with Q8 GGUFs (both with mflux and ComfyUI).

I'm on 15.0.1 (24A348) with an M1 Max SoC. It's very unlikely, but I know MPS (or CoreML) sometimes behaves differently per macOS version and/or SoC.

Haha, I've been reluctant to upgrade my OS to 15 because I heard it was broken with torch... (I also follow this torch MPS thread: pytorch MPS issue.)

@feifeiobama told me that the causal VAE they are using can only be conditioned for 1-, 9-, and 17-frame video generation. I tried each of these frame values, and also 8 and 16. All of these frame counts result in this color-warped video.

(I also set tile_sample_min_size=64 to try to reduce memory. I noticed that your test video had no tiling artifacts... what tile_sample_min_size are you using, 256? Or is your save_memory = false?)

5478316a-95ca-44c1-b3b0-47b61448745a_text_to_video_sample.mp4

@niw (Contributor, Author) commented on Oct 22, 2024

Interesting... I may want to try macOS 14 and see if I can repro (I need to find someone nearby who has such a machine; a VM is likely not an option for GPU work). It does sound related.

I know the M2 SoC has some unexpected behavior with specific math graphs on CoreML, but that should be unrelated. I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

@niw (Contributor, Author) commented on Oct 22, 2024

Also, I noticed I am using a slightly newer version of the nightly build, but even when I downgraded to 2.6.0.dev20241011, I couldn't repro the problem.

torch 2.6.0.dev20241015
torchvision 0.20.0.dev20241015

@YAY-3M-TA3 commented

I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

On your small test video, how many frames did you use? 8? Are you able to create a video with only 2 frames? (I can at least use a 2-frame video as a confirmed positive case, which will make the VAE tracing faster...)

@niw (Contributor, Author) commented on Oct 22, 2024

@YAY-3M-TA3 I am using duration=2 for testing.

And... with help from @kagemiku, we've identified that macOS 14 causes the issue, while macOS 15 seems okay.

The next step is understanding why, but at least I think that is the cause.

@YAY-3M-TA3 commented

I may need to trace the VAE code as well as the PyTorch MPS implementation; I haven't really looked into them yet...

Looking in modeling_causal_vae.py, in def tiled_decode, down at line 517:

dec = torch.cat(result_rows, dim=3)

For a 1-frame video, the decode tensor values look like this before image processing:

tensor([[[[[-0.2451, -0.3242, -0.3574,  ..., -0.9727, -0.9727, -0.8555],
           [-0.3359, -0.4043, -0.3984,  ..., -1.0000, -1.0156, -0.9492],  ...

However, on a 2-frame video, the decode tensor values look like this:

tensor([[[[[-8.6892e+24, -1.9796e+25, -3.0676e+25,  ...,  2.9014e+26,
             2.4179e+26,  4.4730e+25],
           [-1.8814e+25, -4.2010e+25, -6.3771e+25,  ...,  8.6076e+26,
             6.8183e+26,  2.1156e+26], ...

I am assuming the value ranges should look more like those in the 1-frame tensor...
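
(A quick sanity check while tracing, as a hedged sketch assuming dec is the decoded tensor at that point: healthy values sit roughly in [-1, 1], while the broken ones blow up to around 1e25 or more.)

    # Print the decoded tensor's value range before image post-processing.
    print(dec.min().item(), dec.max().item())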

@niw (Contributor, Author) commented on Oct 23, 2024

Okay, I've identified the problem. It's kind of a PyTorch bug, or a mismatch between PyTorch's expectations and MPS behavior, and this mismatch only happens prior to macOS 15, because PyTorch 2.5.0 on macOS 15 uses native strides.

I've updated this pull request with the fix, but I can't test on macOS 14.
@YAY-3M-TA3 if you have time, try this new change!
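
For the record, the shape of the fix is to force a contiguous memory layout before the affected activation (a minimal sketch, not the exact diff in this pull request):

    # SiLU on MPS can produce random values, 0.0, or NaN for
    # non-contiguous inputs prior to macOS 15, so make the tensors
    # contiguous first (a no-op when they already are).
    sample = sample.contiguous()
    hidden_states = hidden_states.contiguous()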

@YAY-3M-TA3 commented on Oct 23, 2024
@YAY-3M-TA3 if you have time, try this new change!

Yes, I'm very happy to help you! You have done a fantastic job! So far I've just tested 2 frames, and it worked!

baf388f1-c4b6-4134-8bac-aac67129df2b_text_to_video_sample.mp4

I'm now going to do 9 frames and 16 frames.

FYI: It also still works with torch nightly (2.6.0)...

I will update you in a couple of hours as I finish the other video renders!

@YAY-3M-TA3 commented

@niw

OK! All confirmed: it's working on Sonoma 14.7 with torch nightly (2.6.0).
Mac M2 with 24GB
Python 3.10.13

Pyramid 384 model
9 frames (~45 minutes to render)

cdf94cae-f409-459d-b112-75c12479e3eb_text_to_video_sample-9frames.mp4

16 frames (~85 minutes to render)

9efcd4bd-477c-4aaf-9169-8caaec57aaf5_text_to_video_sample-16frames.mp4

Considering that no modern video diffusion model worked on Macs until now, and that we can now even render a 16-frame 640x384 video with less than 24GB, this is quite a milestone!

I never would have guessed that
sample = sample.contiguous() and
hidden_states = hidden_states.contiguous() would be the key to the solution. Great work!

@niw (Contributor, Author) commented on Oct 23, 2024

@YAY-3M-TA3 Thanks for the confirmation! I'm glad it solved the problem.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Oct 30, 2024
Similar to #123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, prior to macOS 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
Pull Request resolved: #139006
Approved by: https://github.com/malfet
@niw force-pushed the use_mps_on_apple_silicon branch 2 times, most recently from 158494c to 70e5e1a, on October 31, 2024 at 07:41
@feifeiobama (Collaborator) commented

Thanks for supporting miniFLUX @niw

@niw (Contributor, Author) commented on Oct 31, 2024

I'm still testing, but it seems to be working as expected. I think the patch works well on Apple Macs, so if you can verify that it doesn't break in a CUDA environment (and if you are comfortable with the change, of course!), feel free to merge it into main.

@niw (Contributor, Author) commented on Oct 31, 2024

The output seems better than the SD3 ones! It's really impressive that such videos can be generated quickly and locally on a laptop.

a old man standing in front of a wall clock at train station with crowded peopel.

a289615a-348f-4683-9add-fdb933e93cb8_text_to_video_sample.mp4

@ahaubenstock commented

Thank you all for the help. Do you have any idea why the memory would be fine during generation, and then, as soon as it hits 100% (all frames generated), suddenly spike to about 4x what it was during generation?

@feifeiobama (Collaborator) commented

Thank you all for the help. Do you have any idea why the memory would be fine during generation, and then, as soon as it hits 100% (all frames generated), suddenly spike to about 4x what it was during generation?

This is likely due to VAE decoding, see #5 (comment).

@ahaubenstock commented

@feifeiobama Thanks for the reply. I tried reducing the tiling to a very low value of 32, and I am still getting the following out-of-memory error:

MPS backend out of memory (MPS allocated: 15.33 GB, other allocations: 45.87 GB, max allowed: 61.20 GB). Tried to allocate 256 bytes on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

I am generating only 3 frames with both guidances at 1, and this is on an M3 Max with 48GB RAM, Sonoma 14.6.1. It just seems like the memory usage should be much, much lower. I must be misunderstanding something.
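
(For reference, the escape hatch the error message suggests can be applied like the sketch below. Note that it disables the allocator's upper limit entirely, so use it with care.)

    import os
    # Must be set before the MPS allocator initializes,
    # i.e. before torch is imported.
    os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
    import torch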

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Similar to pytorch#123049; however, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, prior to macOS 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
Pull Request resolved: pytorch#139006
Approved by: https://github.com/malfet
@cocktailpeanut (Contributor) commented

Hi, any plans to merge this in so Mac users can also use it? Or is there a reason it's not merged yet?

@cocktailpeanut (Contributor) commented

@niw I've tested this and confirm it's working. Could you merge the upstream changes into your branch (at least until this gets merged in), especially the new stuff I added (the seed feature + dynamic duration setting)? #189

@niw (Contributor, Author) commented on Nov 18, 2024

@cocktailpeanut Nice change! Let me do that.

- Use pytorch 2.5.0 instead of nightly.

- FIX: activation error on MPS

  MPS can't run the SiLU activation correctly and produces randomly
  broken results if the tensor memory format is not contiguous.

  This does not happen on macOS 15 and later, which uses native
  strides, but macOS 14 is affected.
@PakanAngel commented on Nov 18, 2024

@niw I have cloned use_mps_on_apple_silicon, but after generation reaches 100% I get this error about FFmpeg not being installed (it's looking for an FFmpeg exe):

[INFO] Text-to-video generation completed.
[ERROR] Error exporting video: No ffmpeg exe could be found. Install ffmpeg on your system, or set the IMAGEIO_FFMPEG_EXE environment variable.

Any help with this issue would be appreciated.

@niw (Contributor, Author) commented on Nov 18, 2024

@PakanAngel You likely need to install ffmpeg, for example with Homebrew:

$ brew install ffmpeg

@niw (Contributor, Author) commented on Nov 19, 2024

Hi, any plans to merge this in so Mac users can also use it? Or is there a reason it's not merged yet?

@cocktailpeanut I just found that pytorch>2.4.1, which is used for this branch, doesn't work with CUDA and fails to generate latents. It may not be good to merge this change into main at this moment.

@cocktailpeanut (Contributor) commented

@niw Couldn't we create a separate requirements_mac.txt file?

TBH, the current requirements.txt file doesn't even work for CUDA anyway (it installs CPU torch instead of CUDA torch if you just run pip install -r requirements.txt). But even if we ignore that, if this is the only thing blocking this branch from being merged in, I think this is a good solution.

See liveportrait as an example: https://github.com/KwaiVGI/LivePortrait/blob/main/requirements_macOS.txt

@niw (Contributor, Author) commented on Nov 19, 2024

@niw Couldn't we create a separate requirements_mac.txt file?
TBH, the current requirements.txt file doesn't even work for CUDA anyway (it installs CPU torch instead of CUDA torch if you just run pip install -r requirements.txt).

Oh, hmm. At least it worked for me (on amd64/Linux), though. Also, I addressed the PyTorch 2.5.1 changes in a previous commit, so this requirements file should work on both CUDA and MPS.
But as you mentioned, it's probably better to have a separate setup path for Mac, which would also be inference-only.
