
No reduction in VRAM usage #17
Open · radna0 opened this issue Jun 12, 2024 · 7 comments

Comments

radna0 commented Jun 12, 2024

I tried running the following code, with just the `ridger/MMfreeLM-1.3B` model initialized:

root@r4-0:~/matmulfreellm# python
>>> import os
>>> os.environ["TOKENIZERS_PARALLELISM"] = "false"
>>> import mmfreelm
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> # Change here to our open-sourced model
>>> name = "ridger/MMfreeLM-1.3B"
>>> tokenizer = AutoTokenizer.from_pretrained(name)
>>> model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

Another terminal running `watch rocm-smi` shows 68% VRAM usage, i.e. about 5.5 GB:

Every 2.0s: rocm-smi                                                                                        r4-0: Wed Jun 12 12:16:17 2024



======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  [Model : Revision]    Temp    Power     Partitions      SCLK    MCLK    Fan    Perf  PwrCap  VRAM%  GPU%
        Name (20 chars)       (Edge)  (Socket)  (Mem, Compute)                                                    
==================================================================================================================
0       [RX Vega64 : 0xc1]    30.0°C  11.0W     N/A, N/A        852Mhz  167Mhz  9.41%  auto  220.0W   68%   0%
        Vega 10 XL/XT [Radeo
==================================================================================================================
============================================== End of ROCm SMI Log ===============================================
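
For reference, the allocation can also be checked from inside PyTorch (a rough sketch; `torch.cuda.memory_allocated` only counts tensors PyTorch itself allocated, so it reads somewhat lower than rocm-smi, which also includes the HIP context):

```python
import torch
from transformers import AutoModelForCausalLM

name = "ridger/MMfreeLM-1.3B"
model = AutoModelForCausalLM.from_pretrained(name).cuda().half()

# Bytes held by PyTorch tensors on the GPU (excludes driver/context overhead,
# which rocm-smi does include).
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")

# Raw weight footprint: parameter count * 2 bytes in fp16.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B params -> {n_params * 2 / 2**30:.2f} GiB in fp16")
```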



Doesn't this contradict what was said in the paper?

[screenshot from the paper]

ridgerchu (Owner) commented Jun 12, 2024

Hi,
We have highlighted in the paper that we use BitBLAS for conducting those experiments. However, BitBLAS can be challenging to install and is only compatible with NVIDIA GPUs; in fact, we even had to recompile it during our installation process. For those reasons, we haven't merged it into this repo yet. Additionally, due to the different way FusedBitLinear stores weights, there is still some compatibility work that needs to be completed.
[screenshot of the relevant passage from the paper]

We are also working on merging MatmulFreeLLM into the BitBLAS examples. In the meantime, you can try BitNet's example to achieve a similar level of VRAM reduction, which should be comparable to our model.
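
To illustrate the storage math with a rough sketch (this is not the repo's kernel): ternary weights can be packed four to a byte, but `from_pretrained` currently loads them as fp16, so nothing shrinks until a kernel such as BitBLAS consumes the packed format directly.

```python
import torch

def pack_ternary(w: torch.Tensor) -> torch.Tensor:
    """Pack a ternary {-1, 0, +1} matrix into 2 bits per weight (4 weights per byte)."""
    codes = (w.flatten().to(torch.int16) + 1).to(torch.uint8)  # map {-1,0,1} -> {0,1,2}
    pad = (-codes.numel()) % 4
    if pad:
        codes = torch.cat([codes, codes.new_zeros(pad)])
    codes = codes.view(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

w = torch.randint(-1, 2, (4096, 4096), dtype=torch.int8)      # stand-in ternary weight
print(f"fp16 storage:   {w.numel() * 2 / 2**20:.1f} MiB")     # ~32 MiB
print(f"packed storage: {pack_ternary(w).numel() / 2**20:.1f} MiB")  # ~4 MiB
```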

radna0 (Author) commented Jun 12, 2024

I see, so we would still have to wait for the repo to fully work with BitBLAS; until then we can't reproduce the results from the paper or do training, right?

ridgerchu (Owner) commented:

For training it is okay, since we have integrated Triton into the current repo, so you can still enjoy the accelerated training; for inference, maybe not…

radna0 (Author) commented Jun 12, 2024

Wait, so you could still train a model and get faster training plus VRAM reduction? It just doesn't work for inference? I might be wrong here, but how would we evaluate the model during and after training for losses and outputs?

A little bit of context: I want to train a video generative model.

ridgerchu (Owner) commented Jun 12, 2024

[screenshot of figures (a) and (b) from the paper]

You can refer to (a) and (b); these two figures show how our fused BitLinear helps reduce memory usage and improve training speed (in the pure-MLP setting).

pranav-asthana commented:

Hi,

I tested FusedBitLinear vs. nn.Linear using a small MLP, and I don't see any training-time speedup; in fact, it is slower. Here are my model and training-time curves for a batch size of 32, run for 10 epochs.
[model definition and training-time curves]

Testing FusedBitLinear vs. vanilla BitLinear, the timing is similar to what you showed.

If you are using BitBLAS for inference, how is it matmul-free? Doesn't that just use mixed-precision multiplication?

ridgerchu (Owner) commented:

Hi,

The fused BitLinear in small MLPs will not be significantly accelerated with Triton. You can verify this by testing cases where in_features/out_features > 2048.

Regarding Matmul-free operations: As mentioned in our paper, modern GPUs actually don't benefit from Matmul-free approaches, which is why we developed our own FPGA hardware implementation. That's why we still maintain Matmul operations in our code. Our solution is compatible with both Matmul-free and Matmul approaches. While Matmul-free can provide benefits on custom hardware implementations, retaining Matmul operations often yields better performance on general-purpose GPUs. Therefore, we use the fused version to leverage GPU training speed, while utilizing custom hardware to fully benefit from Matmul-free operations.
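
A rough benchmarking sketch along those lines (the FusedBitLinear import path and constructor signature here are assumptions; adjust them to however the repo actually exposes the layer):

```python
import time
import torch
# Assumed import path -- change to wherever the repo defines FusedBitLinear.
from mmfreelm.ops.fusedbitnet import FusedBitLinear

def bench(layer_cls, dim, steps=50, batch=4096):
    layer = layer_cls(dim, dim).cuda().half()
    x = torch.randn(batch, dim, device="cuda", dtype=torch.half, requires_grad=True)
    for _ in range(5):                       # warm-up (Triton autotuning, caches)
        layer(x).sum().backward()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(steps):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - t0) / steps * 1e3  # ms per forward+backward

for dim in (1024, 2048, 4096, 8192):
    print(f"dim={dim}  nn.Linear: {bench(torch.nn.Linear, dim):.2f} ms  "
          f"FusedBitLinear: {bench(FusedBitLinear, dim):.2f} ms")
```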
