
Block sparse mm #1058

Merged: 10 commits merged into main from block-sparse-mm on May 2, 2024
Conversation

jagrit06
Member

Proposed changes

Adds an operation and primitive that gather matrices on the fly before the matmul
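
A minimal sketch of the intended semantics, going only off the PR title and description (the binding name `block_sparse_mm` and its signature are an assumption): the fused op should match an explicit gather followed by a batched matmul, without materializing the gathered copies.

```python
import mlx.core as mx

# Stacks of candidate matrices.
a = mx.random.normal((4, 8, 16))   # 4 left-hand matrices of shape (8, 16)
b = mx.random.normal((6, 16, 32))  # 6 right-hand matrices of shape (16, 32)

# Which matrices to pair up, chosen on the fly.
lhs_indices = mx.array([0, 3])
rhs_indices = mx.array([2, 5])

# Unfused reference: explicit gather, then batched matmul.
# For each i this computes a[lhs_indices[i]] @ b[rhs_indices[i]].
reference = mx.matmul(a[lhs_indices], b[rhs_indices])  # shape (2, 8, 32)

# The fused op added here should produce the same result while skipping
# the intermediate gathered copies (name and signature assumed):
# out = mx.block_sparse_mm(a, b, lhs_indices, rhs_indices)
```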

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

jagrit06 marked this pull request as ready for review April 30, 2024 17:29
jagrit06 requested a review from awni April 30, 2024 17:30
jagrit06 linked an issue Apr 30, 2024 that may be closed by this pull request
Outdated review threads on python/src/ops.cpp and mlx/ops.cpp (resolved)
awni (Member) commented May 1, 2024

Very cool!! Can't wait to try this in an MOE!

awni (Member) commented May 2, 2024

Some MOE benchmarks:

Generation

python -m mlx_lm.generate --model Qwen/Qwen1.5-MoE-A2.7B-Chat  --prompt "Write a story about Einstein" --max-tokens 256 --temp 0.0
Pre: 31.285 tokens-per-sec
Post: 72.387 tokens-per-sec

LoRA

python -m mlx_lm.lora --train --iters 50 --model Qwen/Qwen1.5-MoE-A2.7B-Chat --data ../lora/data

Pre: Iter 30: Train loss 1.475, Learning Rate 1.000e-05, It/sec 1.692, Tokens/sec 291.262, Trained Tokens 5325, Peak mem 29.248 GB
Post: Iter 30: Train loss 1.466, Learning Rate 1.000e-05, It/sec 2.724, Tokens/sec 468.749, Trained Tokens 5325, Peak mem 28.051 GB
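
For context, a rough sketch (not from the PR; all names and shapes here are hypothetical) of why this helps an MoE layer: the router picks one expert per token, so each token multiplies by a gathered expert weight matrix, and the unfused pattern materializes one copy of that matrix per token.

```python
import mlx.core as mx

num_experts, d_model, d_ff = 4, 64, 256
w_experts = mx.random.normal((num_experts, d_model, d_ff))  # expert weights
tokens = mx.random.normal((8, 1, d_model))                  # 8 tokens as row vectors
routes = mx.array([0, 2, 1, 3, 0, 0, 2, 1])                 # router choices (assumed)

# Unfused pattern: w_experts[routes] copies one (d_model, d_ff) matrix per
# token before the batched matmul; the gathered matmul avoids those copies.
hidden = mx.matmul(tokens, w_experts[routes])               # shape (8, 1, d_ff)
```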

awni (Member) left a comment

🚀

jagrit06 merged commit f390957 into main May 2, 2024
3 checks passed
jagrit06 deleted the block-sparse-mm branch May 2, 2024 21:03
Successfully merging this pull request may close these issues.

[BUG] Matmul gives wrong output for large sizes