Split encoders in non-concurrent context with a max ops per encoder #1085
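To illustrate the idea in the title, here is a minimal, self-contained sketch of capping the number of operations recorded into a single command encoder so each batch of work can be committed (and start executing on the GPU) while the CPU keeps encoding. This is not MLX's actual implementation; `CommandEncoder`, `encode_stream`, and the cap value are hypothetical names for illustration only.

```python
MAX_OPS_PER_ENCODER = 4  # hypothetical cap; the real limit in MLX differs

class CommandEncoder:
    """Stand-in for a GPU command encoder that records ops until committed."""
    def __init__(self):
        self.ops = []
        self.committed = False

    def encode(self, op):
        self.ops.append(op)

    def commit(self):
        # In a real backend, committing submits the recorded work to the
        # GPU, letting it run while the CPU continues encoding.
        self.committed = True

def encode_stream(ops, max_ops=MAX_OPS_PER_ENCODER):
    """Split a stream of ops across encoders, committing each once full."""
    encoders = [CommandEncoder()]
    for op in ops:
        if len(encoders[-1].ops) >= max_ops:
            encoders[-1].commit()          # submit early instead of batching everything
            encoders.append(CommandEncoder())
        encoders[-1].encode(op)
    encoders[-1].commit()
    return encoders

encoders = encode_stream([f"op{i}" for i in range(10)])
# With a cap of 4, ten ops are split across three encoders (4 + 4 + 2).
```

The benefit is overlap: without the cap, all work is encoded into one encoder and the GPU sits idle until the single large commit at the end.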
Speeds up generation and slightly speeds up training. Some benchmarks on an M2 Ultra:
LLM generation
python -m mlx_lm.generate --model mlx-community/NeuralBeagle14-7B-4bit-mlx --prompt "Write a story about Albert Einstein" --temp 0.0 --max-tokens 256
Pre:
Prompt: 222.239 tokens-per-sec
Generation: 107.239 tokens-per-sec
Post:
Prompt: 223.145 tokens-per-sec
Generation: 108.463 tokens-per-sec
MNIST
Pre: Test accuracy 0.936, Time 0.632 (s)
Post: Test accuracy 0.927, Time 0.625 (s)
LeNet
Pre: Test accuracy 0.981, Time 2.797 (s)
Post: Test accuracy 0.975, Time 2.757 (s)
ResNet
Pre: Throughput: 6462.77 samples/second, Peak memory 1.663 (GB)
Post: Throughput: 6474.81 samples/second, Peak memory 1.663 (GB)
Transformer training:
python main.py --gpu
Pre: Iter 40: Train loss 7.864, It/sec 5.881, Peak memory 5.534 (GB)
Post: Iter 40: Train loss 7.814, It/sec 5.902, Peak memory 5.534 (GB)