Releases · ml-explore/mlx
v0.15.2
v0.15.1
v0.15.0
Highlights
- Fast Metal GPU FFTs (on average ~30x faster than CPU)
- `mx.distributed` with `all_sum` and `all_gather`
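The two collectives can be pictured without MPI at all; a plain-Python sketch (not MLX) of what `all_sum` and `all_gather` compute across a hypothetical 3-rank group:

```python
# Each inner list is one rank's local array.
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# all_sum: every rank ends up with the elementwise sum of all local arrays.
all_sum = [sum(vals) for vals in zip(*ranks)]

# all_gather: every rank ends up with all local arrays concatenated in rank order.
all_gather = [x for r in ranks for x in r]

print(all_sum)     # [9.0, 12.0]
print(all_gather)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```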
Core
- Added dlpack device via `__dlpack_device__`
- Fast GPU FFT benchmarks
- Added docs for `mx.distributed`
- Added the `mx.view` op
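A dtype view reinterprets an array's existing bytes rather than converting values. A plain-Python illustration of that idea (using `struct`, not MLX — assuming `mx.view` reinterprets the underlying bytes as dtype views typically do): the bits of float32 `1.0` read back as an int32 give its IEEE-754 pattern.

```python
import struct

# Pack 1.0 as a little-endian float32, then unpack the same 4 bytes as int32.
bits = struct.unpack("<i", struct.pack("<f", 1.0))[0]
print(hex(bits))  # 0x3f800000
```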
NN
- `softmin`, `hardshrink`, and `hardtanh` activations
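For reference, a plain-Python sketch of the standard definitions these activations implement (assuming the common default of λ = 0.5 for hardshrink and a [-1, 1] clamp for hardtanh):

```python
import math

def softmin(xs):
    # softmin(x) == softmax(-x): smaller inputs get larger weights.
    exps = [math.exp(-x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hardshrink(x, lambd=0.5):
    # Zero inside [-lambd, lambd], identity outside.
    return x if abs(x) > lambd else 0.0

def hardtanh(x, lo=-1.0, hi=1.0):
    # Clamp to [lo, hi].
    return max(lo, min(hi, x))

print(hardshrink(0.3), hardshrink(2.0))  # 0.0 2.0
print(hardtanh(3.0), hardtanh(-3.0))     # 1.0 -1.0
```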
Bugfixes
- Fix broadcast bug in bitwise ops
- Allow more buffers for JIT compilation
- Fix matvec vector stride bug
- Fix multi-block sort stride management
- Stable cumprod grad at 0
- Fixed a race condition in scan
v0.14.1
v0.14.0
Highlights
- Small-size build that JIT-compiles kernels and omits the CPU backend, resulting in a binary under 4MB
- `mx.gather_qmm`: quantized equivalent of `mx.gather_mm`, which speeds up MoE inference by ~2x
- Grouped 2D convolutions
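For context, the routing pattern these gather matmuls accelerate can be sketched in NumPy (not MLX): each input row is multiplied by the expert matrix gathered for it, instead of running a full batched matmul against every expert.

```python
import numpy as np

num_experts, d_in, d_out = 3, 4, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((num_experts, d_in, d_out))  # one weight matrix per expert
x = rng.standard_normal((5, d_in))                   # 5 tokens
idx = np.array([0, 2, 1, 0, 2])                      # expert chosen per token

# Reference semantics: row i of the output is x[i] @ W[idx[i]].
out = np.stack([x[i] @ W[idx[i]] for i in range(len(x))])
print(out.shape)  # (5, 2)
```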
Core
- `mx.conjugate`
- `mx.conv3d` and `nn.Conv3d`
- List-based indexing
- Started `mx.distributed`, which uses MPI (if installed) for communication across machines: `mx.distributed.init`, `mx.distributed.all_gather`, `mx.distributed.all_reduce_sum`
- Support conversion to and from dlpack
- `mx.linalg.cholesky` on CPU
- `mx.quantized_matmul` sped up for vector-matrix products
- `mx.trace`
- `mx.block_masked_mm` now supports floating-point masks!
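A NumPy sketch (not MLX) of what a floating-point output-block mask means in a block-masked matmul: each tile of the product is scaled by one mask entry, so a float mask can attenuate blocks rather than only keep or drop them. The block size and output-mask-only setup here are illustrative assumptions.

```python
import numpy as np

block = 2
A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
mask = np.array([[1.0, 0.0],
                 [0.5, 1.0]])  # one scale per 2x2 output block

# Expand each mask entry over its block and apply it to the full product.
out = (A @ B) * np.kron(mask, np.ones((block, block)))
print(out[0, 3], out[2, 0])  # 0.0 4.0  (top-right block zeroed, bottom-left halved)
```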
Fixes
- Error messaging in eval
- Add some missing docs
- Scatter index bug
- The extensions example now compiles and runs
- CPU copy bug with many dimensions
v0.13.1
v0.13.0
Highlights
- Block sparse matrix multiply speeds up MoEs by >2x
- Improved quantization algorithm should work well for all networks
- Improved GPU command submission, speeding up training and inference
Core
- Bitwise ops added: `mx.bitwise_[or|and|xor]`, `mx.[left|right]_shift`, and the corresponding operator overloads
- Groups added to Conv1d
- Added `mx.metal.device_info` to get better-informed memory limits
- Added resettable memory stats
- `mlx.optimizers.clip_grad_norm` and `mlx.utils.tree_reduce` added
- Added `mx.arctan2`
- Unary ops now accept array-like inputs, e.g. `mx.sqrt(2)`
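The rule behind global-norm gradient clipping can be sketched in plain Python (the real `mlx.optimizers.clip_grad_norm` operates on gradient pytrees; the nested-list version below just shows the scaling rule): if the norm of all gradients combined exceeds `max_norm`, every gradient is scaled down by the same factor.

```python
import math

def clip_grad_norm(grads, max_norm):
    # Global L2 norm over every element of every gradient.
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total

grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm = 5
clipped, norm = clip_grad_norm(grads, 1.0)
print(norm)  # 5.0 — gradients scaled by 1/5
```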
Bugfixes
- Fixed shape for slice update
- Bugfix in quantize that used slightly wrong scales/biases
- Fixed memory leak for multi-output primitives encountered with gradient checkpointing
- Fixed conversion from other frameworks for all datatypes
- Fixed index overflow for matmul with large batch size
- Fixed initialization ordering that occasionally caused segfaults
v0.12.2
v0.12.0
Highlights
- Faster quantized matmul
- Up to 40% faster QLoRA and prompt processing
Core
- `mx.synchronize` to wait for computation dispatched with `mx.async_eval`
- `mx.radians` and `mx.degrees`
- `mx.metal.clear_cache` to return to the OS the memory held by MLX as a cache for future allocations
- Changed quantization to always represent 0 exactly
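The idea behind representing 0 exactly can be sketched with a toy affine quantizer (plain Python, not MLX's actual scheme): anchor the quantization grid so that 0.0 lies exactly on it, and zero values then survive a quantize/dequantize round trip unchanged.

```python
def make_qparams(lo, hi, bits=4):
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    # Shift the grid so 0.0 falls exactly on an integer code.
    bias = round(-lo / scale) * scale if scale else 0.0
    return scale, bias

def quantize(x, scale, bias):
    return round((x + bias) / scale)

def dequantize(q, scale, bias):
    return q * scale - bias

scale, bias = make_qparams(-0.9, 0.6)
q0 = quantize(0.0, scale, bias)
print(dequantize(q0, scale, bias))  # 0.0 exactly, not a nearby value
```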
Bugfixes
- Fixed quantization of a block with all 0s that produced NaNs
- Fixed the `len` field in the buffer protocol implementation