Releases: ml-explore/mlx

v0.15.2

27 Jun 18:21
d6383a1

v0.15.1

14 Jun 21:13
af9079c

🚀

v0.15.0

07 Jun 03:16
cf236fc

Highlights

  • Fast Metal GPU FFTs
  • mx.distributed with all_sum and all_gather
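
A minimal sketch of the new collectives, assuming an MPI installation and a launch via something like `mpirun -np 2 python script.py`; without MPI, `init` falls back to a single-process group:

```python
import mlx.core as mx

world = mx.distributed.init()              # MPI-backed process group
x = mx.ones((4,)) * world.rank()           # per-rank data
total = mx.distributed.all_sum(x)          # element-wise sum across ranks
gathered = mx.distributed.all_gather(x)    # concatenated along the first axis
print(world.rank(), total, gathered.shape)
```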

Core

  • Added __dlpack_device__ for the dlpack protocol
  • Benchmarks for the fast GPU FFTs
  • Added docs for mx.distributed
  • Added the mx.view op (sketched below)
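
A hedged illustration of the new op: a view reinterprets an array's bytes under another dtype of the same width (the `mx.view(a, dtype)` signature here is an assumption):

```python
import mlx.core as mx

a = mx.array([1.0, 2.0], dtype=mx.float32)
b = mx.view(a, mx.uint32)  # same underlying bytes, read as uint32
print(b.dtype, b.shape)    # uint32, (2,)
```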

NN

  • softmin, hardshrink, and hardtanh activations
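
A short sketch of the new activations; the snake_case function names used here (`nn.softmin`, `nn.hard_shrink`, `nn.hard_tanh`) are assumptions based on MLX's naming convention:

```python
import mlx.core as mx
import mlx.nn as nn

x = mx.array([-2.0, -0.25, 0.0, 0.25, 2.0])
print(nn.softmin(x))      # softmax of the negated input
print(nn.hard_shrink(x))  # zeroes entries with |x| <= 0.5 by default
print(nn.hard_tanh(x))    # clamps to [-1, 1] by default
```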

Bugfixes

  • Fix broadcast bug in bitwise ops
  • Allow more buffers for JIT compilation
  • Fix matvec vector stride bug
  • Fix multi-block sort stride management
  • Stable cumprod grad at 0
  • Bug fix for a race condition in scan

v0.14.1

31 May 19:34
0798824

🚀

v0.14.0

24 May 01:33
9f9cb7a

Highlights

  • Small-size build that JIT-compiles kernels and omits the CPU backend, resulting in a binary under 4 MB
    • Series of PRs 1, 2, 3, 4, 5
  • mx.gather_qmm, a quantized equivalent of mx.gather_mm, which speeds up MoE inference by ~2x
  • Grouped 2D convolutions
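
A hedged sketch of a grouped 2D convolution; MLX convolutions take NHWC inputs, and the weight's trailing channel dimension is divided by `groups`:

```python
import mlx.core as mx

x = mx.random.normal((1, 8, 8, 4))  # (N, H, W, C_in)
w = mx.random.normal((8, 3, 3, 2))  # (C_out, kH, kW, C_in // groups)
y = mx.conv2d(x, w, stride=1, padding=1, groups=2)
print(y.shape)                      # (1, 8, 8, 8)
```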

Core

  • mx.conjugate
  • mx.conv3d and nn.Conv3d
  • List-based indexing
  • Started mx.distributed, which uses MPI (if installed) for communication across machines
    • mx.distributed.init
    • mx.distributed.all_gather
    • mx.distributed.all_reduce_sum
  • Support conversion to and from dlpack (sketched after this list)
  • mx.linalg.cholesky on CPU
  • mx.quantized_matmul sped up for vector-matrix products
  • mx.trace
  • mx.block_masked_mm now supports floating point masks!
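
As referenced above, a minimal sketch of dlpack interchange; this assumes the array is CPU-visible (unified memory) and a NumPy recent enough to expose `np.from_dlpack`:

```python
import mlx.core as mx
import numpy as np

a = mx.arange(6).reshape(2, 3)
b = np.from_dlpack(a)          # NumPy consumes MLX's __dlpack__
c = mx.array(np.ones((2, 3)))  # and MLX ingests NumPy arrays directly
print(b, c)
```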

Fixes

  • Error messaging in eval
  • Add some missing docs
  • Scatter index bug
  • The extensions example now compiles and runs
  • CPU copy bug with many dimensions

v0.13.1

17 May 03:52
6a9b584

🚀

v0.13.0

10 May 01:21
8bd6bfa

Highlights

  • Block sparse matrix multiply speeds up MoEs by >2x
  • Improved quantization algorithm should work well for all networks
  • Improved GPU command submission speeds up training and inference

Core

  • Bitwise ops added:
    • mx.bitwise_[or|and|xor], mx.[left|right]_shift, operator overloads
  • Groups added to Conv1d
  • Added mx.metal.device_info for better-informed memory limits
  • Added resettable memory stats
  • mlx.optimizers.clip_grad_norm and mlx.utils.tree_reduce added
  • Add mx.arctan2
  • Unary ops now accept array-like inputs, e.g. mx.sqrt(2); see the sketch below
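
A quick sketch combining two of the items above, the bitwise ops with their operator overloads and the scalar-friendly unary ops:

```python
import mlx.core as mx

a = mx.array([0b1010, 0b0110])
b = mx.array([0b0011, 0b0101])
print(mx.bitwise_and(a, b), a & b)  # function form and operator overload
print(mx.left_shift(a, 1), a << 1)
print(mx.sqrt(2))                   # unary ops now take plain scalars
```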

Bugfixes

  • Fixed shape for slice update
  • Bugfix in quantize that used slightly wrong scales/biases
  • Fixed memory leak for multi-output primitives encountered with gradient checkpointing
  • Fixed conversion from other frameworks for all datatypes
  • Fixed index overflow for matmul with large batch size
  • Fixed initialization ordering that occasionally caused segfaults

v0.12.2

02 May 23:38
02a9fc7

Patch bump (#1067)

  • version
  • use 0.12.2

v0.12.0

25 Apr 21:31
82463e9

Highlights

  • Faster quantized matmul

Core

  • mx.synchronize to wait for computation dispatched with mx.async_eval (sketched after this list)
  • mx.radians and mx.degrees
  • mx.metal.clear_cache to return the memory MLX holds as an allocation cache back to the OS
  • Change quantization to always represent 0 exactly (relevant issue)
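
As referenced above, a hedged sketch of how `mx.synchronize` pairs with `mx.async_eval`:

```python
import mlx.core as mx

x = mx.random.normal((1024, 1024))
y = x @ x         # builds the computation graph lazily
mx.async_eval(y)  # dispatch without blocking the Python thread
# ... overlap other host-side work here ...
mx.synchronize()  # block until the dispatched work finishes
```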

Bugfixes

  • Fixed quantization of a block with all 0s that produced NaNs
  • Fixed the len field in the buffer protocol implementation

v0.11.0

18 Apr 20:25
090ff65

Core

  • mx.block_masked_mm for block-level sparse matrix multiplication (sketched after this list)
  • Shared events for synchronization and asynchronous evaluation
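
As referenced above, a hedged sketch of `mx.block_masked_mm`; the `block_size` and `mask_out` keyword names are assumptions, with one mask entry per output block:

```python
import mlx.core as mx

a = mx.random.normal((64, 64))
b = mx.random.normal((64, 64))
mask = mx.array([[True, False], [False, True]])  # one entry per 32x32 output block
y = mx.block_masked_mm(a, b, block_size=32, mask_out=mask)
print(y.shape)  # (64, 64); masked-out blocks are zero
```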

NN

  • nn.QuantizedEmbedding layer
  • nn.quantize for quantizing modules (sketched after this list)
  • gelu_approx now uses tanh for consistency with PyTorch
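
As referenced above, a minimal sketch of `nn.quantize`, which is assumed to swap supported layers (e.g. Linear, Embedding) for their quantized counterparts in place:

```python
import mlx.nn as nn

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64))
nn.quantize(model)  # e.g. Embedding -> QuantizedEmbedding, Linear -> QuantizedLinear
print(model)
```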