v0.23.0

@awni released this 14 Feb 21:39
· 42 commits to main since this release
6cec78d

Highlights

  • 4-bit Mistral 7B generates at 131 tokens/sec out of the box on an M2 Ultra
  • More performance improvements across the board:
    • Faster small-batch quantized matmuls. Speeds up speculative decoding on M1 and M2
    • Faster Winograd convolutions, benchmarks
    • Up to 3x faster sort, benchmarks
    • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
    • Faster unified CPU back-end with vector operations
  • Double precision (mx.float64) support on the CPU

Core

Features

  • Bitwise invert mx.bitwise_invert
  • mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, mx.linalg.solve_triangular
  • Support loading F8_E4M3 from safetensors
  • mx.float64 supported on the CPU
  • Matmul JVPs
  • Distributed launch helper: mlx.launch
  • Support non-square QR factorization with mx.linalg.qr
  • Support ellipsis in mx.einsum
  • Refactor and unify accelerate and common back-ends

Performance

  • Faster Fence primitive for CPU-GPU synchronization
  • Much faster mx.put_along_axis and mx.take_along_axis, benchmarks
  • Fast Winograd convolutions, benchmarks
  • Allow dynamic ops per buffer based on dispatches and memory, benchmarks
  • Up to 3x faster sort, benchmarks
  • Faster small batch qmv, benchmarks
  • Ring distributed backend
  • Some CPU ops are much faster with the new Simd<T, N>

NN

  • Orthogonal initializer nn.init.orthogonal
  • Add dilation support for 3D convolution layers

Bug fixes

  • Limit grad recursion depth by not recursing through non-grad inputs
  • Fix synchronization bug for GPU stream async CPU work
  • Fix shapeless compile on Ubuntu 24
  • Recompile when shapeless changes
  • Fix RoPE fallback to not upcast
  • Fix Metal sort for certain cases
  • Fix a couple of slicing bugs
  • Avoid duplicate malloc with custom kernel init
  • Fix compilation error on Windows
  • Allow Python garbage collector to break cycles on custom objects
  • Fix grad with copies
  • Loading an empty list is OK when strict = false
  • Fix split vmap
  • Fix output donation for IO ops on the GPU
  • Fix creating an array with an int64 scalar
  • Catch stream errors earlier to avoid aborts