# v0.23.0

awni released this on 14 Feb 21:39
## Highlights

- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small-batch quantized matmuls, speeding up speculative decoding on M1 and M2
  - Faster Winograd convolutions (benchmarks)
  - Up to 3x faster sort (benchmarks)
  - Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks)
- Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU
## Core

### Features

- Bitwise invert: `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, and `mx.linalg.solve_triangular`
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify the accelerate and common back-ends
### Performance

- Faster `Fence` for CPU-GPU synchronization
- Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks)
- Faster Winograd convolutions (benchmarks)
- Allow dynamic ops per buffer based on dispatches and memory (benchmarks)
- Up to 3x faster sort (benchmarks)
- Faster small-batch qmv (benchmarks)
- Ring distributed back-end
- Some CPU ops are much faster with the new `Simd<T, N>`
### NN

- Orthogonal initializer: `nn.init.orthogonal`
- Add dilation for 3D convolution layers
### Bug fixes

- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for async CPU work on a GPU stream
- Fix shapeless compile on Ubuntu 24
- Recompile when `shapeless` changes
- Fix RoPE fallback to not upcast
- Fix Metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow the Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading an empty list is OK when `strict = false`
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts