# v0.23.0

awni released this on 14 Feb 21:39
## Highlights

- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small-batch quantized matmuls, speeding up speculative decoding on M1 and M2
  - Faster Winograd convolutions (benchmarks)
  - Up to 3x faster sort (benchmarks)
  - Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks)
- Faster unified CPU back-end with vector operations
- Double precision (`mx.float64`) support on the CPU
## Core

### Features

- Bitwise invert: `mx.bitwise_invert`
- `mx.linalg.lu`, `mx.linalg.lu_factor`, `mx.linalg.solve`, and `mx.linalg.solve_triangular`
- Support loading F8_E4M3 from safetensors
- `mx.float64` supported on the CPU
- Matmul JVPs
- Distributed launch helper: `mlx.launch`
- Support non-square QR factorization with `mx.linalg.qr`
- Support ellipsis in `mx.einsum`
- Refactor and unify the accelerate and common back-ends
### Performance

- Faster `Fence` for CPU-GPU synchronization
- Much faster `mx.put_along_axis` and `mx.take_along_axis` (benchmarks)
- Faster Winograd convolutions (benchmarks)
- Allow dynamic ops per buffer based on dispatches and memory (benchmarks)
- Up to 3x faster sort (benchmarks)
- Faster small-batch qmv (benchmarks)
- Ring distributed back-end
- Some CPU ops are much faster with the new `Simd<T, N>`
### NN

- Orthogonal initializer: `nn.init.orthogonal`
- Add dilation for 3D convolution layers
### Bug fixes

- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for async CPU work on a GPU stream
- Fix shapeless compile on Ubuntu 24
- Recompile when `shapeless` changes
- Fix RoPE fallback to not upcast
- Fix Metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow the Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading an empty list is OK when `strict = false`
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts