Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression 0.7.2 -> 0.7.3 #127

Closed
RomeoV opened this issue Nov 25, 2023 · 4 comments
Closed

Regression 0.7.2 -> 0.7.3 #127

RomeoV opened this issue Nov 25, 2023 · 4 comments

Comments

@RomeoV
Copy link

RomeoV commented Nov 25, 2023

I made a little package that speeds up (dense) x (sparse) matmul (https://github.com/RomeoV/ThreadedDenseSparseMul.jl) and (happened to be) on Polyester 0.5 during development.

When updating (and bisecting) the Polyester version to the latest, I've noticed that Polyester 0.7.2 -> 0.7.3 broke my package functionality, such that the loop below maxes out my cpu but never finishes.

The loop is simply

import SparseArrays: SparseMatrixCSC, mul!; import SparseArrays
import Polyester: @batch

function mymul!(C::AbstractMatrix, A::AbstractMatrix, B::SparseMatrixCSC, α::Number, β::Number)
    @batch for j in axes(B, 2)
        C[:, j] .*= β
        C[:, j] .+= A *.*B[:, j])
    end
    return C
end

You can test it out with

julia> N, K, M = 1_000, 2_000, 30_000;

julia> A = rand(N, K); B = sprand(K, M, 0.05); C = similar(A, (N, M));

julia> mymul!(C, A, B, true, false)

Again, this computation doesn't finish with Polyester version >= 0.7.3 (but finishes very quickly with <0.7.3) and when we Ctrl-C we always find the cpu to currently eval

ERROR: InterruptException:
Stacktrace:
 [1] getindex(x::SparseMatrixCSC{Float64, Int64}, ::Colon, j::Int64)
   @ SparseArrays ~/.julia/juliaup/julia-1.10.0-rc1+0.x64.linux.gnu/share/julia/stdlib/v1.10/SparseArrays/src/sparsevector.jl:629

I saw that 0.7.3 changed a few dependencies around, so probably the problem got introduced somewhere there.
For now, I'll just restrict my package [compat], but maybe this points at a problem deeper down...
Any insights?

versioninfo() julia> versioninfo() Julia Version 1.10.0-rc1 Commit 5aaa9485436 (2023-11-03 07:44 UTC) Build Info: Official https://julialang.org/ release Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake) Threads: 23 on 16 virtual cores
RomeoV added a commit to RomeoV/ThreadedDenseSparseMul.jl that referenced this issue Nov 26, 2023
@chriselrod
Copy link
Member

chriselrod commented Nov 26, 2023

When updating (and bisecting) the Polyester version to the latest, I've noticed that Polyester 0.7.2 -> 0.7.3 broke my package functionality, such that the loop below maxes out my cpu but never finishes.

Not never, just very slow.

Reproducer, note the 10x smaller M than you used:

import SparseArrays: SparseMatrixCSC, mul!, sprand; import SparseArrays
import Polyester: @batch

function mymul!(C::AbstractMatrix, A::AbstractMatrix, B::SparseMatrixCSC, α::Number, β::Number)
    @batch for j in axes(B, 2)
        C[:, j] .*= β
        C[:, j] .+= A *.*B[:, j])
    end
    return C
end

N, K, M = 1_000, 2_000, 3_000;
A = rand(N, K); B = sprand(K, M, 0.05); C = similar(A, (N, M));
@time mymul!(C, A, B, true, false)

I get

julia> @time mymul!(C, A, B, true, false);
190.354090 seconds (2.71 M allocations: 308.822 MiB, 0.04% gc time, 1.81% compilation time)

vs

julia> @time mymul!(C, A, B, true, false);
  1.105294 seconds (2.61 M allocations: 195.908 MiB, 3.82% gc time, 93.76% compilation time)

using Polyester 0.7.3 vs 0.7.2, respectively.
I was running other parallel code at the same time as the 0.7.2 run, so it should be faster, but I don't feel like rerunning in a new session.

Second run:

julia> @time mymul!(C, A, B, true, false);
  0.050903 seconds (21.00 k allocations: 33.591 MiB)

vs

julia> @time mymul!(C, A, B, true, false);
108.885509 seconds (54.00 k allocations: 138.446 MiB, 0.01% gc time)

So one issue we can see is that we are allocating several times the memory,
That doesn't justify the runtime at all, but at least it suggests we're taking some other code paths that don't immediately jump out from looking at the typed code.
I also don't immediately see any type instabilities.

So, the next step is using PProf, Profile.

@chriselrod
Copy link
Member

0.7.2 calls mul!, while 0.7.3 calls generic_matvecmul!.

@chriselrod
Copy link
Member

0.7.2

julia> using StrideArraysCore

julia> PtrArray(A) isa DenseArray
true

0.7.3

julia> PtrArray(A) isa DenseArray
false

Because it isn't a DenseArray in 0.7.3, it hits the incredibly slow fallback method.

@chriselrod
Copy link
Member

Fixed JuliaSIMD/StrideArraysCore.jl@afd33f2

RomeoV referenced this issue in RomeoV/ThreadedDenseSparseMul.jl Nov 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants