Intermittent subarray failures in Version 0.5.0-dev+2913 #15271
Comments
Sigh, I was hoping that fixing the array grow segfault (the only failure I can reproduce locally) would fix this issue too, but apparently not. |
Duplicate of #14991, but the name of that issue should be updated now that the error message is more accurately displayed in the backtrace |
And @colbec might be the first person to be able to reproduce this locally. How often does it happen for you? Anyone else have some suggestions for debugging this? Maybe if I reboot into opensuse I might be able to reproduce too, I haven't done that for a while. |
FWIW,
for (IA, IB) in zip(eachindex(A), eachindex(B))
    if A[IA] != B[IB]
        isgood = false
        break
    end
end
Is this guaranteed to work if |
@tkelman I just repeated the |
It should, each array is being indexed by its own "ideal" iterator. And of course, it should also work if that isn't true. I'm puzzled about why the arrays are not being printed, given these lines. save(joinpath(tempdir(), "subarraytests.jld"), "A", A, "B", B) then it should be possible to open that file after failure and inspect |
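To make the "own ideal iterator" point concrete, here is a minimal sketch. It assumes a current Julia release, where `view` has replaced `sub`; the helper name `same_elements` is hypothetical and not part of the test suite. A non-contiguous view is walked with Cartesian indices while a plain `Array` is walked with linear indices, yet zipping the two iterators still pairs up matching elements because both traverse in column-major order.

```julia
# Hypothetical helper: compare two arrays element by element,
# each walked by its own eachindex iterator.
function same_elements(A, B)
    size(A) == size(B) || return false
    for (IA, IB) in zip(eachindex(A), eachindex(B))
        A[IA] == B[IB] || return false
    end
    return true
end

P = reshape(collect(1:13^3), 13, 13, 13)
A = view(P, [2, 1, 5], 2:5, :)   # non-contiguous view: eachindex yields CartesianIndex values
B = P[[2, 1, 5], 2:5, :]         # plain Array copy: eachindex yields linear (Int) indices

@show typeof(first(eachindex(A))) typeof(first(eachindex(B)))
@show same_elements(A, B)
```

The size check up front matters because `zip` stops at the shorter iterator, so arrays of different lengths would otherwise compare equal on their common prefix.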
Just ran make testall again, test was successful. Evidently it is hard for me to repeat this error. |
|
@colbec, you should ignore the first suggestion (see #15271 (comment)), just do a |
@tkelman I tried the Two more successful runs completed, no error. This one is hard to repeat. |
testall produces output only because there are multiple processes started and many of the tests complete quickly---what you're seeing is the output of the completed tests. The reason the subarray tests are "silent" for a long time is that they take a long time to complete. This is true whether you use |
@timholy d6bc9c9 didn't take long to bear fruit on a buildbot http://build.julialang.org/builders/build_ubuntu14.04-x86/builds/386/steps/shell_2/logs/stdio
mean anything to you? |
Quite surprising. Can you locally add |
Concretely, running make test-subarray takes 8 minutes, testall 16 minutes on this machine. |
Latest test (Version 0.5.0-dev+2916 (2016-02-28 11:10 UTC)) produced:
|
Bingo. (Version 0.5.0-dev+2916 (2016-02-28 11:10 UTC))
... |
@colbec thanks for the issue name change, it is generally more helpful to have which test failed in the issue title rather than just what version you're running. Are you using the default opensuse 42.1 version of gcc, 4.8.5? Anything in your |
@tkelman The gcc is as supplied from the openSUSE Leap 42.1 standard repositories, updated very regularly. My system does not appear to have a file "Make.user" at all, so I guess this implies that defaults are used across the board in make. Everything seems to compile correctly on this machine, so there have been no tweaks which might affect other system components that I am aware of. |
Thanks for catching it again. Unfortunately it's still not enough information for me to figure out what's happening. Can I ask you to make the following change:
diff --git a/test/subarray.jl b/test/subarray.jl
index 8aed951..dca4c77 100644
--- a/test/subarray.jl
+++ b/test/subarray.jl
@@ -94,7 +94,7 @@ function test_cartesian(A, B)
     isgood = true
     for (IA, IB) in zip(eachindex(A), eachindex(B))
         if A[IA] != B[IB]
-            @show IA IB A[IA] B[IB]
+            @show IA IB A[IA] B[IB] typeof(A) typeof(B) size(A) size(B) parentindexes(A) parentindexes(B)
             isgood = false
             break
         end |
As a guide to reproducing this: is it intermittent on a single build? I.e., if you rerun the test on the same build, is it reproducible? (I just want to know whether something is broken in sysimg.) |
Thanks @colbec, I ask because my existing source build on opensuse was using gcc 5. I made a new one using the default and hit an odd intermittent BoundsError in indexing SparseVectors that may be related. @timholy here's a run with extra info from show https://build.julialang.org/builders/build_centos7.1-x64/builds/412/steps/shell_2/logs/stdio |
@yuyichao Intermittent on a single build. Reproducible only once in about 10 tests. It could be once in 20 or more, currently running tests to determine how intermittent it is, if that number exists at all. |
We should probably be trying to capture this in rr, shouldn't we. |
I've tried to capture it in |
@tkelman, that was very informative, thanks. The error is cropping up in view-of-views, but you could replicate the exact types like this:
B = reshape(1:13^3, 13, 13, 13);
idx0 = 1:13
idx1 = sub(idx0, [2,1,5])
A = sub(B, idx1, 2:5, :)
A[CartesianIndex((1,1,2))]
which, unfortunately, gives the correct answer of 184. All the types & parameters also seem OK. In other words, it's really that the last expression seems to sometimes give the wrong answer---as if it were |
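For anyone checking the expected value by hand, the following sketch redoes the replication on a current Julia release, which is an assumption: `sub` was later replaced by `view`, so the element types will not match the 0.5-dev ones exactly. The variable `expected` is introduced here only for illustration. Since `idx1` holds 2, 1, 5 and the second index range starts at 2, `A[1, 1, 2]` maps to `B[2, 2, 2]`, whose column-major value is 2 + 13 + 169.

```julia
B = reshape(1:13^3, 13, 13, 13)
idx0 = 1:13
idx1 = view(idx0, [2, 1, 5])      # view of a range, holding 2, 1, 5
A = view(B, idx1, 2:5, :)         # view indexed by another view

# A[1, 1, 2] maps to B[idx1[1], (2:5)[1], 2] == B[2, 2, 2]
expected = 2 + (2 - 1) * 13 + (2 - 1) * 13^2
@show A[CartesianIndex((1, 1, 2))] expected
```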
@timholy Do you mean that running this code should allow reproducing the bug? FWIW, it passes here hundreds of times. |
Another on the buildbot, doesn't look like much new info though https://build.julialang.org/builders/build_ubuntu14.04-x64/builds/400/steps/shell_2/logs/stdio
|
Another output:
|
My hunch: somehow the argument tuple length is getting corrupted when generating the cartesian indexing expressions in Base.cartindex_exprs. I'd bet it's only expanding the CartesianIndex to the first two indices. This may also explain the backtrace in #15151, where it's trying to index Not sure where or how we should add more debugging info to prove or disprove this. |
We have been getting BoundsErrors intermittently, seemingly randomly, and in places they wouldn't be expected. There's one that's happened a few times in |
Good guess @mbauman. It is suspicious that it always passes for In hopes of catching this (I've never seen it on my own machines), I'm running 8 parallel subarray tests right now. But in case I can't trigger it, I'm trying this:
diff --git a/test/subarray.jl b/test/subarray.jl
index 8aed951..4ca4ae5 100644
--- a/test/subarray.jl
+++ b/test/subarray.jl
@@ -94,7 +94,8 @@ function test_cartesian(A, B)
     isgood = true
     for (IA, IB) in zip(eachindex(A), eachindex(B))
         if A[IA] != B[IB]
-            @show IA IB A[IA] B[IB]
+            @show IA IB A[IA] B[IB] typeof(A) typeof(B) size(A) size(B)
+            @show @code_typed Base._getindex(Base.linearindexing(A), A, IA)
             isgood = false
             break
         end
That may not be enough info (maybe want |
Saw what looks like a different manifestation of the same problem:
|
I propose closing this issue if we don't see any new failure in the next few days. |
@yuyichao After regular testing (well in excess of the number of tests required to pop the error previously) over the last 48 hrs I have been unable to generate the error again. Something seems to have improved the situation - if the issue was based on multiple bugs then perhaps just one is fixed, reducing the probability that I would see it? I propose to pull back on frequency of tests and refocus on ...um... work. |
Thanks again @colbec, having verification that an issue can be reproduced by anyone locally is always reassuring, and it looks like at least one of the causes here has been addressed. The remaining causes (I don't think it has completely gone away on travis or appveyor yet) must be harder to reproduce locally so we'll keep searching. |
From latest version 0.5.0-dev+3209 (2016-03-19 00:44 UTC) :
|
After many failed attempts at reproducing in It turns out that this failure is very reproducible when running all the tests with 4 processes, which makes me suspect it's sensitive to which test has been run on the same process before the failing test. BT:
asm of the failing function, llvm ir after codegen, llvm ir after finalizing gc frame, llvm ir after optimization. The tests are running on the yuyichao/julia:yyc/tests/subarray branch and the The process is still sleeping, so I can take requests to check other information. |
Oh, and the test failure output
|
@yuyichao suggests in this thread "which makes me suspect it's sensitive to which test has been run on the same process before the failing test". In #15420 I speculated that "Previously the same worker used by arrayops was running staged. If staged left something odd behind would this trip arrayops early?" It may be that I am confusing "worker" with "process" here, but #15420 includes full output in a gist, which may be of some help. |
Cool. Bisected to what is probably an inference / staged-function / tfunc bug. |
We narrowed down the problem to one of the generated functions returning a wrongly typed AST, one that corresponds to a similar but different type. Lines 872 to 896 in 3c9bf14
|
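As a rough illustration of why an AST generated for a similar but different type would produce wrong values rather than an obvious crash, here is a sketch for a current Julia release. The function `expand_getindex` is hypothetical and is not the code from Base referenced above. Its body is built from the type parameter `N`, so a body built for `N = 2` would simply never mention the third index. Fixing the dropped index at 1 in the last line is an assumption made only to show the kind of wrong answer that could result.

```julia
# Hypothetical generated function, not from Base: expands a CartesianIndex
# into scalar indices, with the expression list built from the type parameter N.
@generated function expand_getindex(A::AbstractArray, I::CartesianIndex{N}) where {N}
    idx = [:(I[$d]) for d in 1:N]
    return :(A[$(idx...)])
end

B = reshape(collect(1:13^3), 13, 13, 13)
A = view(B, [2, 1, 5], 2:5, :)
I3 = CartesianIndex((1, 1, 2))

# Body correctly specialized for N = 3: reads A[1, 1, 2].
@show expand_getindex(A, I3)

# Illustration only: what a body built for N = 2 would read if it were
# wrongly applied to I3, assuming the dropped trailing index acts as 1.
@show A[I3[1], I3[2], 1]
```

The two values differ (184 versus 15 for this array), which is consistent with a failure that shows up as a mismatched element in `test_cartesian` rather than as an error.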
… growth unlike regular functions, which can often still be inferred reasonably by restricting the type signature fix #15271
I think typemap & associated changes should have addressed this. Indeed, I haven't seen this recently. |
In
make testall fails with unusual output from subarray (long printing of contents of an array A) and finally: