Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent subarray failures in Version 0.5.0-dev+2913 #15271

Closed
colbec opened this issue Feb 28, 2016 · 45 comments
Closed

Intermittent subarray failures in Version 0.5.0-dev+2913 #15271

colbec opened this issue Feb 28, 2016 · 45 comments
Labels
heisenbug This bug occurs unpredictably priority This should be addressed urgently test This change adds or pertains to unit tests

Comments

@colbec
Copy link

colbec commented Feb 28, 2016

In

versioninfo()
Julia Version 0.5.0-dev+2913
Commit 7199602* (2016-02-28 08:14 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1

make testall fails with unusual output from subarray (long printing of contents of an array A) and finally:

    From worker 4:       * parallel              in  48.01 seconds, maxrss  773.58 MB
    From worker 5:       * examples              in  17.36 seconds, maxrss  867.44 MB
Exception running test subarray :
On worker 3:
LoadError: Mismatch
 in test_cartesian at /home/colin/Downloads/julia/test/subarray.jl:104
 in runtests at /home/colin/Downloads/julia/test/subarray.jl:228
 in runviews at /home/colin/Downloads/julia/test/subarray.jl:249
 [inlined code] from /home/colin/Downloads/julia/test/subarray.jl:326
 in anonymous at ./no file:4294967295
 [inlined code] from ./essentials.jl:78
 in include_string at ./loading.jl:371
 in include_from_node1 at ./loading.jl:420
 [inlined code] from ./util.jl:179
 in runtests at /home/colin/Downloads/julia/test/testdefs.jl:7
 in #16 at /home/colin/Downloads/julia/test/runtests.jl:36
 in run_work_thunk at ./multi.jl:714
 [inlined code] from ./multi.jl:1010
 in #253 at ./task.jl:59
while loading /home/colin/Downloads/julia/test/subarray.jl, in expression starting on line 308
ERROR: LoadError: Some tests exited with errors.
 in (::##11#19)() at /home/colin/Downloads/julia/test/runtests.jl:64
 in cd(::##11#19, ::ASCIIString) at ./file.jl:48
 in include(::ASCIIString) at ./boot.jl:264
 in include_from_node1(::UTF8String) at ./loading.jl:417
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
while loading /home/colin/Downloads/julia/test/runtests.jl, in expression starting on line 13
Makefile:12: recipe for target 'all' failed
make[1]: *** [all] Error 1
Makefile:573: recipe for target 'testall' failed
make: *** [testall] Error 2
@yuyichao
Copy link
Contributor

Sign, I was hoping that the array grow segfault (the only failure I can reproduce locally) can fix this issue too but apparently not....

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

Duplicate of #14991, but the name of that issue should be updated now that the error message is more accurately displayed in the backtrace

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

And @colbec might be the first person to be able to reproduce this locally. How often does it happen for you?

Anyone else have some suggestions for debugging this? Maybe if I reboot into opensuse I might be able to reproduce too, I haven't done that for a while.

@yuyichao
Copy link
Contributor

FWIW,

    for (IA, IB) in zip(eachindex(A), eachindex(B))
        if A[IA] != B[IB]
            isgood = false
            break
        end
    end

Is this guaranteed to work if A and B have different indexing mode?

@colbec
Copy link
Author

colbec commented Feb 28, 2016

@tkelman I just repeated the make testall, versioninfo identical and it ran correctly to SUCCESS. So the frequency right now is best guess 50% but sample size is very small.

@timholy
Copy link
Member

timholy commented Feb 28, 2016

Is this guaranteed to work if A and B have different indexing mode?

It should, each array is being indexed by its own "ideal" iterator. And of course, it should also work if that isn't true.

I'm puzzled about why the arrays are not being printed, given these lines. @colbec, if you'd be willing to Pkg.add("JLD"), put using JLD at the top of test/subarray.jl, and then modify those lines to this:

save(joinpath(tempdir(), "subarraytests.jld"), "A", A, "B", B)

then it should be possible to open that file after failure and inspect A and B for differences.

@timholy
Copy link
Member

timholy commented Feb 28, 2016

Oh wait, #14991 does include a complete dump, and they look the same.

I just pushed d6bc9c9 in hopes of catching this and making it easier to debug.

@colbec
Copy link
Author

colbec commented Feb 28, 2016

Just ran make testall again, test was successful. Evidently it is hard for me to repeat this error.
One of the issues is that running make testall takes quite a while on my machine, about 10-15 minutes, so if anyone has advice on how to speed up the process without excluding sections which might contribute to the error it would enable me to iterate more productively.
@timholy Got the suggestion, will try to implement as and when I can.

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

make test-subarray will run just the test in question, but I'm not sure whether that would be more or less likely to trigger the problem (per unit time) than make testall. If you've got other things you need to be using the CPU cycles for, we can try to find another system where it's possible to reproduce this locally (I've been trying for a while on various Ubuntu and Windows systems, without much luck yet). Thanks for the report!

@timholy
Copy link
Member

timholy commented Feb 28, 2016

@colbec, you should ignore the first suggestion (see #15271 (comment)), just do a git pull to update to the latest master and continue as usual.

@colbec
Copy link
Author

colbec commented Feb 28, 2016

@tkelman I tried the make test-subarray - this initiated a looooong silent process which is troublesome since it is not informative. testall has the advantage that it shows that something reasonable is happening. I will continue to try testall.
@timholy got the revision, continuing to test with the updated Julia (Version 0.5.0-dev+2916).

Two more successful runs completed, no error. This one is hard to repeat.
I wondered if this might be a case of machine rounding - referring to https://groups.google.com/forum/#!topic/julia-users/x24ePp8mOXc I ran a test (abs(44100.0 - 44099.0 * (1.0/1))) and my machine seems to behave correctly.

@timholy
Copy link
Member

timholy commented Feb 28, 2016

testall produces output only because there are multiple processes started and many of the tests complete quickly---what you're seeing is the output of the completed tests. The reason the subarray tests are "silent" for a long time is that they take a long time to complete. This is true whether you use make testall or make test-subarray.

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

@timholy d6bc9c9 didn't take long to bear fruit on a buildbot http://build.julialang.org/builders/build_ubuntu14.04-x86/builds/386/steps/shell_2/logs/stdio

From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
    From worker 3:  IB = 2
    From worker 3:  A[IA] = 877
    From worker 3:  B[IB] = 370
    From worker 3:  A = [877]
    From worker 3:  
    From worker 3:  [370]
    From worker 3:  
    From worker 3:  [1046]
    From worker 3:  B = [877]
    From worker 3:  
    From worker 3:  [370]
    From worker 3:  
    From worker 3:  [1046]

mean anything to you?

@timholy
Copy link
Member

timholy commented Feb 28, 2016

Quite surprising. Can you locally add typeof(A) typeof(B) size(A) size(B) to that same @show line, and run again?

@colbec
Copy link
Author

colbec commented Feb 28, 2016

Concretely, running make test-subarray takes 8 minutes, testall 16 minutes on this machine.
Running test-subarray produced 4 messages Master process (id 1) could not connect within 60.0 seconds. exiting. which do not seem to appear when running testall.
Two more tests, still cannot repeat.

tkelman added a commit that referenced this issue Feb 28, 2016
@colbec
Copy link
Author

colbec commented Feb 28, 2016

Latest test (Version 0.5.0-dev+2916 (2016-02-28 11:10 UTC)) produced:

make test-subarray
    JULIA test/subarray
     * subarray             fatal error on 4fatal error on 3fatal error on 2 in 513.13 seconds, maxrss  872.16 MB
    SUCCESS

@colbec
Copy link
Author

colbec commented Feb 28, 2016

Bingo. (Version 0.5.0-dev+2916 (2016-02-28 11:10 UTC))
...

make testall
cp /home/colin/Downloads/julia/usr/lib/julia/sys.so /home/colin/Downloads/julia/local.so && /home/colin/Downloads/julia/usr/bin/julia -J /home/colin/Downloads/julia/local.so -e 'true' && rm /home/colin/Downloads/julia/local.so
    JULIA test/all
    From worker 5:       * linalg/matmul         in  27.16 seconds, maxrss  257.29 MB
    From worker 5:       * linalg/schur          in  14.36 seconds, maxrss  263.92 MB
    From worker 5:       * linalg/special        in   7.12 seconds, maxrss  266.20 MB
    From worker 5:       * linalg/eigen          in  12.82 seconds, maxrss  293.04 MB
    From worker 5:       * linalg/bunchkaufman   in   6.81 seconds, maxrss  294.13 MB
    From worker 4:       * linalg/dense          in  72.47 seconds, maxrss  317.61 MB
    From worker 5:       * linalg/svd            in  14.23 seconds, maxrss  305.46 MB
    From worker 3:       * linalg/qr             in  90.25 seconds, maxrss  319.27 MB
    From worker 5:       * linalg/tridiag        in  17.44 seconds, maxrss  367.92 MB
    From worker 4:       * linalg/lapack         in  28.26 seconds, maxrss  321.31 MB
    From worker 4:       * linalg/pinv           in   5.52 seconds, maxrss  396.34 MB
    From worker 4:       * linalg/givens         in   2.86 seconds, maxrss  396.34 MB
    From worker 3:       * linalg/bidiag         in  26.56 seconds, maxrss  370.00 MB
    From worker 5:       * linalg/diagonal       in  39.21 seconds, maxrss  394.05 MB
    From worker 3:       * linalg/lu             in  25.22 seconds, maxrss  419.01 MB
    From worker 4:       * linalg/cholesky       in  35.28 seconds, maxrss  396.34 MB
    From worker 4:       * linalg/uniformscaling in   3.52 seconds, maxrss  396.34 MB
    From worker 3:       * linalg/generic        in   6.93 seconds, maxrss  425.97 MB
    From worker 5:       * linalg/symmetric      in  12.78 seconds, maxrss  423.00 MB
WARNING: Method definition f() in module JLCall14301 at /home/colin/Downloads/julia/test/core.jl:3550 overwritten at /home/colin/Downloads/julia/test/core.jl:3560.
    From worker 5:       * core                  in  17.53 seconds, maxrss  443.78 MB
    From worker 5:       * keywordargs           in   1.28 seconds, maxrss  443.78 MB
    From worker 4:       * linalg/lq             in  30.56 seconds, maxrss  435.93 MB
    From worker 4:       * printf                in   5.93 seconds, maxrss  442.76 MB
    From worker 4:       * char                  in   1.18 seconds, maxrss  465.43 MB
    From worker 3:       * linalg/arnoldi        in  37.94 seconds, maxrss  465.22 MB
    From worker 3:       * triplequote           in   0.05 seconds, maxrss  465.22 MB
    From worker 3:       * unicode               in  12.21 seconds, maxrss  497.63 MB
    From worker 4:       * string                in  20.83 seconds, maxrss  533.10 MB
    From worker 4:       * dict                  in  13.56 seconds, maxrss  533.10 MB
    From worker 4:       * hashing               in   9.00 seconds, maxrss  533.10 MB
    From worker 4:       * remote                in   0.54 seconds, maxrss  533.10 MB
    From worker 4:       * iobuffer              in   1.03 seconds, maxrss  533.10 MB
    From worker 4:       * staged                in   1.53 seconds, maxrss  533.10 MB
    From worker 5:       * numbers               in  73.84 seconds, maxrss  567.60 MB
    From worker 3:       * dates                 in  46.00 seconds, maxrss  567.09 MB
    From worker 5:       * tuple                 in   0.83 seconds, maxrss  568.38 MB
    From worker 5:       * reduce                in   4.29 seconds, maxrss  577.46 MB
    From worker 5:       * reducedim             in  16.51 seconds, maxrss  585.23 MB
    From worker 4:       * arrayops              in  44.79 seconds, maxrss  758.52 MB
    From worker 5:       * random                in  18.15 seconds, maxrss  603.73 MB
    From worker 5:       * intfuncs              in   1.62 seconds, maxrss  608.11 MB
    From worker 5:       * simdloop              in   1.97 seconds, maxrss  610.17 MB
    From worker 5:       * blas                  in   9.46 seconds, maxrss  614.54 MB
    From worker 4:       * abstractarray         in  35.51 seconds, maxrss  758.52 MB
    From worker 4:       * bitarray              in  73.81 seconds, maxrss  758.52 MB
    From worker 4:       * copy                  in   2.18 seconds, maxrss  758.52 MB
    From worker 2:       * linalg/triangular     in 413.74 seconds, maxrss  727.40 MB
    From worker 2:       * fastmath              in  13.40 seconds, maxrss  735.18 MB
    From worker 4:       * math                  in  41.92 seconds, maxrss  758.52 MB
    From worker 2:       * functional            in   3.63 seconds, maxrss  738.14 MB
    From worker 4:       * operators             in   0.73 seconds, maxrss  758.52 MB
    From worker 2:       * path                  in   4.71 seconds, maxrss  750.70 MB
    From worker 2:       * parse                 in   3.38 seconds, maxrss  751.59 MB
    From worker 2:       * loading               in   0.52 seconds, maxrss  751.85 MB
    From worker 4:       * ccall                 in   9.87 seconds, maxrss  759.36 MB
    From worker 2:       * bigint                in   4.40 seconds, maxrss  756.42 MB
    From worker 2:       * statistics            in  25.08 seconds, maxrss  794.11 MB
    From worker 2:       * spawn                       [stdio passthrough ok]
    From worker 2:   in  15.28 seconds, maxrss  804.09 MB
    From worker 2:       * backtrace             in   0.84 seconds, maxrss  805.79 MB
    From worker 2:       * priorityqueue         in   2.00 seconds, maxrss  822.34 MB
    From worker 4:       * sorting               in  55.90 seconds, maxrss  810.77 MB
    From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
    From worker 3:  IB = 14
    From worker 3:  A[IA] = 508
    From worker 3:  B[IB] = 1
    From worker 3:  A = [508
    From worker 3:   509
    From worker 3:   510
    From worker 3:   511
    From worker 3:   512
    From worker 3:   513
    From worker 3:   514
    From worker 3:   515
    From worker 3:   516
    From worker 3:   517
    From worker 3:   518
    From worker 3:   519
    From worker 3:   520]
    From worker 3:  
    From worker 3:  [1
    From worker 3:   2
    From worker 3:   3
    From worker 3:   4
    From worker 3:   5
    From worker 3:   6
    From worker 3:   7
    From worker 3:   8
    From worker 3:   9
    From worker 3:   10
    From worker 3:   11
    From worker 3:   12
    From worker 3:   13]
    From worker 3:  
    From worker 3:  [677
    From worker 3:   678
    From worker 3:   679
    From worker 3:   680
    From worker 3:   681
    From worker 3:   682
    From worker 3:   683
    From worker 3:   684
    From worker 3:   685
    From worker 3:   686
    From worker 3:   687
    From worker 3:   688
    From worker 3:   689]
    From worker 3:  B = [508
    From worker 3:   509
    From worker 3:   510
    From worker 3:   511
    From worker 3:   512
    From worker 3:   513
    From worker 3:   514
    From worker 3:   515
    From worker 3:   516
    From worker 3:   517
    From worker 3:   518
    From worker 3:   519
    From worker 3:   520]
    From worker 3:  
    From worker 3:  [1
    From worker 3:   2
    From worker 3:   3
    From worker 3:   4
    From worker 3:   5
    From worker 3:   6
    From worker 3:   7
    From worker 3:   8
    From worker 3:   9
    From worker 3:   10
    From worker 3:   11
    From worker 3:   12
    From worker 3:   13]
    From worker 3:  
    From worker 3:  [677
    From worker 3:   678
    From worker 3:   679
    From worker 3:   680
    From worker 3:   681
    From worker 3:   682
    From worker 3:   683
    From worker 3:   684
    From worker 3:   685
    From worker 3:   686
    From worker 3:   687
    From worker 3:   688
    From worker 3:   689]
    From worker 6:       * mmap                  in   5.23 seconds, maxrss  179.70 MB
    From worker 6:       * version               in   1.09 seconds, maxrss  181.36 MB
    From worker 6:       * resolve               in   4.00 seconds, maxrss  200.75 MB
    From worker 4:       * read                  in  18.89 seconds, maxrss  846.10 MB
    From worker 6:       * pollfd                in   3.59 seconds, maxrss  203.37 MB
    From worker 4:       * mpfr                  in   5.76 seconds, maxrss  849.02 MB
    From worker 4:       * complex               in   6.02 seconds, maxrss  851.07 MB
    From worker 2:       * file                  in  42.41 seconds, maxrss  839.15 MB
    From worker 2:       * floatapprox           in   1.02 seconds, maxrss  839.15 MB
    From worker 4:       * socket                in   4.09 seconds, maxrss  852.36 MB
The following 'Returned code...' warnings indicate normal behavior:
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
    From worker 6:       * broadcast             in  17.49 seconds, maxrss  249.36 MB
    From worker 4:       * reflection            in   3.28 seconds, maxrss  854.83 MB
    From worker 6:       * regex                 in   1.73 seconds, maxrss  256.60 MB
    From worker 4:       * float16               in   1.95 seconds, maxrss  855.60 MB
    From worker 6:       * combinatorics         in   0.75 seconds, maxrss  257.21 MB
    From worker 6:       * rounding              in   1.19 seconds, maxrss  261.52 MB
    From worker 4:       * sysinfo               in   2.00 seconds, maxrss  857.10 MB
    From worker 4:       * mod2pi                in   0.27 seconds, maxrss  857.10 MB
    From worker 4:       * euler                 in   1.57 seconds, maxrss  862.19 MB
    From worker 4:       * show                  in  15.11 seconds, maxrss  865.70 MB
    From worker 4:       * lineedit              in   6.77 seconds, maxrss  866.57 MB
    From worker 4:       * replcompletions       in   6.40 seconds, maxrss  866.57 MB
    From worker 6:       * ranges                in  31.09 seconds, maxrss  300.41 MB
    From worker 6:       * replutil              in   4.77 seconds, maxrss  313.66 MB
    From worker 6:       * sets                  in   2.39 seconds, maxrss  319.78 MB
    From worker 4:       * repl                  in  10.80 seconds, maxrss  866.57 MB
    From worker 4:       * goto                  in   0.11 seconds, maxrss  866.57 MB
    From worker 4:       * llvmcall              in   0.20 seconds, maxrss  866.57 MB
    From worker 4:       * grisu                 in   3.16 seconds, maxrss  866.57 MB
    From worker 6:       * test                  in   7.27 seconds, maxrss  357.61 MB
    From worker 6:       * meta                  in   0.21 seconds, maxrss  357.86 MB
    From worker 6:       * stacktraces           in   1.34 seconds, maxrss  358.12 MB
    From worker 6:       * profile               in   3.05 seconds, maxrss  358.38 MB
    From worker 5:       * sparse                in 294.15 seconds, maxrss  836.20 MB
    From worker 4:       * nullable              in   8.41 seconds, maxrss  866.57 MB
    From worker 6:       * libgit2               in   5.13 seconds, maxrss  358.38 MB
    From worker 6:       * base64                in   0.29 seconds, maxrss  358.38 MB
    From worker 5:       * docs                  in   8.30 seconds, maxrss  843.86 MB
    From worker 2:       * datafmt               in  70.04 seconds, maxrss  842.07 MB
    From worker 4:       * markdown              in   9.65 seconds, maxrss  866.57 MB
Warning: threaded loop executed in order
    From worker 5:       * functors              in   1.89 seconds, maxrss  844.88 MB
    From worker 6:       * serialize             in  10.30 seconds, maxrss  358.38 MB
    From worker 4:       * threads               in   2.66 seconds, maxrss  871.09 MB
    From worker 4:       * i18n                  in   0.04 seconds, maxrss  871.09 MB
    From worker 4:       * workspace             in   0.45 seconds, maxrss  871.09 MB
    From worker 5:       * enums                 in   2.62 seconds, maxrss  845.65 MB
    From worker 4:       * libdl                 in   0.83 seconds, maxrss  872.69 MB
    From worker 4:       * checked               in   9.13 seconds, maxrss  912.95 MB
    From worker 4:       * intset                in   1.21 seconds, maxrss  913.29 MB
    From worker 5:       * int                   in  12.34 seconds, maxrss  851.58 MB
    From worker 4:       * floatfuncs            in   3.84 seconds, maxrss  914.14 MB
    From worker 4:       * inline                in   0.67 seconds, maxrss  914.31 MB
    From worker 4:       * boundscheck           in   0.24 seconds, maxrss  914.44 MB
    From worker 6:       * cmdlineargs           in  38.98 seconds, maxrss  358.38 MB
    From worker 4:       * fft                   in  25.30 seconds, maxrss  943.68 MB
    From worker 6:       * dsp                   in  11.85 seconds, maxrss  362.28 MB
    From worker 4:       * examples              in  17.26 seconds, maxrss  964.99 MB
    From worker 5:       * parallel              in  49.93 seconds, maxrss  852.10 MB
    From worker 2:       * misc                  in 134.16 seconds, maxrss 1059.23 MB
Exception running test subarray :
On worker 3:
LoadError: Mismatch
 in test_cartesian at /home/colin/Downloads/julia/test/subarray.jl:105
 in runtests at /home/colin/Downloads/julia/test/subarray.jl:229
 in runviews at /home/colin/Downloads/julia/test/subarray.jl:250
 [inlined code] from /home/colin/Downloads/julia/test/subarray.jl:327
 in anonymous at ./no file:4294967295
 [inlined code] from ./essentials.jl:78
 in include_string at ./loading.jl:371
 in include_from_node1 at ./loading.jl:420
 [inlined code] from ./util.jl:179
 in runtests at /home/colin/Downloads/julia/test/testdefs.jl:7
 in #16 at /home/colin/Downloads/julia/test/runtests.jl:36
 in run_work_thunk at ./multi.jl:714
 [inlined code] from ./multi.jl:1010
 in #253 at ./task.jl:59
while loading /home/colin/Downloads/julia/test/subarray.jl, in expression starting on line 309
ERROR: LoadError: Some tests exited with errors.
 in (::##11#19)() at /home/colin/Downloads/julia/test/runtests.jl:64
 in cd(::##11#19, ::ASCIIString) at ./file.jl:48
 in include(::ASCIIString) at ./boot.jl:264
 in include_from_node1(::UTF8String) at ./loading.jl:417
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
while loading /home/colin/Downloads/julia/test/runtests.jl, in expression starting on line 13
Makefile:12: recipe for target 'all' failed
make[1]: *** [all] Error 1
Makefile:573: recipe for target 'testall' failed
make: *** [testall] Error 2

...

@colbec colbec changed the title make testall fails in Version 0.5.0-dev+2913 Intermittent subarray failures in Version 0.5.0-dev+2913 Feb 28, 2016
@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

@colbec thanks for the issue name change, it is generally more helpful to have which test failed in the issue title rather than just what version you're running.

Are you using the default opensuse 42.1 version of gcc, 4.8.5? Anything in your Make.user or otherwise non-default about your build?

@colbec
Copy link
Author

colbec commented Feb 28, 2016

@tkelman The gcc is as supplied from openSUSE Leap 42.1standard repositories, updated very regularly. My system does not appear to have a file "Make.user" at all, so I guess this implies that defaults are used across the board in make. Everything seems to compile correctly on this machine, so there have been no tweaks which might affect other system components that I am aware of.

@timholy
Copy link
Member

timholy commented Feb 28, 2016

Thanks for catching it again. Unfortunately it's still not enough information for me to figure out what's happening. Can I ask you to make the following change:

diff --git a/test/subarray.jl b/test/subarray.jl
index 8aed951..dca4c77 100644
--- a/test/subarray.jl
+++ b/test/subarray.jl
@@ -94,7 +94,7 @@ function test_cartesian(A, B)
     isgood = true
     for (IA, IB) in zip(eachindex(A), eachindex(B))
         if A[IA] != B[IB]
-            @show IA IB A[IA] B[IB]
+            @show IA IB A[IA] B[IB] typeof(A) typeof(B) size(A) size(B) parentindexes(A) parentindexes(B)
             isgood = false
             break
         end

@yuyichao
Copy link
Contributor

As a guide to reproduce, is it intermittent on a single build? I.e. if you rerun the test on the same build, is it reproducible? (Just want to know if there's sth broken in sysimg.)

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

Thanks @colbec, I ask because my existing source build on opensuse was using gcc 5. I made a new one using the default and hit an odd intermittent BoundsError in indexing SparseVectors that may be related.

@timholy here's a run with extra info from show https://build.julialang.org/builders/build_centos7.1-x64/builds/412/steps/shell_2/logs/stdio

@colbec
Copy link
Author

colbec commented Feb 28, 2016

@yuyichao Intermittent on a single build. Reproducible only once in about 10 tests. It could be once in 20 or more, currently running tests to determine how intermittent it is, if that number exists at all.

@tkelman
Copy link
Contributor

tkelman commented Feb 28, 2016

We should probably be trying to capture this in rr, shouldn't we.

@yuyichao
Copy link
Contributor

I've tried to capture it in rr. With a single test, I can only hit #15222. (and it passes ~200 times yesterday on that branch....). I've also tried to rr the whole test last night but it get stuck in the spawn test. I'm now trying to put it in sleep after hitting the issue so that I can try to attach a debugger...

@timholy
Copy link
Member

timholy commented Feb 28, 2016

@tkelman, that was very informative, thanks. The error is cropping up in view-of-views, but you could replicate the exact types like this:

B = reshape(1:13^3, 13, 13, 13);
idx0 = 1:13
idx1 = sub(idx0, [2,1,5])
A = sub(B, idx1, 2:5, :)
A[CartesianIndex((1,1,2))]

which, unfortunately, gives the correct answer of 184. All the types & parameters also seem OK. In other words, it's really that the last expression seems to sometimes give the wrong answer---as if it were CartesianIndex((1,1,1)) rather than (1,1,2).

@nalimilan
Copy link
Member

@timholy Do you mean that running this code should allow reproducing the bug? FWIW, it passes here hundreds of times.

@tkelman
Copy link
Contributor

tkelman commented Mar 1, 2016

Another on the buildbot, doesn't look like much new info though https://build.julialang.org/builders/build_ubuntu14.04-x64/builds/400/steps/shell_2/logs/stdio

    From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
    From worker 3:  IB = 6
    From worker 3:  A[IA] = 370
    From worker 3:  B[IB] = 708
    From worker 3:  A = [370 383 396 409 422]
    From worker 3:  
    From worker 3:  [708 721 734 747 760]
    From worker 3:  
    From worker 3:  [1046 1059 1072 1085 1098]
    From worker 3:  B = [370 383 396 409 422]
    From worker 3:  
    From worker 3:  [708 721 734 747 760]
    From worker 3:  
    From worker 3:  [1046 1059 1072 1085 1098]

@JeffBezanson JeffBezanson added priority This should be addressed urgently test This change adds or pertains to unit tests labels Mar 1, 2016
@JeffBezanson
Copy link
Member

Another output:

From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
    From worker 3:  IB = 17
    From worker 3:  A[IA] = 522
    From worker 3:  B[IB] = 15
    From worker 3:  A = [522 535 548 561
    From worker 3:   523 536 549 562
    From worker 3:   524 537 550 563
    From worker 3:   525 538 551 564]
    From worker 3:  
    From worker 3:  [15 28 41 54
    From worker 3:   16 29 42 55
    From worker 3:   17 30 43 56
    From worker 3:   18 31 44 57]
    From worker 3:  
    From worker 3:  [691 704 717 730
    From worker 3:   692 705 718 731
    From worker 3:   693 706 719 732
    From worker 3:   694 707 720 733]
    From worker 3:  B = [522 535 548 561
    From worker 3:   523 536 549 562
    From worker 3:   524 537 550 563
    From worker 3:   525 538 551 564]
    From worker 3:  
    From worker 3:  [15 28 41 54
    From worker 3:   16 29 42 55
    From worker 3:   17 30 43 56
    From worker 3:   18 31 44 57]
    From worker 3:  
    From worker 3:  [691 704 717 730
    From worker 3:   692 705 718 731
    From worker 3:   693 706 719 732
    From worker 3:   694 707 720 733]

@mbauman
Copy link
Member

mbauman commented Mar 1, 2016

My hunch: somehow the argument tuple length is getting corrupted when generating the cartesian indexing expressions in Base.cartindex_exprs. I'd bet it's only expanding the CartesianIndex to the first two indices.

This may also explain the backtrace in #15151, where it's trying to index (1,)[2] while checking bounds — likely also the argument tuple.

Not sure where or how we should add more debugging info to prove or disprove this.

@tkelman
Copy link
Contributor

tkelman commented Mar 1, 2016

We have been getting BoundsErrors intermittently, seemingly randomly, and in places they wouldn't be expected. There's one that's happened a few times in idct that I've noticed. Sounds like the most plausible theory so far.

@timholy
Copy link
Member

timholy commented Mar 1, 2016

Good guess @mbauman. It is suspicious that it always passes for CartesianIndex((i,j,1)) and then fails immediately at CartesianIndex((1,1,2)).

In hopes of catching this (I've never seen in on my own machines), I'm running 8 parallel subarray tests right now. But in case I can't trigger it, I'm trying this:

 diff --git a/test/subarray.jl b/test/subarray.jl
index 8aed951..4ca4ae5 100644
--- a/test/subarray.jl
+++ b/test/subarray.jl
@@ -94,7 +94,8 @@ function test_cartesian(A, B)
     isgood = true
     for (IA, IB) in zip(eachindex(A), eachindex(B))
         if A[IA] != B[IB]
-            @show IA IB A[IA] B[IB]
+            @show IA IB A[IA] B[IB] typeof(A) typeof(B) size(A) size(B)
+            @show @code_typed Base._getindex(Base.linearindexing(A), A, IA)
             isgood = false
             break
         end

That may not be enough info (maybe want @code_llvm), but I thought I'd start there.

@JeffBezanson
Copy link
Member

Saw what looks like a different manifestation of the same problem:

    From worker 3:       * arrayops             Test Failed
    From worker 3:    Expression: mdsum(B) == 15
    From worker 3:     Evaluated: 5.0 == 15

@yuyichao
Copy link
Contributor

yuyichao commented Mar 2, 2016

I propose closing this issue if we don't see any new failure in the next few days.

This was referenced Mar 3, 2016
@colbec
Copy link
Author

colbec commented Mar 4, 2016

@yuyichao After regular testing (well in excess of the number of tests required to pop the error previously) over the last 48 hrs I have been unable to generate the error again. Something seems to have improved the situation - if the issue was based on multiple bugs then perhaps just one is fixed, reducing the probability that I would see it? I propose to pull back on frequency of tests and refocus on ...um... work.

@tkelman
Copy link
Contributor

tkelman commented Mar 4, 2016

Thanks again @colbec, having verification that an issue can be reproduced by anyone locally is always reassuring, and it looks like at least one of the causes here has been addressed. The remaining causes (I don't think it has completely gone away on travis or appveyor yet) must be harder to reproduce locally so we'll keep searching.

@vtjnash vtjnash added the heisenbug This bug occurs unpredictably label Mar 8, 2016
@colbec
Copy link
Author

colbec commented Mar 19, 2016

From latest version 0.5.0-dev+3209 (2016-03-19 00:44 UTC) :

    From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
    From worker 3:  IB = 14
    From worker 3:  A[IA] = 170
    From worker 3:  B[IB] = 1
    From worker 3:  typeof(A) = SubArray{Int64,3,Array{Int64,3},Tuple{Array{Int64,1},Colon,SubArray{Int64,1,UnitRange{Int64},Tuple{Array{Int64,1}},false}},false}
    From worker 3:  typeof(B) = Array{Int64,3}
    From worker 3:  size(A) = (1,13,3)
    From worker 3:  size(B) = (1,13,3)
    From worker 3:  parentindexes(A) = ([1],Colon(),[2,1,5])
    From worker 3:  parentindexes(B) = (1:1,1:13,1:3)
    From worker 3:  A = [170 183 196 209 222 235 248 261 274 287 300 313 326]
    From worker 3:  
    From worker 3:  [1 14 27 40 53 66 79 92 105 118 131 144 157]
    From worker 3:  
    From worker 3:  [677 690 703 716 729 742 755 768 781 794 807 820 833]
    From worker 3:  B = [170 183 196 209 222 235 248 261 274 287 300 313 326]
    From worker 3:  
    From worker 3:  [1 14 27 40 53 66 79 92 105 118 131 144 157]
    From worker 3:  
    From worker 3:  [677 690 703 716 729 742 755 768 781 794 807 820 833]

@yuyichao
Copy link
Contributor

yuyichao commented Apr 1, 2016

After many failed attempt with reproducing in rr I went back to running the test with a sleep after the test fail. Luckily with gdb attach fixed the debugging is much easier...

It turns out that this failure is very reproducible when running with 4 process with all the test, which makes me suspect it's sensitive to which test has been run on the same process before the failing test.

BT:

#3  0xc515b261 in julia_test_cartesian_51260 (A=..., B=<optimized out>) at /home/yuyichao/projects/julia/tests/subarray/test/subarray.jl:104
#4  0xc515b41a in jlcall_test_cartesian_51260 ()
#5  0xf6142e81 in jl_call_method_internal (meth=0xdee90e80, args=0xfffc8eec, nargs=3) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#6  0xf6149065 in jl_apply_generic (args=0xfffc8eec, nargs=3) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#7  0xc26a8917 in julia_runtests_45734 (A=..., I...=0xdd1a8450) at /home/yuyichao/projects/julia/tests/subarray/test/subarray.jl:230
#8  0xf6142e81 in jl_call_method_internal (meth=0xdd11bd50, args=0xfffc9044, nargs=5) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#9  0xf6149065 in jl_apply_generic (args=0xfffc9044, nargs=5) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#10 0xc583d2f3 in julia_runviews_45462 (SB=<error reading variable: Cannot access memory at address 0x0>, indexN=..., indexNN=..., indexNNN=...)
    at /home/yuyichao/projects/julia/tests/subarray/test/subarray.jl:251
#11 0xc583d6b9 in jlcall_runviews_45462 ()
#12 0xf6142e81 in jl_call_method_internal (meth=0xdcbd1350, args=0xfffc9284, nargs=5) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#13 0xf6149065 in jl_apply_generic (args=0xfffc9284, nargs=5) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#14 0xc35cf0f1 in julia_anonymous_45362 () at /home/yuyichao/projects/julia/tests/subarray/test/subarray.jl:328
#15 julia_anonymous_45362 () at <no file>:65535
#16 0xc35cf3eb in jlcall_anonymous_45362 ()
#17 0xf6176ab5 in jl_call_method_internal (meth=0xdd9d59e0, args=0xfffc93a4, nargs=1) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#18 0xf6178927 in jl_toplevel_eval_flex (e=0xdc1f20c0, fast=1) at /home/yuyichao/projects/julia/tests/subarray/src/toplevel.c:549
#19 0xf614dc14 in jl_parse_eval_all (fname=0xdd9d8f80 "/home/yuyichao/projects/julia/tests/subarray/test/subarray.jl", len=61, 
    content=0x2cdf1d80 "# This file is a part of Julia. License is MIT: http://julialang.org/license\n\nusing Base.Test\n# import Base: ViewIndex, dimsizeexpr, rangetype, merge_indexes, first_index, stride1expr, tailsize\nusing "..., contentlen=13154) at /home/yuyichao/projects/julia/tests/subarray/src/ast.c:793
#20 0xf614de39 in jl_load_file_string (
    text=0x2cdf1d80 "# This file is a part of Julia. License is MIT: http://julialang.org/license\n\nusing Base.Test\n# import Base: ViewIndex, dimsizeexpr, rangetype, merge_indexes, first_index, stride1expr, tailsize\nusing "..., len=13154, filename=0xdd9d8f80 "/home/yuyichao/projects/julia/tests/subarray/test/subarray.jl", 
    namelen=61) at /home/yuyichao/projects/julia/tests/subarray/src/ast.c:821
#21 0xc91ba0d1 in julia_include_string_31320 (txt=..., fname=...) at int.jl:33
#22 julia_include_string_31320 (txt=..., fname=...) at loading.jl:371
#23 0xc91ba15a in jlcall_include_string_31320 ()
#24 0xf6142e81 in jl_call_method_internal (meth=0xd5d03ce0, args=0xfffc9828, nargs=3) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#25 0xf6149065 in jl_apply_generic (args=0xfffc9828, nargs=3) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#26 0xf365c674 in julia_include_from_node1_18178 (_path=...) at loading.jl:420
#27 0xf365c8bf in jlcall_include_from_node1_18178 () from /home/yuyichao/projects/julia/tests/subarray/usr/lib/julia/sys-debug.so
#28 0xf6142e81 in jl_call_method_internal (meth=0xd51e6620, args=0xfffc99d8, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#29 0xf6149065 in jl_apply_generic (args=0xfffc99d8, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#30 0xc91973ea in julia_runtests_31346 (name=...) at util.jl:179
#31 julia_runtests_31346 (name=...) at /home/yuyichao/projects/julia/tests/subarray/test/testdefs.jl:7
#32 0xc9197037 in julia_#16_31345 (t=<error reading variable: Cannot access memory at address 0xfffffdfc>)
    at /home/yuyichao/projects/julia/tests/subarray/test/runtests.jl:36
#33 0xc9197073 in jlcall_#16_31345 ()
#34 0xf6142e81 in jl_call_method_internal (meth=0xd66cdd60, args=0xfffc9b28, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#35 0xf6149065 in jl_apply_generic (args=0xfffc9b28, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#36 0xf614fd6b in jl_apply (args=0xfffc9b28, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/julia.h:1278
#37 0xf6151c56 in jl_f__apply (F=0x0, args=0xfffc9cac, nargs=2) at /home/yuyichao/projects/julia/tests/subarray/src/builtins.c:505
---Type <return> to continue, or q <return> to quit---
#38 0xc91d82ca in julia_run_work_thunk_31298 (thunk=..., print_error=<optimized out>) at multi.jl:744
#39 0xc91d80ca in julia_#251_31297 () at multi.jl:1038
#40 julia_#251_31297 () at event.jl:46
#41 0xf6142e81 in jl_call_method_internal (meth=0xd63a2700, args=0xdc4e0e30, nargs=1) at /home/yuyichao/projects/julia/tests/subarray/src/julia_internal.h:69
#42 0xf6149065 in jl_apply_generic (args=0xdc4e0e30, nargs=1) at /home/yuyichao/projects/julia/tests/subarray/src/gf.c:1872
#43 0xf6166110 in jl_apply (args=0xdc4e0e30, nargs=1) at /home/yuyichao/projects/julia/tests/subarray/src/julia.h:1278
#44 0xf6166a44 in start_task () at /home/yuyichao/projects/julia/tests/subarray/src/task.c:249
#45 0x00000000 in ?? ()

asm of the failing function
https://gist.github.com/yuyichao/4ab0a3060332ca803c1be2936f5c1b42

llvm ir after codegen
https://gist.github.com/yuyichao/daba8df0da1f6c037387303069318467

llvm ir after finalizing gc frame
https://gist.github.com/yuyichao/f6052f23db4039b7e09fb43829883794

llvm ir after optimization
https://gist.github.com/yuyichao/2a48b924d6b19f3782b53e043b1a9c13

The tests are running on the yuyichao/julia:yyc/tests/subarray branch and the $pc in the disassembly corresponds to the (ccall(:sleep))

The process is still sleeping so I can take request to check different informations.
@timholy @mbauman

@yuyichao
Copy link
Contributor

yuyichao commented Apr 1, 2016

Oh, and the test failure output

        From worker 3:       * subarray             IA = CartesianIndex{3}((1,1,2))
        From worker 3:  IB = 40
        From worker 3:  A[IA] = 2
        From worker 3:  B[IB] = 340
        From worker 3:  A = [2 15 28 41 54 67 80 93 106 119 132 145 158
        From worker 3:   1 14 27 40 53 66 79 92 105 118 131 144 157
        From worker 3:   5 18 31 44 57 70 83 96 109 122 135 148 161]
        From worker 3:
        From worker 3:  [340 353 366 379 392 405 418 431 444 457 470 483 496
        From worker 3:   339 352 365 378 391 404 417 430 443 456 469 482 495
        From worker 3:   343 356 369 382 395 408 421 434 447 460 473 486 499]
        From worker 3:
        From worker 3:  [678 691 704 717 730 743 756 769 782 795 808 821 834
        From worker 3:   677 690 703 716 729 742 755 768 781 794 807 820 833
        From worker 3:   681 694 707 720 733 746 759 772 785 798 811 824 837]
        From worker 3:  B = [2 15 28 41 54 67 80 93 106 119 132 145 158
        From worker 3:   1 14 27 40 53 66 79 92 105 118 131 144 157
        From worker 3:   5 18 31 44 57 70 83 96 109 122 135 148 161]
        From worker 3:
        From worker 3:  [340 353 366 379 392 405 418 431 444 457 470 483 496
        From worker 3:   339 352 365 378 391 404 417 430 443 456 469 482 495
        From worker 3:   343 356 369 382 395 408 421 434 447 460 473 486 499]
        From worker 3:
        From worker 3:  [678 691 704 717 730 743 756 769 782 795 808 821 834
        From worker 3:   677 690 703 716 729 742 755 768 781 794 807 820 833
        From worker 3:   681 694 707 720 733 746 759 772 785 798 811 824 837]
        From worker 3:  Test failed! Pid: 24912

@colbec
Copy link
Author

colbec commented Apr 1, 2016

@yuyichao suggests in this thread "which makes me suspect it's sensitive to which test has been run on the same process before the failing test". In #15420 I speculated that "Previously the same worker used by arrayops was running staged. If staged left something odd behind would this trip arrayops early?" It may be that I am confusing "worker" with "process" here, but #15420 includes full output in a gist, which may be of some help.

@yuyichao
Copy link
Contributor

yuyichao commented Apr 1, 2016

@vtjnash
Copy link
Member

vtjnash commented Apr 2, 2016

cool. bisected to probably a inference / staged-function / tfunc bug.

@yuyichao
Copy link
Contributor

yuyichao commented Apr 2, 2016

We narrowed down the problem to one of the generated function returning a wrong typed AST that corresponds to a similar but different type CartesianIndex{3} vs CartisianIndex{2}. @vtjnash said this is likely caused by the heuristic around

julia/base/inference.jl

Lines 872 to 896 in 3c9bf14

if limitlength && ls > lsig + 1
if !istopfunction(tm, f, :promote_typeof)
fst = sig.parameters[lsig+1]
allsame = true
# allow specializing on longer arglists if all the trailing
# arguments are the same, since there is no exponential
# blowup in this case.
for i = lsig+2:ls
if sig.parameters[i] != fst
allsame = false
break
end
end
if !allsame
sig = limit_tuple_type_n(sig, lsig+1)
end
end
end
#print(m,"\n")
(_tree, rt) = typeinf_edge(linfo, sig, m[2], sv)
rettype = tmerge(rettype, rt)
if is(rettype,Any)
break
end
end
that is incorrectly widening the type for staged functions. Possibly related or caused by the same bug with #15043 .

vtjnash added a commit that referenced this issue Apr 6, 2016
… growth

unlike regular functions, which can often still be inferred reasonably
by restricting the type signature

fix #15271
vtjnash added a commit that referenced this issue Apr 6, 2016
… growth

unlike regular functions, which can often still be inferred reasonably
by restricting the type signature

fix #15271
@vtjnash
Copy link
Member

vtjnash commented Jul 19, 2016

I think typemap & associated changes should have addressed this. Indeed, I haven't seen this recently.

@vtjnash vtjnash closed this as completed Jul 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
heisenbug This bug occurs unpredictably priority This should be addressed urgently test This change adds or pertains to unit tests
Projects
None yet
Development

No branches or pull requests

8 participants