early program exit with xla mode under linux #704
Comments
Does this always crash for you on the start of the second epoch, or is it somewhat random? On the Jetson devices, I've seen a similar crash, but it happened randomly and oddly would seem to go away when I was running …
not on the second epoch, but after the very last one (12 in the mnist demo above). every time for me with the above (just did it again), as well as the mnist xla demo i made for the presentation (no logger callback there, fwiw).
XLA runs through cleanly for me on the …

I can't get this running on current …
@texasmichelle i can't reproduce, it's working here for me now. was using swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz + your image. sorry for the confusion.
still getting this error here with my xla mnist demo, cuda 10.2, ubuntu 18.04. going to purge my system soon so will revisit this after.
@brettkoonce - I'm now seeing the same with the ResNet-CIFAR10 example on CUDA 11 (the LeNet-MNIST one runs fine for me). Could be CUDA-related, with failures at different points for different versions. If I can get this to reproduce locally, I'll start debugging to see what this could be related to.
In the reproducer I have here, the stack trace I'm getting seems to show the segfault coming from within cuDNN 8. If so, this might be a cuDNN thing instead of something we can impact on our end. Our CUDA 10.2 toolchains link against cuDNN 7, so that might explain why this is happening for different models. I'll keep digging to see if there's something in particular that triggers this and determine if there's a workaround or a version of cuDNN that doesn't exhibit this behavior.
@BradLarson that would make sense with respect to cuda versions, plus my system being slightly different than what you all have. i'm going to make the jump to 11.0 + cudnn8, fwiw; don't really want to spend time hunting 10.2 gremlins anymore anyhow. will have my mnist demo online shortly so you should be able to test it using that as well.
Yeah, seems to be gone post-upgrade, will chalk it up to cuDNN weirdness.
Great to hear. cuDNN can be a bit of a black box, unfortunately. We also have new toolchains that I'll be talking about on Friday that may or may not impact this (if it's pure cuDNN, probably not).
Been seeing this for a while, for what it's worth, but here's a reproducible testcase:
Ubuntu 18.04, cuda 10.2, 0.12 build to match. swift-models, master checkout:
Above is using XLA; when switching to `Device.defaultTFEager`, things exit cleanly (e.g. the logger callback completes).
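For reference, a minimal sketch of the two backend configurations being compared (this is an illustrative example, not the actual repro from swift-models; the model and tensor shapes here are made up):

```swift
import TensorFlow

// Hypothetical sketch: the crash above occurs on the XLA (X10) device;
// the eager device exits cleanly. Flip `useXLA` to exercise each path.
let useXLA = false
let device: Device = useXLA ? Device.defaultXLA : Device.defaultTFEager

// Tiny stand-in model, placed on the chosen device.
var model = Dense<Float>(inputSize: 4, outputSize: 2)
model.move(to: device)

let input = Tensor<Float>(randomNormal: [1, 4], on: device)
let output = model(input)
print(output.shape)  // [1, 2]
```

With `Device.defaultTFEager` the process exits normally after training; with `Device.defaultXLA` it segfaults at shutdown as described above.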