
early program exit with xla mode under linux #704

Closed
brettkoonce opened this issue Nov 25, 2020 · 12 comments

@brettkoonce
Contributor

Been seeing this for a while, for what it's worth, but here's a reproducible test case:

Ubuntu 18.04, CUDA 10.2, 0.12 toolchain build to match. swift-models, master checkout:

swift run LeNet-MNIST

...
Epoch 12/12
468/468 [==============================] - loss: 0.0225 - accuracy: 0.9933
79/79 [==============================] - loss: 0.0364 - accuracy: 0.9884
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

The above is using XLA; when switching to Device.defaultTFEager, things exit cleanly (e.g., the logger callback completes).
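
For reference, a minimal sketch of the device switch being described, assuming the Swift for TensorFlow / swift-models APIs of that era (the MNIST(batchSize:on:) initializer, move(to:), and the trailing LazyTensorBarrier() call are assumptions/additions here, not part of the original repro). The "pure virtual method called" abort is the C++ runtime complaining that an object was invoked mid-destruction, which fits a teardown-order problem that only the XLA path would hit:

import Datasets                   // swift-models
import ImageClassificationModels  // swift-models: LeNet
import TensorFlow

// Only the XLA backend crashed at exit; eager exited cleanly.
let device = Device.defaultXLA    // swap in Device.defaultTFEager to compare
let dataset = MNIST(batchSize: 128, on: device)
var model = LeNet()
model.move(to: device)
// ... the usual 12-epoch training loop from the LeNet-MNIST example ...
// LazyTensorBarrier() flushes pending X10 work; calling it before exit is
// general hygiene here, not a confirmed fix for this crash.
LazyTensorBarrier()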

@BradLarson
Contributor

Does this always crash for you at the start of the second epoch, or is it somewhat random? On the Jetson devices I've seen a similar crash, but it happened randomly and, oddly, it would seem to go away when I was running tegrastats. Is this a more reliable crasher?

@brettkoonce
Contributor Author

Not on the second epoch, but after the very last one (epoch 12 in the MNIST demo above). It happens every time for me with the setup above (just ran it again), as well as with the MNIST XLA demo I made for the presentation (no logger callback there, fwiw).

@texasmichelle
Member

XLA runs through cleanly for me on the tensorflow-0.12 branch with the swift-latest-cu110-ubuntu-1804 image, on both the included toolchain and swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04. I'll try running through your config as well.

@texasmichelle
Member

I can't get this running on current main in swift-models, which is not unexpected: release toolchains are designed to work with the corresponding release branch. @brettkoonce can you confirm that you're seeing this on tensorflow-0.12, or whether it occurs on main with the latest nightly?
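
(For anyone following along, pairing the release toolchain with its release branch looks roughly like this, using the branch name and run command from this thread:)

cd swift-models
git checkout tensorflow-0.12
swift run LeNet-MNIST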

@brettkoonce
Contributor Author

@texasmichelle I can't reproduce it; it's working here for me now. I was using swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz + your image. Sorry for the confusion.

@brettkoonce
Contributor Author

Still getting this error here with my XLA MNIST demo on CUDA 10.2, Ubuntu 18.04. I'm going to purge my system soon, so I'll revisit this after that.

@BradLarson
Contributor

@brettkoonce - I'm now seeing the same with the ResNet-CIFAR10 example on CUDA 11 (the LeNet-MNIST one runs fine for me). Could be CUDA-related, with failures at different points for different versions. If I can get this to reproduce locally, I'll start debugging to see what this could be related to.

@BradLarson
Contributor

In the reproducer I have here, the stack trace I'm getting seems to show the segfault coming from within cuDNN 8. If so, this might be a cuDNN thing instead of something we can impact on our end. Our CUDA 10.2 toolchains link against cuDNN 7, so that might explain why this is happening for different models. I'll keep digging to see if there's something in particular that triggers this and determine if there's a workaround or a version of cuDNN that doesn't exhibit this behavior.
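
One way to check which cuDNN a given toolchain's X10 runtime actually links against; the library path below is an assumption about the toolchain layout, so adjust it to your install:

# Path is an assumed toolchain layout; point it at your actual install.
ldd /path/to/swift-toolchain/usr/lib/swift/linux/libx10.so | grep -i cudnn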

@brettkoonce
Contributor Author

brettkoonce commented Dec 23, 2020

@BradLarson That would make sense, given the CUDA versions + my system being slightly different from what you all have. I'm going to make the jump to 11.0 + cuDNN 8, fwiw; I don't really want to spend any more time hunting 10.2 gremlins anyhow. I'll have my MNIST demo online shortly, so you should be able to test using that as well.

@brettkoonce
Contributor Author

Yeah, it seems to be gone post-upgrade; I'll chalk it up to cuDNN weirdness.

@BradLarson
Contributor

Great to hear. cuDNN can be a bit of a black box, unfortunately. We also have new toolchains that I'll be talking about on Friday that may or may not impact this (if it's pure cuDNN, probably not).
