
early program exit with xla mode under linux #704

Closed
brettkoonce opened this issue Nov 25, 2020 · 12 comments

@brettkoonce
Contributor

Been seeing this for a while, for what it's worth, but here's a reproducible test case:

Ubuntu 18.04, CUDA 10.2, 0.12 toolchain build to match. swift-models, master checkout:

swift run LeNet-MNIST

...
Epoch 12/12
468/468 [==============================] - loss: 0.0225 - accuracy: 0.9933
79/79 [==============================] - loss: 0.0364 - accuracy: 0.9884
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

The above is using XLA; when switching to Device.defaultTFEager, things exit cleanly (e.g., the logger callback completes).
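
For reference, a minimal sketch of the device switch being described, assuming the Swift for TensorFlow / swift-models APIs of that era (the MNIST(batchSize:on:) initializer, move(to:), and the trailing LazyTensorBarrier() call are assumptions/additions here, not part of the original repro). The "pure virtual method called" abort is the C++ runtime complaining that an object was invoked mid-destruction, which fits a teardown-order problem that only the XLA path would hit:

import Datasets                   // swift-models
import ImageClassificationModels  // swift-models: LeNet
import TensorFlow

// Only the XLA backend crashed at exit; eager exited cleanly.
let device = Device.defaultXLA    // swap in Device.defaultTFEager to compare
let dataset = MNIST(batchSize: 128, on: device)
var model = LeNet()
model.move(to: device)
// ... the usual 12-epoch training loop from the LeNet-MNIST example ...
// LazyTensorBarrier() flushes pending X10 work; calling it before exit is
// general hygiene here, not a confirmed fix for this crash.
LazyTensorBarrier()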

@BradLarson
Contributor

Does this always crash for you at the start of the second epoch, or is it somewhat random? On the Jetson devices I've seen a similar crash, but it happened randomly and, oddly, it would seem to go away when I was running tegrastats. Is this a more reliable crasher?

@brettkoonce
Contributor Author

Not on the second epoch, but after the very last one (epoch 12 in the MNIST demo above). It happens every time for me with the setup above (just ran it again), as well as with the MNIST XLA demo I made for the presentation (no logger callback there, fwiw).

@texasmichelle
Member

XLA runs through cleanly for me on the tensorflow-0.12 branch with the swift-latest-cu110-ubuntu-1804 image, on both the included toolchain and swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04. I'll try running through your config as well.

@texasmichelle
Member

I can't get this running on current main in swift-models, which is not unexpected: release toolchains are designed to work with the corresponding release branch. @brettkoonce can you confirm that you're seeing this on tensorflow-0.12, or whether it occurs on main with the latest nightly?
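
(For anyone following along, pairing the release toolchain with its release branch looks roughly like this, using the branch name and run command from this thread:)

cd swift-models
git checkout tensorflow-0.12
swift run LeNet-MNIST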

@brettkoonce
Contributor Author

@texasmichelle I can't reproduce it; it's working here for me now. I was using swift-tensorflow-RELEASE-0.12-cuda11.0-cudnn8-ubuntu18.04.tar.gz + your image. Sorry for the confusion.

@brettkoonce
Contributor Author

Still getting this error here with my XLA MNIST demo on CUDA 10.2, Ubuntu 18.04. I'm going to purge my system soon, so I'll revisit this after that.

@BradLarson
Contributor

@brettkoonce - I'm now seeing the same with the ResNet-CIFAR10 example on CUDA 11 (the LeNet-MNIST one runs fine for me). Could be CUDA-related, with failures at different points for different versions. If I can get this to reproduce locally, I'll start debugging to see what this could be related to.

@BradLarson
Contributor

In the reproducer I have here, the stack trace I'm getting seems to show the segfault coming from within cuDNN 8. If so, this might be a cuDNN thing instead of something we can impact on our end. Our CUDA 10.2 toolchains link against cuDNN 7, so that might explain why this is happening for different models. I'll keep digging to see if there's something in particular that triggers this and determine if there's a workaround or a version of cuDNN that doesn't exhibit this behavior.
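
One way to check which cuDNN a given toolchain's X10 runtime actually links against; the library path below is an assumption about the toolchain layout, so adjust it to your install:

# Path is an assumed toolchain layout; point it at your actual install.
ldd /path/to/swift-toolchain/usr/lib/swift/linux/libx10.so | grep -i cudnn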

@brettkoonce
Contributor Author

brettkoonce commented Dec 23, 2020

@BradLarson That would make sense, given the CUDA versions + my system being slightly different from what you all have. I'm going to make the jump to 11.0 + cuDNN 8, fwiw; I don't really want to spend any more time hunting 10.2 gremlins anyhow. I'll have my MNIST demo online shortly, so you should be able to test using that as well.

@brettkoonce
Contributor Author

Yeah, it seems to be gone post-upgrade; I'll chalk it up to cuDNN weirdness.

@BradLarson
Contributor

Great to hear. cuDNN can be a bit of a black box, unfortunately. We also have new toolchains that I'll be talking about on Friday that may or may not impact this (if it's pure cuDNN, probably not).
