This repository was archived by the owner on Jan 22, 2024. It is now read-only.

cannot Classify image #108

Closed
DrSensor opened this issue Jun 10, 2016 · 14 comments
@DrSensor

In docker registry nvidia/digits https://hub.docker.com/r/nvidia/digits

When I run it, it works normally, but it seems libdc1394 is missing:

2016-06-10 03:00:28 [9] [INFO] Starting gunicorn 17.5
libdc1394 error: Failed to initialize libdc1394
2016-06-10 03:00:28 [9] [DEBUG] Arbiter booted
2016-06-10 03:00:28 [9] [INFO] Listening at: http://0.0.0.0:34448 (9)
2016-06-10 03:00:28 [9] [INFO] Using worker: socketio.sgunicorn.GeventSocketIOWorker

After training (which works fine), I try to Classify One image, but my browser says "The connection was reset".

2016-06-10 03:01:26 [52] [INFO] Booting worker with pid: 52
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0610 03:02:23.285483 52 common.cpp:110] Cannot create Cublas handle. Cublas won't be available.
E0610 03:02:23.288187 52 common.cpp:117] Cannot create Curand generator. Curand won't be available.
E0610 03:02:23.290375 52 common.cpp:121] Cannot create cuDNN handle. cuDNN won't be available.
F0610 03:02:23.331511 52 syncedmem.hpp:19] Check failed: error == cudaSuccess (3 vs. 0) initialization error
*** Check failure stack trace: ***
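
For reference, this is roughly how I start the container and check that the GPU is visible inside it. The host-side port mapping is my own choice (34448 is just the gunicorn port from the log above), and the check assumes the image is launched through nvidia-docker so the driver files are injected:

$ nvidia-docker run -d --name digits -p 5000:34448 nvidia/digits
$ nvidia-docker run -ti --rm --entrypoint=bash nvidia/digits
# nvidia-smi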

@flx42
Member

flx42 commented Jun 10, 2016

Yeah, I encountered this issue recently too.
@lukeyeager @gheinrich: is this a known issue with DIGITS 3.0.0?

We really need to upload a newer version of DIGITS; the old version may be the cause.
If we don't have a new deb soon, I will submit a Dockerfile that builds from source.

@gheinrich

I don't think this is an issue in DIGITS 3.0 per se, but rather something to do with conflicting packages; see NVIDIA/DIGITS#801 (comment).

I'm surprised training works fine. @Cimenx did you update your image after you initially trained the model?

@flx42
Member

flx42 commented Jun 10, 2016

No, we have the 7.5 package, because I pin the cuDNN version in my Dockerfile:
https://github.com/NVIDIA/nvidia-docker/blob/master/ubuntu-14.04/cuda/7.5/runtime/cudnn5/Dockerfile#L10

You can also verify it inside the container:

$ docker run -ti --entrypoint=bash nvidia/digits
# dpkg -l | grep cudnn
ii  libcudnn5                          5.0.5-1+cuda7.5                         amd64        cuDNN runtime libraries
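For reference, the pin itself is just a versioned apt install; using the package/version string from the dpkg output above, the relevant Dockerfile step looks roughly like this (a sketch, not the exact line from the linked file):

RUN apt-get install -y --no-install-recommends libcudnn5=5.0.5-1+cuda7.5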

@flx42
Member

flx42 commented Jun 10, 2016

@Cimenx, on Docker Hub you mentioned that the previous version was working? It's probably related to the newer caffe/cuDNN. If so, I will revert and push a different tag for the newer version of DIGITS.

@flx42
Member

flx42 commented Jun 10, 2016

On DIGITS 3.0.0, the problem only exists with digits-server; it works fine with digits-devserver.
With 3.3.0, both work fine.

@flx42
Member

flx42 commented Jun 11, 2016

After an inverted bisect, I found that the problem was fixed by NVIDIA/DIGITS@9dba452.
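
For anyone unfamiliar with the term: an inverted bisect hunts for the commit that fixed the bug rather than the one that introduced it. With git's custom bisect terms the procedure looks roughly like this (a sketch of the technique, not the exact commands I ran; the commit placeholders are illustrative):

$ git bisect start --term-old=broken --term-new=fixed
$ git bisect broken <old-release-known-to-fail>
$ git bisect fixed master

Then, at each step, test the checkout and mark it with "git bisect broken" or "git bisect fixed" until git reports the first fixed commit.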

@flx42
Member

flx42 commented Jun 11, 2016

@gheinrich @lukeyeager I was able to reproduce the problem on two machines without Docker, just using the packages from the repo, so it might be a DIGITS/packaging problem.

@DrSensor
Author

DrSensor commented Jun 13, 2016

@gheinrich No, I'm training in a fresh, updated container.
@flx42 Yes, it worked before I updated the Docker image.
(FYI: I also submitted an issue at Kaixhin/dockerfiles#14 (comment), and the Docker image he built is working fine for me now.)

@flx42
Member

flx42 commented Jun 13, 2016

@Cimenx: sure, that's because the master branch of DIGITS doesn't have this problem.

@flx42
Member

flx42 commented Jun 13, 2016

@Cimenx did you update the NVIDIA driver recently?

@3XX0 added the bug label Jun 14, 2016
@lukeyeager
Member

Hi @Cimenx, DIGITS dev here. I tried a simple training + inference with the nvidia/digits image and couldn't reproduce your issue.

  1. Are you training a model while you're trying to run inference? Maybe you're just running out of memory. That's something we fixed in Inference jobs DIGITS#573; now you can only run inference on a GPU which isn't already in use.
  2. Do you have an old GPU in your system? Maybe DIGITS is trying to use the wrong GPU.
  3. What is the version number of your NVIDIA driver? (The nvidia-smi check sketched below covers both 2 and 3.)
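
For items 2 and 3, nvidia-smi on the host shows both the driver version (in the header) and which GPUs are already busy; the query form below is just a generic convenience, not anything DIGITS-specific:

$ nvidia-smi
$ nvidia-smi --query-gpu=name,driver_version,memory.used --format=csv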

@flx42
Member

flx42 commented Jun 15, 2016

We are still investigating the bug for DIGITS 3.0.
But you can now also use DIGITS 3.3 and 4.0 from Docker Hub; the problem doesn't exist in those versions.
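
Grabbing one of the newer images is just a matter of pulling the corresponding tag; I'm writing the tag name and port mapping from memory, so double-check them on the Docker Hub page:

$ docker pull nvidia/digits:4.0
$ nvidia-docker run -d --name digits -p 5000:5000 nvidia/digits:4.0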

@flx42
Member

flx42 commented Jun 16, 2016

We are now tracking this bug in NVIDIA/DIGITS#845.
Since you can now use newer versions of DIGITS from Docker Hub, I'm closing this issue.

@pierrelzw

> @Cimenx did you update the NVIDIA driver recently?

I recently updated the NVIDIA driver and then came across this problem ("Cannot create Cublas handle..."). I've tried re-pulling the NVIDIA/DIGITS images, but it doesn't work. I'm new to Docker. Do you have any solution?
