CUDA GPU terminates process at random instances

cmccabe · December 18, 2016, 10:43am

I am trying to troubleshoot an error on a virtual server running Ubuntu 14.04. What happens, seemingly at random, is that the GPU stops processing and the process terminates. By seemingly random I mean that, for example, three runs complete without error and then the error appears on the fourth run. It has happened 4 times now, and the only consistency is that it appears to fail at the same point: cycle 21 (as indicated by a log not included here). If I reboot, the GPU starts up again and processes normally.
Are there any commands/recommendations that might help me figure out what is going on? Thank you :).

Error:

CUDA: gpuDeviceConfig: device added for evaluation: 0:GeForce GTX 970 v5.2 3.99982GB
CUDA: gpuDeviceConfig: minimum compute version used for pipeline: 2.0
CUDA 0: gpuDeviceConfig::initDeviceContexts: Creating Context and Constant memory on device with id: 0
terminate called after throwing an instance of 'cudaExecutionException'

+----------------------------------------
 | ** CUDA ERROR! **
 | Error: 46
 | Msg: all CUDA-capable devices are busy or unavailable
 | File: cudaWrapper.cpp
 | Line: 127
 +----------------------------------------
  what():  CUDA EXCEPTION: Error occurred during job Execution!
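
Since the message says the devices are "busy or unavailable", one possible first step is to record what the driver reports about the GPU just before and just after a failing run. The following is a minimal sketch, assuming `nvidia-smi` is installed and on the PATH of the virtual server. The function names `build_commands` and `snapshot` are hypothetical, and the chosen query fields are only a suggestion. It does not diagnose anything by itself; it only captures the compute mode, memory use, and the list of processes currently holding the GPU.

```python
import shutil
import subprocess

# Query fields understood by nvidia-smi (see `nvidia-smi --help-query-gpu`).
GPU_FIELDS = "timestamp,name,compute_mode,memory.used,memory.total,temperature.gpu"
APP_FIELDS = "pid,process_name,used_memory"


def build_commands():
    """Return the two nvidia-smi invocations used for one snapshot."""
    return [
        ["nvidia-smi", f"--query-gpu={GPU_FIELDS}", "--format=csv,noheader"],
        ["nvidia-smi", f"--query-compute-apps={APP_FIELDS}", "--format=csv,noheader"],
    ]


def snapshot():
    """Run each query and return its output as a list of strings."""
    if shutil.which("nvidia-smi") is None:
        return ["nvidia-smi not found on PATH"]
    results = []
    for cmd in build_commands():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            results.append(proc.stdout.strip() or proc.stderr.strip())
        except subprocess.TimeoutExpired:
            # A hanging nvidia-smi is itself worth noting in the record.
            results.append("nvidia-smi timed out: " + " ".join(cmd))
    return results


if __name__ == "__main__":
    for entry in snapshot():
        print(entry)
```

Running this before each run and again right after a crash would show whether another process still holds the GPU or whether the driver has stopped responding. Checking the kernel log (`dmesg`) for NVIDIA driver messages around the time of the crash may also help.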

Don_Cragun · December 19, 2016, 9:11am

I know very little about GPU programming, but from the error message I would assume that you are asking the GPU to start a new thread when the resources needed to run that thread are not available.

What does your documentation for your GeForce GTX 970 v5.2 say error code 46 means? What are you running on your GPU?

What is cycle 21 in your GPU code doing?

cmccabe · December 20, 2016, 8:21am

Error 46 seems to be a CUDA API error. The GPU runs data-intensive analysis using HPC clustering and parallel processing.
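
One way to check what a numeric code means is to ask the CUDA runtime itself through its `cudaGetErrorString` function. The sketch below assumes that a `libcudart` shared library can be located by the system loader, which may not be true if the toolkit lives in a non-standard directory. The helper name `cuda_error_string` is hypothetical. The numbering of error codes can differ between toolkit versions, so the answer applies only to the runtime that is actually loaded.

```python
import ctypes
import ctypes.util


def cuda_error_string(code):
    """Return the CUDA runtime's description of an error code, or None
    if the runtime library cannot be found or loaded."""
    lib_name = ctypes.util.find_library("cudart")
    if lib_name is None:
        return None
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    # C signature: const char* cudaGetErrorString(cudaError_t error)
    lib.cudaGetErrorString.restype = ctypes.c_char_p
    lib.cudaGetErrorString.argtypes = [ctypes.c_int]
    return lib.cudaGetErrorString(code).decode()


if __name__ == "__main__":
    print(cuda_error_string(46))
```

If the text returned matches the "Msg:" line in the error box, the code comes from the CUDA runtime rather than from the application.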

     File:
     /sw_results/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome/X0_Y0/acq_0020.dat
     FileLoadWorker: ImageProcessing time for flow 21: 0.65(ld=0.39 pin=0.05 cnc=0.11 xt=0.09 sem=0.00 cache=0.06) sec 16:07:13
     File:
     /sw_results/R_2016_12_05_13_30_48_user_S5-00580-17-Medexome/X0_Y0/acq_0021.dat
     CUDA: gpuDeviceConfig: device added for evaluation: 0:GeForce GTX 970 v5.2 3.99982GB
     CUDA: gpuDeviceConfig: minimum compute version used for pipeline: 2.0
     CUDA 0: gpuDeviceConfig::initDeviceContexts: Creating Context and Constant memory on device with id: 0
     terminate called after throwing an instance of 'cudaExecutionException'

It seems the CUDA exception was thrown in flow 21 and the GPU was interrupted. Is there a way to figure out the cause of that interruption? Thank you :).
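
To confirm that every failing run stops at the same flow, the logs of the failed runs could be scanned automatically. The following is a minimal sketch, assuming the log keeps the two line formats seen in the excerpt above ("ImageProcessing time for flow N:" and "terminate called after throwing an instance of"). The function name `last_flow_before_crash` is hypothetical, and the sample text is taken from the excerpt.

```python
import re

# Patterns taken from the log excerpt in this thread.
FLOW_RE = re.compile(r"ImageProcessing time for flow (\d+):")
CRASH_MARKER = "terminate called after throwing an instance of"


def last_flow_before_crash(log_text):
    """Return the last flow number logged before the crash marker,
    or None if the log contains no crash marker."""
    last_flow = None
    for line in log_text.splitlines():
        match = FLOW_RE.search(line)
        if match:
            last_flow = int(match.group(1))
        if CRASH_MARKER in line:
            return last_flow
    return None


sample = (
    "FileLoadWorker: ImageProcessing time for flow 21: 0.65(ld=0.39 pin=0.05 "
    "cnc=0.11 xt=0.09 sem=0.00 cache=0.06) sec 16:07:13\n"
    "terminate called after throwing an instance of 'cudaExecutionException'\n"
)

if __name__ == "__main__":
    print(last_flow_before_crash(sample))  # prints 21
```

If all four failed runs report the same flow, the trigger is more likely tied to something that happens at that point in the pipeline (here, a new context being created on the device) than to a truly random hardware fault.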