
Conversation

@XuehaiPan
Contributor

@XuehaiPan XuehaiPan commented Jul 28, 2022

Description

Follow-up to #11626; see #11626 (comment).

The environment variable CUDA_VISIBLE_DEVICES can cause libcuda.cuInit() to fail (return non-zero). The CUDA driver detector detects the driver version in the current process, so a user-defined environment variable can make it fail.

For example:

  • An empty CUDA_VISIBLE_DEVICES causes libcuda.cuInit() to return CUDA_ERROR_NO_DEVICE (100).
$ CUDA_VISIBLE_DEVICES='' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
None
  • An invalid CUDA_VISIBLE_DEVICES causes libcuda.cuInit() to return CUDA_ERROR_INVALID_DEVICE (101).
$ CUDA_VISIBLE_DEVICES='0,0' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
None

In this PR, we detect the CUDA driver version in a spawned subprocess, which unsets the environment variable CUDA_VISIBLE_DEVICES before calling libcuda.cuInit().

$ CUDA_VISIBLE_DEVICES='' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
11.7

$ CUDA_VISIBLE_DEVICES='0,0' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
11.7

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@XuehaiPan XuehaiPan requested a review from a team as a code owner July 28, 2022 11:54
@conda-bot conda-bot added the cla-signed label Jul 28, 2022
@travishathaway travishathaway added the source::community label Jul 28, 2022
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from fc1cda9 to 6efb65d Compare July 30, 2022 09:09
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from 896dbe6 to d2f35a9 Compare November 30, 2022 13:10
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from 65176d2 to 4412dd0 Compare December 7, 2022 13:02
@jezdez jezdez added this to the 23.1.0 milestone Dec 19, 2022
@jezdez jezdez requested a review from chenghlee December 19, 2022 21:26
@jezdez
Member

jezdez commented Dec 19, 2022

@chenghlee @jakirkham Do you happen to know who we could ask to review this?

Comment on lines 28 to 32
# Do not inherit file descriptors and handles from the parent process
# The `fork` start method should be considered unsafe as it can lead to
# crashes of the subprocess
ctx = mp.get_context("spawn")
queue = ctx.SimpleQueue()
@jakirkham (Member)

Is there a way to do this without starting a subprocess? This feels kind of brittle

@XuehaiPan XuehaiPan (Contributor Author) Dec 20, 2022

Is there a way to do this without starting a subprocess?

@jakirkham Yes, another approach is to unset "CUDA_VISIBLE_DEVICES" in os.environ before calling cuInit() in the main process.

# An empty `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_NO_DEVICE`
# An invalid `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_INVALID_DEVICE`
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

The point of using a subprocess here is to separate the .so handle (fd) namespaces, although it adds more complexity.

  1. We never release the .so handle once we load libcuda.so. The conda program would keep the NVIDIA device (/dev/nvidia-uvm) and kernel module (nvidia_uvm) in use while solving the environment. A benefit of using a subprocess here is that we can release the fd as soon as the subprocess ends. We use caches to memoize the result, so the subprocess starts only once.

    watch 'lsof -t /dev/nvidia* | xargs -r -L 1 ps -o pid=,user=,command= -p'
    [screenshot of the watch/lsof output showing the conda process holding /dev/nvidia* open]
  2. Currently, conda is a standalone program and is not meant to be imported. Unsetting CUDA_VISIBLE_DEVICES and calling cuInit() in the main process seems safe here. But if someone calls cuInit() with an invalid CUDA_VISIBLE_DEVICES in advance, then unsetting CUDA_VISIBLE_DEVICES and calling cuInit() in conda internals will always fail; the .so handles and fds get mixed up in the main process.

@jakirkham
Member

I think @seibert wrote the original CUDA driver version detection logic.

@kkraus14 and @jaimergp may also have thoughts (Jaime in particular has gone down the CUDA Windows rabbit hole before).

Am happy to look as well. Though this may be something that occurs in the new year.

@kkraus14

kkraus14 commented Dec 20, 2022

Instead of using the CUDA driver API to accomplish this, we could use the nvml library which doesn't have the same restrictions with regards to CUDA_VISIBLE_DEVICES.

Some example code:

from ctypes import *

libnvml = CDLL("libnvidia-ml.so.1")  # Will need to handle linux vs windows logic here
status = libnvml.nvmlInit_v2()  # Should check the result of status here to ensure it succeeded

cudaDriverVersion = c_int()
status = libnvml.nvmlSystemGetCudaDriverVersion_v2(byref(cudaDriverVersion))  # Should check the result of status here to ensure it succeeded

# cudaDriverVersion is now a ctypes c_int with the CUDA driver version
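
A slightly fuller sketch along these lines, with the status checks filled in (illustrative only; the Windows library name nvml.dll and the error handling are assumptions, not conda's implementation):

import ctypes
import platform


def cuda_driver_version_via_nvml():
    lib_name = "nvml.dll" if platform.system() == "Windows" else "libnvidia-ml.so.1"
    try:
        libnvml = ctypes.CDLL(lib_name)
    except OSError:
        return None  # no NVIDIA driver / NVML not installed

    if libnvml.nvmlInit_v2() != 0:  # 0 == NVML_SUCCESS
        return None
    try:
        version = ctypes.c_int()
        if libnvml.nvmlSystemGetCudaDriverVersion(ctypes.byref(version)) != 0:
            return None
        # Same encoding as the NVML_CUDA_DRIVER_VERSION_MAJOR/MINOR macros
        return f"{version.value // 1000}.{(version.value % 1000) // 10}"
    finally:
        libnvml.nvmlShutdown()


print(cuda_driver_version_via_nvml())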

@XuehaiPan
Contributor Author

we could use the nvml library which doesn't have the same restrictions with regards to CUDA_VISIBLE_DEVICES.

The NVML library publishes new APIs with a _vN suffix (e.g., nvmlInit_v2). The newly versioned APIs (symbols in libnvidia-ml.so) may not exist on the user's system if they run an old driver.

@kkraus14

nvmlInit_v2 was added in the CUDA 5.5 release of NVML so this should be fine. We can move to use nvmlInit if we want to be even more conservative (which is silly IMO).

nvmlSystemGetCudaDriverVersion_v2 was added in the CUDA 10 release of NVML, so maybe we should avoid using this. It looks like nvmlSystemGetCudaDriverVersion was added between CUDA 8 and CUDA 9 in driver 378 (NVIDIA/nvidia-settings@c9e5453), so that would be pretty good as well.


A potentially bad idea: it looks like you can call cuDriverGetVersion without calling cuInit first and it still returns the correct version without error. This is definitely relying on implementation details of the CUDA driver and could change without notice in the future.
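
For reference, that experiment is roughly the following (a sketch; as noted, relying on this is an implementation detail of the driver):

import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
status = libcuda.cuDriverGetVersion(ctypes.byref(version))  # no cuInit() beforehand
print(status, version.value)  # e.g. 0 11070, i.e. CUDA 11.7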

@jezdez jezdez self-assigned this Jan 9, 2023
@jezdez
Member

jezdez commented Jan 10, 2023

Hey all! Thank you @XuehaiPan for continuing to push for change.

I'm trying to understand the correctness of this patch given that we plan to tag conda 23.1.0 January 16. Is this ready?

On a different note, there is a larger discussion we could have: now that this virtual package is implemented as a plugin, we could release it in a separate conda-cuda-virtual-package (or similar) package, maybe co-maintained with folks from Nvidia? @jakirkham @kkraus14 Any opinions on this?

@XuehaiPan
Contributor Author

Is this ready?

@jezdez I think it's ready, but it depends on the opinions of the code reviewers.


Let me summarize the changes in this PR:

  1. Re-add the libcuda.so search path for WSL support.

  2. Put the whole CUDA detection function in a separate subprocess:

    1. Main process: Start a new subprocess with the spawn start method.
    2. Subprocess: Detect the CUDA version in the subprocess and put the result in a SimpleQueue.
      • New change: unset CUDA_VISIBLE_DEVICES before initializing the CUDA driver context.
    3. Main process: receive the result from the subprocess and return the result to the function caller.

Rationale

  1. Loading a .so file in the main process is not safe. If the driver library is corrupted, the conda program will crash (segmentation fault). With the CUDA detector in a daemon subprocess, if detection fails, the caller can return None rather than crash.
  2. Better process environment management. The CUDA driver library requires a valid CUDA_VISIBLE_DEVICES. Using a subprocess, we can keep the environments separate rather than mixing them together.
  3. Early release of the CUDA library handle. We never release the .so library handle once we load it. If we load the CUDA library in the main process, the NVIDIA device and kernel module stay in use for the whole conda program lifetime. With a subprocess, we can release the handle as soon as we get the result and the subprocess exits.

@kkraus14

It feels really funky to unset the CUDA_VISIBLE_DEVICES environment variable in a spawned subprocess. Almost feels like this is conda overstepping its bounds.

We could use NVML and avoid having to mess with the environment variable and then decide if we want to use the subprocess to more nicely load and unload the library / isolate it from crashing the main conda process.

@XuehaiPan
Contributor Author

XuehaiPan commented Jan 11, 2023

Almost feels like this is conda overstepping its bounds.

@kkraus14 I agree. I can open another PR to refactor with the NVML library.

One concern is the version handling and backward-compatibility guarantees of the NVML library functions. Different NVML library versions ship different symbols. Newer versioned functions may take extra arguments, and the new and old variants are not always interchangeable.

That may require extra maintenance work. For example: check whether the library is new enough to have the newer versioned function with hasattr(lib, '<symbol_vN>'), otherwise fall back to _v(N-1), then call the function with the proper arguments (which may differ between versions); see the sketch after the header excerpt below. Every time a new NVIDIA driver is released, we may also need to update the version handling again.

#define NVML_INIT_FLAG_NO_GPUS      1   //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH    2   //!< Don't attach GPUs

nvmlReturn_t DECLDIR nvmlInit(void);
nvmlReturn_t DECLDIR nvmlInit_v2(void);
nvmlReturn_t DECLDIR nvmlInit_v3(unsigned int flags);
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);
nvmlReturn_t DECLDIR nvmlShutdown(void);

nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);
nvml.h as of CUDA 11.8
/***************************************************************************************************/
/** @defgroup nvmlInitializationAndCleanup Initialization and Cleanup
 * This chapter describes the methods that handle NVML initialization and cleanup.
 * It is the user's responsibility to call \ref nvmlInit_v2() before calling any other methods, and
 * nvmlShutdown() once NVML is no longer being used.
 *  @{
 */
/***************************************************************************************************/

#define NVML_INIT_FLAG_NO_GPUS      1   //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH    2   //!< Don't attach GPUs

/**
 * Initialize NVML, but don't initialize any GPUs yet.
 *
 * \note nvmlInit_v3 introduces a "flags" argument, that allows passing boolean values
 *       modifying the behaviour of nvmlInit().
 * \note In NVML 5.319 new nvmlInit_v2 has replaced nvmlInit"_v1" (default in NVML 4.304 and older) that
 *       did initialize all GPU devices in the system.
 *
 * This allows NVML to communicate with a GPU
 * when other GPUs in the system are unstable or in a bad state.  When using this API, GPUs are
 * discovered and initialized in nvmlDeviceGetHandleBy* functions instead.
 *
 * \note To contrast nvmlInit_v2 with nvmlInit"_v1", NVML 4.304 nvmlInit"_v1" will fail when any detected GPU is in
 *       a bad or unstable state.
 *
 * For all products.
 *
 * This method, should be called once before invoking any other methods in the library.
 * A reference count of the number of initializations is maintained.  Shutdown only occurs
 * when the reference count reaches zero.
 *
 * @return
 *         - \ref NVML_SUCCESS                   if NVML has been properly initialized
 *         - \ref NVML_ERROR_DRIVER_NOT_LOADED   if NVIDIA driver is not running
 *         - \ref NVML_ERROR_NO_PERMISSION       if NVML does not have permission to talk to the driver
 *         - \ref NVML_ERROR_UNKNOWN             on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlInit_v2(void);

/**
 * nvmlInitWithFlags is a variant of nvmlInit(), that allows passing a set of boolean values
 *       modifying the behaviour of nvmlInit().
 *       Other than the "flags" parameter it is completely similar to \ref nvmlInit_v2.
 *
 * For all products.
 *
 * @param flags                                 behaviour modifier flags
 *
 * @return
 *         - \ref NVML_SUCCESS                   if NVML has been properly initialized
 *         - \ref NVML_ERROR_DRIVER_NOT_LOADED   if NVIDIA driver is not running
 *         - \ref NVML_ERROR_NO_PERMISSION       if NVML does not have permission to talk to the driver
 *         - \ref NVML_ERROR_UNKNOWN             on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);

/**
 * Shut down NVML by releasing all GPU resources previously allocated with \ref nvmlInit_v2().
 *
 * For all products.
 *
 * This method should be called after NVML work is done, once for each call to \ref nvmlInit_v2()
 * A reference count of the number of initializations is maintained.  Shutdown only occurs
 * when the reference count reaches zero.  For backwards compatibility, no error is reported if
 * nvmlShutdown() is called more times than nvmlInit().
 *
 * @return
 *         - \ref NVML_SUCCESS                 if NVML has been properly shut down
 *         - \ref NVML_ERROR_UNINITIALIZED     if the library has not been successfully initialized
 *         - \ref NVML_ERROR_UNKNOWN           on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlShutdown(void);

/** @} */

/**
 * Retrieves the version of the CUDA driver.
 *
 * For all products.
 *
 * The CUDA driver version returned will be retreived from the currently installed version of CUDA.
 * If the cuda library is not found, this function will return a known supported version number.
 *
 * @param cudaDriverVersion                    Reference in which to return the version identifier
 *
 * @return
 *         - \ref NVML_SUCCESS                 if \a cudaDriverVersion has been set
 *         - \ref NVML_ERROR_INVALID_ARGUMENT  if \a cudaDriverVersion is NULL
 */
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);

/**
 * Retrieves the version of the CUDA driver from the shared library.
 *
 * For all products.
 *
 * The returned CUDA driver version by calling cuDriverGetVersion()
 *
 * @param cudaDriverVersion                    Reference in which to return the version identifier
 *
 * @return
 *         - \ref NVML_SUCCESS                  if \a cudaDriverVersion has been set
 *         - \ref NVML_ERROR_INVALID_ARGUMENT   if \a cudaDriverVersion is NULL
 *         - \ref NVML_ERROR_LIBRARY_NOT_FOUND  if \a libcuda.so.1 or libcuda.dll is not found
 *         - \ref NVML_ERROR_FUNCTION_NOT_FOUND if \a cuDriverGetVersion() is not found in the shared library
 */
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);

/**
 * Macros for converting the CUDA driver version number to Major and Minor version numbers.
 */
#define NVML_CUDA_DRIVER_VERSION_MAJOR(v) ((v)/1000)
#define NVML_CUDA_DRIVER_VERSION_MINOR(v) (((v)%1000)/10)
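
As mentioned above, the kind of versioned-symbol fallback this would require might look like the following hypothetical sketch (illustrative only):

import ctypes

libnvml = ctypes.CDLL("libnvidia-ml.so.1")

# ctypes raises AttributeError for missing symbols, so hasattr() can probe
# which versioned entry points the installed driver actually ships.
if hasattr(libnvml, "nvmlInitWithFlags"):
    status = libnvml.nvmlInitWithFlags(0)
elif hasattr(libnvml, "nvmlInit_v2"):
    status = libnvml.nvmlInit_v2()
else:
    status = libnvml.nvmlInit()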

@kkraus14

Given that none of the older functions are deprecated or removed, we could just use them and hold off on any kind of versioning until needed.

See my comment here: #11667 (comment)

Driver 378, which was when nvmlSystemGetCudaDriverVersion was included, was released ~6 years ago, so I think it's pretty reasonable that we don't need to detect the CUDA version on drivers older than that. Besides, the oldest version of the CUDA toolkit packaged as a conda package is 8.0, and it's only packaged for Windows on the anaconda channel. The oldest version packaged on conda-forge is 9.2.

kkraus14 previously approved these changes Jan 11, 2023

@kkraus14 kkraus14 left a comment

LGTM other than refactoring to use nvml instead of the driver library, but we can do that in a follow up

@jezdez jezdez (Member) left a comment

Some minor code fixes.

@jezdez jezdez enabled auto-merge (squash) January 16, 2023 16:29
@jezdez jezdez merged commit bdacdbb into conda:main Jan 16, 2023
@XuehaiPan XuehaiPan deleted the cuda-detect branch January 17, 2023 01:20
@jezdez
Member

jezdez commented Jan 20, 2023

@XuehaiPan @kkraus14 Before I forget, do you feel like it's important to file a ticket for the refactoring to use the nvml library?

Alternatively, does anyone know how likely it would be for us to extract the cuda virtual package into a separate repo? Any takers from Nvidia?

@jakirkham
Member

It's probably a good idea to have an issue

We could look into a plugin, but that won't be for a while

Highest priority for me is getting new packages together ( conda-forge/staged-recipes#21382 ) and starting CUDA 12 builds

@leofang

leofang commented Jan 24, 2023

NVML should be the way to go (e.g. see here). Sorry to see/raise this late, but this feels a bit awkward to me 😅 Like Keith said, the needed NVML symbols have existed for a very long time, so NVML should be preferred for detecting the driver version without initializing a CUDA context.

@github-actions github-actions bot added the locked label Jan 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2024