
Conversation

@XuehaiPan
Contributor

@XuehaiPan XuehaiPan commented Jul 28, 2022

Description

Follow-up to #11626; see #11626 (comment).

The environment variable CUDA_VISIBLE_DEVICES can cause libcuda.cuInit() to fail (return non-zero). The CUDA driver detector detects the driver version in the current process, so a user-defined environment variable can make it fail.

For example:

  • An empty CUDA_VISIBLE_DEVICES causes libcuda.cuInit() to return CUDA_ERROR_NO_DEVICE (100).
$ CUDA_VISIBLE_DEVICES='' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
None
  • An invalid CUDA_VISIBLE_DEVICES causes libcuda.cuInit() to return CUDA_ERROR_INVALID_DEVICE (101).
$ CUDA_VISIBLE_DEVICES='0,0' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
None

In this PR, we detect the CUDA driver version in a spawned subprocess, which unsets the environment variable CUDA_VISIBLE_DEVICES before calling libcuda.cuInit().

$ CUDA_VISIBLE_DEVICES='' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
11.7

$ CUDA_VISIBLE_DEVICES='0,0' python3 -c 'from conda.plugins.virtual_packages import cuda; print(cuda.cuda_version())'
11.7

Checklist - did you ...

  • Add a file to the news directory (using the template) for the next release's release notes?
  • Add / update necessary tests?
  • Add / update outdated documentation?

@XuehaiPan XuehaiPan requested a review from a team as a code owner July 28, 2022 11:54
@conda-bot conda-bot added the cla-signed label Jul 28, 2022
@travishathaway travishathaway added the source::community label Jul 28, 2022
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from fc1cda9 to 6efb65d Compare July 30, 2022 09:09
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from 896dbe6 to d2f35a9 Compare November 30, 2022 13:10
@XuehaiPan XuehaiPan force-pushed the cuda-detect branch 2 times, most recently from 65176d2 to 4412dd0 Compare December 7, 2022 13:02
@jezdez jezdez added this to the 23.1.0 milestone Dec 19, 2022
@jezdez jezdez requested a review from chenghlee December 19, 2022 21:26
@jezdez
Member

jezdez commented Dec 19, 2022

@chenghlee @jakirkham Do you happen to know who we could ask to review this?

Comment on lines 28 to 32
# Do not inherit file descriptors and handles from the parent process
# The `fork` start method should be considered unsafe as it can lead to
# crashes of the subprocess
ctx = mp.get_context("spawn")
queue = ctx.SimpleQueue()
@jakirkham (Member)

Is there a way to do this without starting a subprocess? This feels kind of brittle

@XuehaiPan XuehaiPan (Contributor Author) Dec 20, 2022

Is there a way to do this without starting a subprocess?

@jakirkham Yes, another approach is to unset "CUDA_VISIBLE_DEVICES" in os.environ before calling cuInit() in the main process.

# An empty `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_NO_DEVICE`
# An invalid `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_INVALID_DEVICE`
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

The point of using a subprocess here is to separate the .so handle (fd) namespaces, although it adds more complexity.

  1. We never release the .so handle once we load libcuda.so. The conda program would keep the NVIDIA device (/dev/nvidia-uvm) and kernel module (nvidia_uvm) in use while solving the environment. A benefit of using a subprocess here is that we can release the fd as soon as the subprocess ends. We use caches to memoize the result, so the subprocess starts only once.

    watch 'lsof -t /dev/nvidia* | xargs -r -L 1 ps -o pid=,user=,command= -p'
    [screenshot of the watch/lsof output showing the conda process holding /dev/nvidia* open]
  2. Currently, conda is a standalone program and is not meant to be imported. Unsetting CUDA_VISIBLE_DEVICES and calling cuInit() in the main process seems safe here. But if someone calls cuInit() with an invalid CUDA_VISIBLE_DEVICES in advance, then unsetting CUDA_VISIBLE_DEVICES and calling cuInit() in conda internals will always fail; the .so handles and fds get mixed up in the main process.

@jakirkham
Member

I think @seibert wrote the original CUDA driver version detection logic.

@kkraus14 and @jaimergp may also have thoughts (Jaime in particular has gone down the CUDA Windows rabbit hole before).

Am happy to look as well. Though this may be something that occurs in the new year.

@kkraus14

kkraus14 commented Dec 20, 2022

Instead of using the CUDA driver API to accomplish this, we could use the nvml library which doesn't have the same restrictions with regards to CUDA_VISIBLE_DEVICES.

Some example code:

from ctypes import *

libnvml = CDLL("libnvidia-ml.so.1")  # Will need to handle linux vs windows logic here
status = libnvml.nvmlInit_v2()  # Should check the result of status here to ensure it succeeded

cudaDriverVersion = c_int()
status = libnvml.nvmlSystemGetCudaDriverVersion_v2(byref(cudaDriverVersion))  # Should check the result of status here to ensure it succeeded

# cudaDriverVersion is now a ctypes c_int with the CUDA driver version
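
A slightly fuller sketch along these lines, with the status checks filled in (illustrative only; the Windows library name nvml.dll and the error handling are assumptions, not conda's implementation):

import ctypes
import platform


def cuda_driver_version_via_nvml():
    lib_name = "nvml.dll" if platform.system() == "Windows" else "libnvidia-ml.so.1"
    try:
        libnvml = ctypes.CDLL(lib_name)
    except OSError:
        return None  # no NVIDIA driver / NVML not installed

    if libnvml.nvmlInit_v2() != 0:  # 0 == NVML_SUCCESS
        return None
    try:
        version = ctypes.c_int()
        if libnvml.nvmlSystemGetCudaDriverVersion(ctypes.byref(version)) != 0:
            return None
        # Same encoding as the NVML_CUDA_DRIVER_VERSION_MAJOR/MINOR macros
        return f"{version.value // 1000}.{(version.value % 1000) // 10}"
    finally:
        libnvml.nvmlShutdown()


print(cuda_driver_version_via_nvml())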

@XuehaiPan
Contributor Author

we could use the nvml library which doesn't have the same restrictions with regards to CUDA_VISIBLE_DEVICES.

The NVML library publishes new APIs with a _vN suffix (e.g., nvmlInit_v2). The newly versioned APIs (symbols in libnvidia-ml.so) may not exist on the user's system if they run an old driver.

@kkraus14

nvmlInit_v2 was added in the CUDA 5.5 release of NVML so this should be fine. We can move to use nvmlInit if we want to be even more conservative (which is silly IMO).

nvmlSystemGetCudaDriverVersion_v2 was added in the CUDA 10 release of NVML, so maybe we should avoid using this. It looks like nvmlSystemGetCudaDriverVersion was added between CUDA 8 and CUDA 9 in driver 378 (NVIDIA/nvidia-settings@c9e5453), so that would be pretty good as well.


A potentially bad idea: it looks like you can call cuDriverGetVersion without calling cuInit first and it still returns the correct version without error. This is definitely relying on implementation details of the CUDA driver and could change without notice in the future.
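
For reference, that experiment is roughly the following (a sketch; as noted, relying on this is an implementation detail of the driver):

import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
status = libcuda.cuDriverGetVersion(ctypes.byref(version))  # no cuInit() beforehand
print(status, version.value)  # e.g. 0 11070, i.e. CUDA 11.7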

@jezdez jezdez self-assigned this Jan 9, 2023
@jezdez
Member

jezdez commented Jan 10, 2023

Hey all! Thank you @XuehaiPan for continuing to push for change.

I'm trying to understand the correctness of this patch given that we plan to tag conda 23.1.0 January 16. Is this ready?

On a different note, there is a larger discussion we could have: now that this virtual package is implemented as a plugin, we could release it in a separate conda-cuda-virtual-package (or similar) package, maybe co-maintained with folks from Nvidia? @jakirkham @kkraus14 Any opinions on this?

@XuehaiPan
Contributor Author

Is this ready?

@jezdez I think it's ready, but it depends on the opinions of the code reviewers.


Let me summarize the changes in this PR:

  1. Re-add the libcuda.so search path for WSL support.

  2. Put the whole CUDA detection function in a separate subprocess:

    1. Main process: Start a new subprocess with the spawn start method.
    2. Subprocess: Detect the CUDA version in the subprocess and put the result in a SimpleQueue.
      • New change: unset CUDA_VISIBLE_DEVICES before initializing the CUDA driver context.
    3. Main process: receive the result from the subprocess and return the result to the function caller.

Rationale

  1. Loading a .so file in the main process is not safe. If the driver library is corrupted, the conda program will crash (segmentation fault). With the CUDA detector in a daemon subprocess, if detection fails, the caller can return None rather than crash.
  2. Better process environment management. The CUDA driver library requires a valid CUDA_VISIBLE_DEVICES. Using a subprocess, we can keep the environments separate rather than mixing them together.
  3. Early release of the CUDA library handle. We never release the .so library handle once we load it. If we load the CUDA library in the main process, the NVIDIA device and kernel module stay in use for the whole conda program lifetime. With a subprocess, we can release the handle as soon as we get the result and the subprocess exits.

@kkraus14

It feels really funky to unset the CUDA_VISIBLE_DEVICES environment variable in a spawned subprocess. Almost feels like this is conda overstepping its bounds.

We could use NVML and avoid having to mess with the environment variable and then decide if we want to use the subprocess to more nicely load and unload the library / isolate it from crashing the main conda process.

@XuehaiPan
Contributor Author

XuehaiPan commented Jan 11, 2023

Almost feels like this is conda overstepping its bounds.

@kkraus14 I agree. I can open another PR to refactor with the NVML library.

One concern is the version handling and backward-compatibility guarantees of the NVML library functions. Different NVML library versions ship different symbols. Newer versioned functions may take extra arguments, and the new and old variants are not always interchangeable.

That may require extra maintenance work. For example: check whether the library is new enough to have the newer versioned function with hasattr(lib, '<symbol_vN>'), otherwise fall back to _v(N-1), then call the function with the proper arguments (which may differ between versions); see the sketch after the header excerpt below. Every time a new NVIDIA driver is released, we may also need to update the version handling again.

#define NVML_INIT_FLAG_NO_GPUS      1   //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH    2   //!< Don't attach GPUs

nvmlReturn_t DECLDIR nvmlInit(void);
nvmlReturn_t DECLDIR nvmlInit_v2(void);
nvmlReturn_t DECLDIR nvmlInit_v3(unsigned int flags);
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);
nvmlReturn_t DECLDIR nvmlShutdown(void);

nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);
nvml.h as of CUDA 11.8
/***************************************************************************************************/
/** @defgroup nvmlInitializationAndCleanup Initialization and Cleanup
 * This chapter describes the methods that handle NVML initialization and cleanup.
 * It is the user's responsibility to call \ref nvmlInit_v2() before calling any other methods, and
 * nvmlShutdown() once NVML is no longer being used.
 *  @{
 */
/***************************************************************************************************/

#define NVML_INIT_FLAG_NO_GPUS      1   //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH    2   //!< Don't attach GPUs

/**
 * Initialize NVML, but don't initialize any GPUs yet.
 *
 * \note nvmlInit_v3 introduces a "flags" argument, that allows passing boolean values
 *       modifying the behaviour of nvmlInit().
 * \note In NVML 5.319 new nvmlInit_v2 has replaced nvmlInit"_v1" (default in NVML 4.304 and older) that
 *       did initialize all GPU devices in the system.
 *
 * This allows NVML to communicate with a GPU
 * when other GPUs in the system are unstable or in a bad state.  When using this API, GPUs are
 * discovered and initialized in nvmlDeviceGetHandleBy* functions instead.
 *
 * \note To contrast nvmlInit_v2 with nvmlInit"_v1", NVML 4.304 nvmlInit"_v1" will fail when any detected GPU is in
 *       a bad or unstable state.
 *
 * For all products.
 *
 * This method, should be called once before invoking any other methods in the library.
 * A reference count of the number of initializations is maintained.  Shutdown only occurs
 * when the reference count reaches zero.
 *
 * @return
 *         - \ref NVML_SUCCESS                   if NVML has been properly initialized
 *         - \ref NVML_ERROR_DRIVER_NOT_LOADED   if NVIDIA driver is not running
 *         - \ref NVML_ERROR_NO_PERMISSION       if NVML does not have permission to talk to the driver
 *         - \ref NVML_ERROR_UNKNOWN             on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlInit_v2(void);

/**
 * nvmlInitWithFlags is a variant of nvmlInit(), that allows passing a set of boolean values
 *       modifying the behaviour of nvmlInit().
 *       Other than the "flags" parameter it is completely similar to \ref nvmlInit_v2.
 *
 * For all products.
 *
 * @param flags                                 behaviour modifier flags
 *
 * @return
 *         - \ref NVML_SUCCESS                   if NVML has been properly initialized
 *         - \ref NVML_ERROR_DRIVER_NOT_LOADED   if NVIDIA driver is not running
 *         - \ref NVML_ERROR_NO_PERMISSION       if NVML does not have permission to talk to the driver
 *         - \ref NVML_ERROR_UNKNOWN             on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);

/**
 * Shut down NVML by releasing all GPU resources previously allocated with \ref nvmlInit_v2().
 *
 * For all products.
 *
 * This method should be called after NVML work is done, once for each call to \ref nvmlInit_v2()
 * A reference count of the number of initializations is maintained.  Shutdown only occurs
 * when the reference count reaches zero.  For backwards compatibility, no error is reported if
 * nvmlShutdown() is called more times than nvmlInit().
 *
 * @return
 *         - \ref NVML_SUCCESS                 if NVML has been properly shut down
 *         - \ref NVML_ERROR_UNINITIALIZED     if the library has not been successfully initialized
 *         - \ref NVML_ERROR_UNKNOWN           on any unexpected error
 */
nvmlReturn_t DECLDIR nvmlShutdown(void);

/** @} */

/**
 * Retrieves the version of the CUDA driver.
 *
 * For all products.
 *
 * The CUDA driver version returned will be retreived from the currently installed version of CUDA.
 * If the cuda library is not found, this function will return a known supported version number.
 *
 * @param cudaDriverVersion                    Reference in which to return the version identifier
 *
 * @return
 *         - \ref NVML_SUCCESS                 if \a cudaDriverVersion has been set
 *         - \ref NVML_ERROR_INVALID_ARGUMENT  if \a cudaDriverVersion is NULL
 */
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);

/**
 * Retrieves the version of the CUDA driver from the shared library.
 *
 * For all products.
 *
 * The returned CUDA driver version by calling cuDriverGetVersion()
 *
 * @param cudaDriverVersion                    Reference in which to return the version identifier
 *
 * @return
 *         - \ref NVML_SUCCESS                  if \a cudaDriverVersion has been set
 *         - \ref NVML_ERROR_INVALID_ARGUMENT   if \a cudaDriverVersion is NULL
 *         - \ref NVML_ERROR_LIBRARY_NOT_FOUND  if \a libcuda.so.1 or libcuda.dll is not found
 *         - \ref NVML_ERROR_FUNCTION_NOT_FOUND if \a cuDriverGetVersion() is not found in the shared library
 */
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);

/**
 * Macros for converting the CUDA driver version number to Major and Minor version numbers.
 */
#define NVML_CUDA_DRIVER_VERSION_MAJOR(v) ((v)/1000)
#define NVML_CUDA_DRIVER_VERSION_MINOR(v) (((v)%1000)/10)
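
As mentioned above, the kind of versioned-symbol fallback this would require might look like the following hypothetical sketch (illustrative only):

import ctypes

libnvml = ctypes.CDLL("libnvidia-ml.so.1")

# ctypes raises AttributeError for missing symbols, so hasattr() can probe
# which versioned entry points the installed driver actually ships.
if hasattr(libnvml, "nvmlInitWithFlags"):
    status = libnvml.nvmlInitWithFlags(0)
elif hasattr(libnvml, "nvmlInit_v2"):
    status = libnvml.nvmlInit_v2()
else:
    status = libnvml.nvmlInit()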

@kkraus14

Given that none of the older functions are deprecated or removed, we could just use them and hold off on any kind of versioning until needed.

See my comment here: #11667 (comment)

Driver 378, which was when nvmlSystemGetCudaDriverVersion was included, was released ~6 years ago, so I think it's pretty reasonable that we don't need to detect the CUDA version on drivers older than that. Besides, the oldest version of the CUDA toolkit packaged as a conda package is 8.0, and it's only packaged for Windows on the anaconda channel. The oldest version packaged on conda-forge is 9.2.

kkraus14 previously approved these changes Jan 11, 2023

@kkraus14 kkraus14 left a comment

LGTM other than refactoring to use nvml instead of the driver library, but we can do that in a follow up

@jezdez jezdez (Member) left a comment

Some minor code fixes.

@jezdez jezdez enabled auto-merge (squash) January 16, 2023 16:29
@jezdez jezdez merged commit bdacdbb into conda:main Jan 16, 2023
@XuehaiPan XuehaiPan deleted the cuda-detect branch January 17, 2023 01:20
@jezdez
Member

jezdez commented Jan 20, 2023

@XuehaiPan @kkraus14 Before I forget, do you feel like it's important to file a ticket for the refactoring to use the nvml library?

Alternatively, does anyone know how likely it would be for us to extract the cuda virtual package into a separate repo? Any takers from Nvidia?

@jakirkham
Member

It's probably a good idea to have an issue

We could look into a plugin, but that won't be for a while

Highest priority for me is getting new packages together ( conda-forge/staged-recipes#21382 ) and starting CUDA 12 builds

@leofang

leofang commented Jan 24, 2023

NVML should be the way to go (e.g. see here). Sorry to see/raise this late, but this feels a bit awkward to me 😅 Like Keith said, the needed NVML symbols have existed for a very long time, so NVML should be preferred for detecting the driver version without initializing a CUDA context.

@github-actions github-actions bot added the locked label Jan 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2024