Detect CUDA driver version in subprocess #11667
Conversation
@chenghlee @jakirkham Do you happen to know who we could ask to review this?
# Do not inherit file descriptors and handles from the parent process.
# The `fork` start method should be considered unsafe as it can lead to
# crashes of the subprocess.
ctx = mp.get_context("spawn")
queue = ctx.SimpleQueue()
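For context, here is a minimal sketch of how this spawn-based detection pattern can be wired up end to end. The `_detect` worker, its queue protocol, and the version formatting are illustrative assumptions, not the exact code in this PR:

import ctypes
import multiprocessing as mp
import os

def _detect(queue):
    # Runs in the spawned child: unset CUDA_VISIBLE_DEVICES, load libcuda,
    # query the driver version, and ship the result back through the queue.
    # Any crash inside libcuda stays confined to the child process.
    os.environ.pop("CUDA_VISIBLE_DEVICES", None)
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")  # Linux; Windows uses nvcuda.dll
    except OSError:
        queue.put(None)
        return
    if libcuda.cuInit(0) != 0:  # 0 == CUDA_SUCCESS
        queue.put(None)
        return
    version = ctypes.c_int(0)
    libcuda.cuDriverGetVersion(ctypes.byref(version))
    queue.put(f"{version.value // 1000}.{(version.value % 1000) // 10}")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.SimpleQueue()
    proc = ctx.Process(target=_detect, args=(queue,))
    proc.start()
    cuda_version = queue.get()  # e.g. "11.8", or None if detection failed
    proc.join()
    print(cuda_version)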
Is there a way to do this without starting a subprocess? This feels kind of brittle
Is there a way to do this without starting a subprocess?

@jakirkham Yes, another approach is to unset `CUDA_VISIBLE_DEVICES` in `os.environ` before calling `cuInit()` in the main process.

# Empty `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_NO_DEVICE`
# Invalid `CUDA_VISIBLE_DEVICES` can cause `cuInit()` to return `CUDA_ERROR_INVALID_DEVICE`
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

The point of using a subprocess here is to separate the `.so` handle (fd) namespaces, although it adds more complexity.
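For reference, a minimal sketch of that unset-in-the-main-process alternative, assuming Linux and `libcuda.so.1` (error handling elided):

import ctypes
import os

# Unset CUDA_VISIBLE_DEVICES so an empty or invalid user-set value cannot
# make cuInit() fail; note this mutates the main process's environment.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

libcuda = ctypes.CDLL("libcuda.so.1")
status = libcuda.cuInit(0)  # 0 == CUDA_SUCCESS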
- We never release the `.so` handle once we load `libcuda.so`. The `conda` program will always keep the NVIDIA device (`/dev/nvidia-uvm`) and kernel module (`nvidia_uvm`) in use while solving the environment. A benefit of using a subprocess here is that we can release the fd as the subprocess ends. We use caches to memoize the result, so the subprocess will start only once.

  watch 'lsof -t /dev/nvidia* | xargs -r -L 1 ps -o pid=,user=,command= -p'

- Currently, `conda` is a standalone program, and it is not meant to be `import`ed. Unsetting `CUDA_VISIBLE_DEVICES` and calling `cuInit()` in the main process seems safe here. But if someone calls `cuInit()` with an invalid `CUDA_VISIBLE_DEVICES` in advance, then unsetting `CUDA_VISIBLE_DEVICES` and calling `cuInit()` in conda internals will always fail: the `.so` handles and fds are messed up in the main process.
Instead of using the CUDA driver API to accomplish this, we could use the NVML library, which doesn't have the same restrictions with regards to `CUDA_VISIBLE_DEVICES`. Some example code:

from ctypes import *

libnvml = CDLL("libnvidia-ml.so.1")  # Will need to handle linux vs windows logic here
status = libnvml.nvmlInit_v2()  # Should check the result of status here to ensure it succeeded
cudaDriverVersion = c_int()
status = libnvml.nvmlSystemGetCudaDriverVersion_v2(byref(cudaDriverVersion))  # Should check the result of status here to ensure it succeeded
# cudaDriverVersion is now a ctypes c_int with the CUDA driver version
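For completeness, a hedged variant of the snippet above with the suggested status checks filled in (`NVML_SUCCESS` is 0; the Linux-only library name is an assumption carried over from the snippet):

import ctypes

NVML_SUCCESS = 0

libnvml = ctypes.CDLL("libnvidia-ml.so.1")  # Linux; Windows would need nvml.dll

status = libnvml.nvmlInit_v2()
if status != NVML_SUCCESS:
    raise RuntimeError(f"nvmlInit_v2 failed with status {status}")

cuda_driver_version = ctypes.c_int()
status = libnvml.nvmlSystemGetCudaDriverVersion_v2(ctypes.byref(cuda_driver_version))
if status != NVML_SUCCESS:
    raise RuntimeError(f"nvmlSystemGetCudaDriverVersion_v2 failed with status {status}")

libnvml.nvmlShutdown()
print(cuda_driver_version.value)  # e.g. 11080 for CUDA 11.8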
The NVML library publishes new APIs with versioned symbol suffixes (e.g. `_v2`) rather than breaking the old entry points.
A potentially bad idea: it looks like you can call
Hey all! Thank you @XuehaiPan for continuing to push for change. I'm trying to understand the correctness of this patch, given that we plan to tag conda 23.1.0 on January 16. Is this ready? On a different note, there is a larger discussion we could have: now that this virtual package is implemented as a plugin, we could release this in a separate package.
@jezdez I think it's ready, but it depends on the opinions of the code reviewers. Let me summarize the changes in this PR:

Rationale
It feels really funky to unset the `CUDA_VISIBLE_DEVICES` environment variable. We could use NVML and avoid having to mess with the environment variable, and then decide if we want to use the subprocess to more nicely load and unload the library / isolate it from crashing the main conda process.
@kkraus14 I agree. I can open another PR to refactor with the NVML library. One concern is the version handling and the backward-compatibility guarantees of the NVML library functions. Different NVML libraries ship different symbols. Newer versioned functions may have extra arguments, and the new and old ones are sometimes not interchangeable. That may require extra maintenance work. For example, we would need to check whether the library is new enough to have the newer versioned functions (see the ctypes sketch after the header excerpt below):

#define NVML_INIT_FLAG_NO_GPUS 1 //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH 2 //!< Don't attach GPUs
nvmlReturn_t DECLDIR nvmlInit(void);
nvmlReturn_t DECLDIR nvmlInit_v2(void);
nvmlReturn_t DECLDIR nvmlInit_v3(unsigned int flags);
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);
nvmlReturn_t DECLDIR nvmlShutdown(void);
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);

nvml.h as of CUDA 11.8:

/***************************************************************************************************/
/** @defgroup nvmlInitializationAndCleanup Initialization and Cleanup
* This chapter describes the methods that handle NVML initialization and cleanup.
* It is the user's responsibility to call \ref nvmlInit_v2() before calling any other methods, and
* nvmlShutdown() once NVML is no longer being used.
* @{
*/
/***************************************************************************************************/
#define NVML_INIT_FLAG_NO_GPUS 1 //!< Don't fail nvmlInit() when no GPUs are found
#define NVML_INIT_FLAG_NO_ATTACH 2 //!< Don't attach GPUs
/**
* Initialize NVML, but don't initialize any GPUs yet.
*
* \note nvmlInit_v3 introduces a "flags" argument, that allows passing boolean values
* modifying the behaviour of nvmlInit().
* \note In NVML 5.319 new nvmlInit_v2 has replaced nvmlInit"_v1" (default in NVML 4.304 and older) that
* did initialize all GPU devices in the system.
*
* This allows NVML to communicate with a GPU
* when other GPUs in the system are unstable or in a bad state. When using this API, GPUs are
* discovered and initialized in nvmlDeviceGetHandleBy* functions instead.
*
* \note To contrast nvmlInit_v2 with nvmlInit"_v1", NVML 4.304 nvmlInit"_v1" will fail when any detected GPU is in
* a bad or unstable state.
*
* For all products.
*
* This method, should be called once before invoking any other methods in the library.
* A reference count of the number of initializations is maintained. Shutdown only occurs
* when the reference count reaches zero.
*
* @return
* - \ref NVML_SUCCESS if NVML has been properly initialized
* - \ref NVML_ERROR_DRIVER_NOT_LOADED if NVIDIA driver is not running
* - \ref NVML_ERROR_NO_PERMISSION if NVML does not have permission to talk to the driver
* - \ref NVML_ERROR_UNKNOWN on any unexpected error
*/
nvmlReturn_t DECLDIR nvmlInit_v2(void);
/**
* nvmlInitWithFlags is a variant of nvmlInit(), that allows passing a set of boolean values
* modifying the behaviour of nvmlInit().
* Other than the "flags" parameter it is completely similar to \ref nvmlInit_v2.
*
* For all products.
*
* @param flags behaviour modifier flags
*
* @return
* - \ref NVML_SUCCESS if NVML has been properly initialized
* - \ref NVML_ERROR_DRIVER_NOT_LOADED if NVIDIA driver is not running
* - \ref NVML_ERROR_NO_PERMISSION if NVML does not have permission to talk to the driver
* - \ref NVML_ERROR_UNKNOWN on any unexpected error
*/
nvmlReturn_t DECLDIR nvmlInitWithFlags(unsigned int flags);
/**
* Shut down NVML by releasing all GPU resources previously allocated with \ref nvmlInit_v2().
*
* For all products.
*
* This method should be called after NVML work is done, once for each call to \ref nvmlInit_v2()
* A reference count of the number of initializations is maintained. Shutdown only occurs
* when the reference count reaches zero. For backwards compatibility, no error is reported if
* nvmlShutdown() is called more times than nvmlInit().
*
* @return
* - \ref NVML_SUCCESS if NVML has been properly shut down
* - \ref NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
* - \ref NVML_ERROR_UNKNOWN on any unexpected error
*/
nvmlReturn_t DECLDIR nvmlShutdown(void);
/** @} */
/**
* Retrieves the version of the CUDA driver.
*
* For all products.
*
* The CUDA driver version returned will be retrieved from the currently installed version of CUDA.
* If the cuda library is not found, this function will return a known supported version number.
*
* @param cudaDriverVersion Reference in which to return the version identifier
*
* @return
* - \ref NVML_SUCCESS if \a cudaDriverVersion has been set
* - \ref NVML_ERROR_INVALID_ARGUMENT if \a cudaDriverVersion is NULL
*/
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion(int *cudaDriverVersion);
/**
* Retrieves the version of the CUDA driver from the shared library.
*
* For all products.
*
* The returned CUDA driver version by calling cuDriverGetVersion()
*
* @param cudaDriverVersion Reference in which to return the version identifier
*
* @return
* - \ref NVML_SUCCESS if \a cudaDriverVersion has been set
* - \ref NVML_ERROR_INVALID_ARGUMENT if \a cudaDriverVersion is NULL
* - \ref NVML_ERROR_LIBRARY_NOT_FOUND if \a libcuda.so.1 or libcuda.dll is not found
* - \ref NVML_ERROR_FUNCTION_NOT_FOUND if \a cuDriverGetVersion() is not found in the shared library
*/
nvmlReturn_t DECLDIR nvmlSystemGetCudaDriverVersion_v2(int *cudaDriverVersion);
/**
* Macros for converting the CUDA driver version number to Major and Minor version numbers.
*/
#define NVML_CUDA_DRIVER_VERSION_MAJOR(v) ((v)/1000)
#define NVML_CUDA_DRIVER_VERSION_MINOR(v) (((v)%1000)/10)
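As mentioned above, here is a sketch of what the symbol-availability check could look like through ctypes, combined with the major/minor conversion from the macros in the excerpt. The fallback chain is an assumption about how the follow-up might handle older NVML libraries, not a settled design:

import ctypes

libnvml = ctypes.CDLL("libnvidia-ml.so.1")

# ctypes raises AttributeError for a missing symbol, which gives a cheap
# way to prefer the versioned entry point and fall back to the old one.
try:
    nvml_init = libnvml.nvmlInit_v2
except AttributeError:
    nvml_init = libnvml.nvmlInit

if nvml_init() == 0:  # NVML_SUCCESS
    try:
        get_version = libnvml.nvmlSystemGetCudaDriverVersion_v2
    except AttributeError:
        get_version = libnvml.nvmlSystemGetCudaDriverVersion
    version = ctypes.c_int()
    if get_version(ctypes.byref(version)) == 0:
        # Same arithmetic as NVML_CUDA_DRIVER_VERSION_MAJOR/MINOR above.
        print(f"{version.value // 1000}.{(version.value % 1000) // 10}")
    libnvml.nvmlShutdown()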
Given that none of the older functions are deprecated or removed, we could just use them and hold off on any kind of versioning until needed. See my comment here: #11667 (comment). Driver 378, which was when
kkraus14 left a comment:
LGTM other than refactoring to use NVML instead of the driver library, but we can do that in a follow-up.
jezdez left a comment:
Some minor code fixes.
@XuehaiPan @kkraus14 Before I forget, do you feel like it's important to file a ticket for the refactoring to use the NVML library? Alternatively, does anyone know how likely it would be for us to extract the cuda virtual package into a separate repo? Any takers from Nvidia?
It's probably a good idea to have an issue. We could look into a plugin, but that won't be for a while. Highest priority for me is getting new packages together (conda-forge/staged-recipes#21382) and starting CUDA 12 builds.
NVML should be the way to go (e.g. see here). Sorry to see/raise this late, but this feels a bit awkward to me 😅 Like Keith said, the needed NVML symbols have existed for a very long time, so NVML should be preferred for detecting the driver version without initializing a CUDA context.
Description
Follow-up to #11626; see #11626 (comment).
The environment variable `CUDA_VISIBLE_DEVICES` can cause `libcuda.cuInit()` to fail (return non-zero). The CUDA driver detector detects the driver version in the current process, so a user-defined environment variable can make it fail. For example:

- An empty `CUDA_VISIBLE_DEVICES` will cause `libcuda.cuInit()` to return `CUDA_ERROR_NO_DEVICE` (100).
- An invalid `CUDA_VISIBLE_DEVICES` will cause `libcuda.cuInit()` to return `CUDA_ERROR_INVALID_DEVICE` (101).

In this PR, we detect the CUDA driver version in a spawned subprocess, which unsets the environment variable
`CUDA_VISIBLE_DEVICES` before calling `libcuda.cuInit()`.

Checklist - did you ...

- Add a news file to the `news` directory (using the template) for the next release's release notes?
- Add / update necessary tests?
- Add / update outdated documentation?