
[STF] Allow CUfunction/CUkernel (driver API) in the cuda_kernel(_chain) API #5215


Open
wants to merge 29 commits into
base: main

caugonnet
Contributor

@caugonnet caugonnet commented Jul 11, 2025

Description

Make it possible to pass a CUfunction to ctx.cuda_kernel and ctx.cuda_kernel_chain constructs

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet caugonnet requested a review from a team as a code owner July 11, 2025 11:56
@caugonnet caugonnet requested a review from fbusato July 11, 2025 11:56
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jul 11, 2025

copy-pr-bot bot commented Jul 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@caugonnet caugonnet self-assigned this Jul 11, 2025
@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jul 11, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jul 11, 2025
@caugonnet
Contributor Author

/ok to test b3304a1

@caugonnet
Contributor Author

/ok to test 8651e9f

@caugonnet
Contributor Author

/ok to test 0a0cd17

Contributor

🟩 CI finished in 33m 48s: Pass: 100%/32 | Total: 6h 58m | Avg: 13m 05s | Max: 28m 23s | Hits: 75%/16246
  • 🟩 cudax: Pass: 100%/28 | Total: 6h 44m | Avg: 14m 25s | Max: 28m 23s | Hits: 75%/16246

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  5h 49m | Avg: 14m 32s | Max: 28m 23s | Hits:  76%/13754 
      🟩 arm64              Pass: 100%/4   | Total: 55m 02s | Avg: 13m 45s | Max: 14m 55s | Hits:  70%/2492  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 37m 04s | Avg: 12m 21s | Max: 13m 17s | Hits:  76%/1568  
      🟩 12.9               Pass: 100%/25  | Total:  6h 07m | Avg: 14m 40s | Max: 28m 23s | Hits:  75%/14678 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 37m 04s | Avg: 12m 21s | Max: 13m 17s | Hits:  76%/1568  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 07m | Avg: 14m 40s | Max: 28m 23s | Hits:  75%/14678 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  6h 44m | Avg: 14m 25s | Max: 28m 23s | Hits:  75%/16246 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 25m 15s | Avg: 12m 37s | Max: 13m 28s | Hits:  71%/1248  
      🟩 Clang15            Pass: 100%/1   | Total: 15m 30s | Avg: 15m 30s | Max: 15m 30s | Hits:  70%/623   
      🟩 Clang16            Pass: 100%/1   | Total: 14m 46s | Avg: 14m 46s | Max: 14m 46s | Hits:  70%/623   
      🟩 Clang17            Pass: 100%/1   | Total: 14m 37s | Avg: 14m 37s | Max: 14m 37s | Hits:  70%/623   
      🟩 Clang18            Pass: 100%/1   | Total: 14m 37s | Avg: 14m 37s | Max: 14m 37s | Hits:  70%/623   
      🟩 Clang19            Pass: 100%/4   | Total: 49m 35s | Avg: 12m 23s | Max: 15m 30s | Hits:  78%/2492  
      🟩 GCC10              Pass: 100%/2   | Total: 28m 18s | Avg: 14m 09s | Max: 15m 01s | Hits:  70%/1248  
      🟩 GCC11              Pass: 100%/1   | Total: 17m 04s | Avg: 17m 04s | Max: 17m 04s | Hits:  70%/623   
      🟩 GCC12              Pass: 100%/1   | Total: 16m 58s | Avg: 16m 58s | Max: 16m 58s | Hits:  70%/623   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 45m | Avg: 13m 11s | Max: 16m 49s | Hits:  77%/4984  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 00s | Avg: 12m 00s | Max: 12m 00s | Hits:  95%/322   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 34m 47s | Avg: 11m 35s | Max: 12m 35s | Hits:  95%/972   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 55m 05s | Avg: 27m 32s | Max: 28m 23s | Hits:  68%/1242  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 14m | Avg: 13m 26s | Max: 15m 30s | Hits:  73%/6232  
      🟩 GCC                Pass: 100%/12  | Total:  2h 47m | Avg: 13m 59s | Max: 17m 04s | Hits:  75%/7478  
      🟩 MSVC               Pass: 100%/4   | Total: 46m 47s | Avg: 11m 41s | Max: 12m 35s | Hits:  95%/1294  
      🟩 NVHPC              Pass: 100%/2   | Total: 55m 05s | Avg: 27m 32s | Max: 28m 23s | Hits:  68%/1242  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 20m 27s | Avg: 10m 13s | Max: 11m 19s | Hits:  85%/1246  
      🟩 rtx2080            Pass: 100%/26  | Total:  6h 23m | Avg: 14m 45s | Max: 28m 23s | Hits:  75%/15000 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 16m | Avg: 15m 03s | Max: 28m 23s | Hits:  72%/14377 
      🟩 Test               Pass: 100%/3   | Total: 27m 29s | Avg:  9m 09s | Max:  9m 54s | Hits:  99%/1869  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 20m 27s | Avg: 10m 13s | Max: 11m 19s | Hits:  85%/1246  
      🟩 90;90a             Pass: 100%/2   | Total: 24m 27s | Avg: 12m 13s | Max: 13m 45s | Hits:  79%/947   
      🟩 100;120            Pass: 100%/2   | Total: 26m 46s | Avg: 13m 23s | Max: 15m 16s | Hits:  79%/947   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 53m 26s | Avg: 17m 48s | Max: 26m 42s | Hits:  69%/1867  
      🟩 20                 Pass: 100%/25  | Total:  5h 50m | Avg: 14m 01s | Max: 28m 23s | Hits:  76%/14379 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 14m 47s | Avg: 3m 41s | Max: 3m 48s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 14m 47s | Avg:  3m 41s | Max:  3m 48s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  7m 17s | Avg:  3m 38s | Max:  3m 40s
      🟩 12.9               Pass: 100%/2   | Total:  7m 30s | Avg:  3m 45s | Max:  3m 48s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  7m 17s | Avg:  3m 38s | Max:  3m 40s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 30s | Avg:  3m 45s | Max:  3m 48s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 14m 47s | Avg:  3m 41s | Max:  3m 48s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 40s | Avg:  3m 40s | Max:  3m 40s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 48s | Avg:  3m 48s | Max:  3m 48s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 37s | Avg:  3m 37s | Max:  3m 37s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 42s | Avg:  3m 42s | Max:  3m 42s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  7m 28s | Avg:  3m 44s | Max:  3m 48s
      🟩 GCC                Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 42s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 14m 47s | Avg:  3m 41s | Max:  3m 48s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 14m 47s | Avg:  3m 41s | Max:  3m 48s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
CUB
Thrust
+/- CUDA Experimental
stdpar
python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 32)

# Runner
17 linux-amd64-cpu16
6 linux-amd64-gpu-rtx2080-latest-1
4 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1

@caugonnet caugonnet enabled auto-merge (squash) July 11, 2025 12:52
@caugonnet caugonnet disabled auto-merge July 11, 2025 13:45
@caugonnet
Contributor Author

/ok to test fd42c70

@caugonnet caugonnet enabled auto-merge (squash) July 11, 2025 14:11
@caugonnet
Contributor Author

/ok to test e12de1c

Contributor

🟩 CI finished in 28m 23s: Pass: 100%/32 | Total: 7h 07m | Avg: 13m 21s | Max: 28m 19s | Hits: 75%/16246
  • 🟩 cudax: Pass: 100%/28 | Total: 6h 52m | Avg: 14m 44s | Max: 28m 19s | Hits: 75%/16246

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  5h 56m | Avg: 14m 51s | Max: 28m 19s | Hits:  76%/13754 
      🟩 arm64              Pass: 100%/4   | Total: 56m 06s | Avg: 14m 01s | Max: 15m 34s | Hits:  70%/2492  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 36m 58s | Avg: 12m 19s | Max: 13m 16s | Hits:  76%/1568  
      🟩 12.9               Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 28m 19s | Hits:  75%/14678 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 36m 58s | Avg: 12m 19s | Max: 13m 16s | Hits:  76%/1568  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 28m 19s | Hits:  75%/14678 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  6h 52m | Avg: 14m 44s | Max: 28m 19s | Hits:  75%/16246 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 27m 14s | Avg: 13m 37s | Max: 15m 23s | Hits:  71%/1248  
      🟩 Clang15            Pass: 100%/1   | Total: 15m 56s | Avg: 15m 56s | Max: 15m 56s | Hits:  70%/623   
      🟩 Clang16            Pass: 100%/1   | Total: 14m 38s | Avg: 14m 38s | Max: 14m 38s | Hits:  70%/623   
      🟩 Clang17            Pass: 100%/1   | Total: 15m 26s | Avg: 15m 26s | Max: 15m 26s | Hits:  70%/623   
      🟩 Clang18            Pass: 100%/1   | Total: 15m 20s | Avg: 15m 20s | Max: 15m 20s | Hits:  70%/623   
      🟩 Clang19            Pass: 100%/4   | Total: 50m 25s | Avg: 12m 36s | Max: 15m 40s | Hits:  78%/2492  
      🟩 GCC10              Pass: 100%/2   | Total: 30m 08s | Avg: 15m 04s | Max: 16m 52s | Hits:  70%/1248  
      🟩 GCC11              Pass: 100%/1   | Total: 16m 13s | Avg: 16m 13s | Max: 16m 13s | Hits:  70%/623   
      🟩 GCC12              Pass: 100%/1   | Total: 17m 11s | Avg: 17m 11s | Max: 17m 11s | Hits:  70%/623   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 47m | Avg: 13m 22s | Max: 17m 15s | Hits:  77%/4984  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 51s | Avg: 11m 51s | Max: 11m 51s | Hits:  95%/322   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 36m 06s | Avg: 12m 02s | Max: 12m 33s | Hits:  95%/972   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 55m 08s | Avg: 27m 34s | Max: 28m 19s | Hits:  68%/1242  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 18m | Avg: 13m 53s | Max: 15m 56s | Hits:  73%/6232  
      🟩 GCC                Pass: 100%/12  | Total:  2h 50m | Avg: 14m 12s | Max: 17m 15s | Hits:  75%/7478  
      🟩 MSVC               Pass: 100%/4   | Total: 47m 57s | Avg: 11m 59s | Max: 12m 33s | Hits:  95%/1294  
      🟩 NVHPC              Pass: 100%/2   | Total: 55m 08s | Avg: 27m 34s | Max: 28m 19s | Hits:  68%/1242  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 20m 53s | Avg: 10m 26s | Max: 11m 16s | Hits:  85%/1246  
      🟩 rtx2080            Pass: 100%/26  | Total:  6h 31m | Avg: 15m 04s | Max: 28m 19s | Hits:  75%/15000 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 24m | Avg: 15m 23s | Max: 28m 19s | Hits:  72%/14377 
      🟩 Test               Pass: 100%/3   | Total: 27m 57s | Avg:  9m 19s | Max:  9m 55s | Hits:  99%/1869  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 20m 53s | Avg: 10m 26s | Max: 11m 16s | Hits:  85%/1246  
      🟩 90;90a             Pass: 100%/2   | Total: 25m 46s | Avg: 12m 53s | Max: 14m 22s | Hits:  79%/947   
      🟩 100;120            Pass: 100%/2   | Total: 27m 23s | Avg: 13m 41s | Max: 14m 50s | Hits:  79%/947   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 53m 53s | Avg: 17m 57s | Max: 26m 49s | Hits:  69%/1867  
      🟩 20                 Pass: 100%/25  | Total:  5h 58m | Avg: 14m 20s | Max: 28m 19s | Hits:  76%/14379 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 15m 01s | Avg: 3m 45s | Max: 4m 01s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 15m 01s | Avg:  3m 45s | Max:  4m 01s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  7m 06s | Avg:  3m 33s | Max:  3m 34s
      🟩 12.9               Pass: 100%/2   | Total:  7m 55s | Avg:  3m 57s | Max:  4m 01s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  7m 06s | Avg:  3m 33s | Max:  3m 34s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 55s | Avg:  3m 57s | Max:  4m 01s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 15m 01s | Avg:  3m 45s | Max:  4m 01s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 32s | Avg:  3m 32s | Max:  3m 32s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 54s | Avg:  3m 54s | Max:  3m 54s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 34s | Avg:  3m 34s | Max:  3m 34s
      🟩 GCC13              Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  7m 26s | Avg:  3m 43s | Max:  3m 54s
      🟩 GCC                Pass: 100%/2   | Total:  7m 35s | Avg:  3m 47s | Max:  4m 01s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 15m 01s | Avg:  3m 45s | Max:  4m 01s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 15m 01s | Avg:  3m 45s | Max:  4m 01s
    

@caugonnet
Contributor Author

/ok to test 74a4577

Contributor

🟩 CI finished in 14m 51s: Pass: 100%/32 | Total: 2h 46m | Avg: 5m 12s | Max: 11m 31s | Hits: 98%/15930
  • 🟩 cudax: Pass: 100%/28 | Total: 2h 30m | Avg: 5m 23s | Max: 11m 31s | Hits: 98%/15930

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  2h 18m | Avg:  5m 45s | Max: 11m 31s | Hits:  97%/13482 
      🟩 arm64              Pass: 100%/4   | Total: 12m 48s | Avg:  3m 12s | Max:  3m 34s | Hits:  99%/2448  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max: 10m 33s | Hits:  85%/1533  
      🟩 12.9               Pass: 100%/25  | Total:  2h 14m | Avg:  5m 21s | Max: 11m 31s | Hits:  99%/14397 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max: 10m 33s | Hits:  85%/1533  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  2h 14m | Avg:  5m 21s | Max: 11m 31s | Hits:  99%/14397 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  2h 30m | Avg:  5m 23s | Max: 11m 31s | Hits:  98%/15930 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total:  6m 31s | Avg:  3m 15s | Max:  3m 28s | Hits:  91%/1226  
      🟩 Clang15            Pass: 100%/1   | Total:  3m 24s | Avg:  3m 24s | Max:  3m 24s | Hits: 100%/612   
      🟩 Clang16            Pass: 100%/1   | Total:  3m 27s | Avg:  3m 27s | Max:  3m 27s | Hits: 100%/612   
      🟩 Clang17            Pass: 100%/1   | Total:  3m 22s | Avg:  3m 22s | Max:  3m 22s | Hits: 100%/612   
      🟩 Clang18            Pass: 100%/1   | Total:  3m 26s | Avg:  3m 26s | Max:  3m 26s | Hits: 100%/612   
      🟩 Clang19            Pass: 100%/4   | Total: 17m 45s | Avg:  4m 26s | Max:  8m 31s | Hits: 100%/2448  
      🟩 GCC10              Pass: 100%/2   | Total:  7m 13s | Avg:  3m 36s | Max:  3m 52s | Hits:  91%/1226  
      🟩 GCC11              Pass: 100%/1   | Total:  4m 04s | Avg:  4m 04s | Max:  4m 04s | Hits:  99%/612   
      🟩 GCC12              Pass: 100%/1   | Total:  3m 57s | Avg:  3m 57s | Max:  3m 57s | Hits:  99%/612   
      🟩 GCC13              Pass: 100%/8   | Total: 40m 49s | Avg:  5m 06s | Max:  9m 49s | Hits:  99%/4896  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 10m 33s | Avg: 10m 33s | Max: 10m 33s | Hits:  95%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 32m 43s | Avg: 10m 54s | Max: 11m 31s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 13m 44s | Avg:  6m 52s | Max:  7m 04s | Hits:  97%/1220  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total: 37m 55s | Avg:  3m 47s | Max:  8m 31s | Hits:  98%/6122  
      🟩 GCC                Pass: 100%/12  | Total: 56m 03s | Avg:  4m 40s | Max:  9m 49s | Hits:  98%/7346  
      🟩 MSVC               Pass: 100%/4   | Total: 43m 16s | Avg: 10m 49s | Max: 11m 31s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 13m 44s | Avg:  6m 52s | Max:  7m 04s | Hits:  97%/1220  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 12m 23s | Avg:  6m 11s | Max:  9m 06s | Hits:  99%/1224  
      🟩 rtx2080            Pass: 100%/26  | Total:  2h 18m | Avg:  5m 19s | Max: 11m 31s | Hits:  97%/14706 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  2h 03m | Avg:  4m 56s | Max: 11m 31s | Hits:  97%/14094 
      🟩 Test               Pass: 100%/3   | Total: 27m 26s | Avg:  9m 08s | Max:  9m 49s | Hits:  99%/1836  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 12m 23s | Avg:  6m 11s | Max:  9m 06s | Hits:  99%/1224  
      🟩 90;90a             Pass: 100%/2   | Total: 13m 49s | Avg:  6m 54s | Max:  9m 53s | Hits:  98%/923   
      🟩 100;120            Pass: 100%/2   | Total: 15m 06s | Avg:  7m 33s | Max: 11m 31s | Hits:  98%/923   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 13m 24s | Avg:  4m 28s | Max:  7m 04s | Hits:  99%/1834  
      🟩 20                 Pass: 100%/25  | Total:  2h 17m | Avg:  5m 30s | Max: 11m 31s | Hits:  97%/14096 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 15m 34s | Avg: 3m 53s | Max: 4m 33s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 15m 34s | Avg:  3m 53s | Max:  4m 33s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  8m 12s | Avg:  4m 06s | Max:  4m 33s
      🟩 12.9               Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  3m 58s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  8m 12s | Avg:  4m 06s | Max:  4m 33s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 22s | Avg:  3m 41s | Max:  3m 58s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 15m 34s | Avg:  3m 53s | Max:  4m 33s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  4m 33s | Avg:  4m 33s | Max:  4m 33s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 58s | Avg:  3m 58s | Max:  3m 58s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 39s | Avg:  3m 39s | Max:  3m 39s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 24s | Avg:  3m 24s | Max:  3m 24s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  8m 31s | Avg:  4m 15s | Max:  4m 33s
      🟩 GCC                Pass: 100%/2   | Total:  7m 03s | Avg:  3m 31s | Max:  3m 39s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 15m 34s | Avg:  3m 53s | Max:  4m 33s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 15m 34s | Avg:  3m 53s | Max:  4m 33s
    

@caugonnet caugonnet changed the title [STF] Allow CUfunction (driver API) in the cuda_kernel(_chain) API [STF] Allow CUfunction/CUkernel (driver API) in the cuda_kernel(_chain) API Jul 19, 2025
@andralex
Contributor

Made a pass - thanks @caugonnet and thanks @davebayer for the good points

@caugonnet
Contributor Author

/ok to test 76a9a45

@caugonnet
Contributor Author

/ok to test 01d638f

Contributor

🟩 CI finished in 31m 09s: Pass: 100%/32 | Total: 7h 06m | Avg: 13m 19s | Max: 30m 47s | Hits: 75%/15930
  • 🟩 cudax: Pass: 100%/28 | Total: 6h 52m | Avg: 14m 44s | Max: 30m 47s | Hits: 75%/15930

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  5h 57m | Avg: 14m 53s | Max: 30m 47s | Hits:  75%/13482 
      🟩 arm64              Pass: 100%/4   | Total: 55m 09s | Avg: 13m 47s | Max: 15m 11s | Hits:  70%/2448  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 36m 55s | Avg: 12m 18s | Max: 14m 27s | Hits:  75%/1533  
      🟩 12.9               Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 30m 47s | Hits:  74%/14397 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 36m 55s | Avg: 12m 18s | Max: 14m 27s | Hits:  75%/1533  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 30m 47s | Hits:  74%/14397 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  6h 52m | Avg: 14m 44s | Max: 30m 47s | Hits:  75%/15930 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 25m 55s | Avg: 12m 57s | Max: 14m 31s | Hits:  70%/1226  
      🟩 Clang15            Pass: 100%/1   | Total: 15m 15s | Avg: 15m 15s | Max: 15m 15s | Hits:  70%/612   
      🟩 Clang16            Pass: 100%/1   | Total: 15m 35s | Avg: 15m 35s | Max: 15m 35s | Hits:  70%/612   
      🟩 Clang17            Pass: 100%/1   | Total: 15m 04s | Avg: 15m 04s | Max: 15m 04s | Hits:  70%/612   
      🟩 Clang18            Pass: 100%/1   | Total: 14m 36s | Avg: 14m 36s | Max: 14m 36s | Hits:  70%/612   
      🟩 Clang19            Pass: 100%/4   | Total: 48m 39s | Avg: 12m 09s | Max: 14m 40s | Hits:  77%/2448  
      🟩 GCC10              Pass: 100%/2   | Total: 30m 46s | Avg: 15m 23s | Max: 16m 19s | Hits:  70%/1226  
      🟩 GCC11              Pass: 100%/1   | Total: 15m 10s | Avg: 15m 10s | Max: 15m 10s | Hits:  69%/612   
      🟩 GCC12              Pass: 100%/1   | Total: 17m 38s | Avg: 17m 38s | Max: 17m 38s | Hits:  69%/612   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 49m | Avg: 13m 39s | Max: 18m 04s | Hits:  76%/4896  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 04s | Avg: 11m 04s | Max: 11m 04s | Hits:  95%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 33m 59s | Avg: 11m 19s | Max: 11m 37s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 59m 40s | Avg: 29m 50s | Max: 30m 47s | Hits:  67%/1220  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 15m | Avg: 13m 30s | Max: 15m 35s | Hits:  73%/6122  
      🟩 GCC                Pass: 100%/12  | Total:  2h 52m | Avg: 14m 24s | Max: 18m 04s | Hits:  74%/7346  
      🟩 MSVC               Pass: 100%/4   | Total: 45m 03s | Avg: 11m 15s | Max: 11m 37s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 59m 40s | Avg: 29m 50s | Max: 30m 47s | Hits:  67%/1220  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 22m 43s | Avg: 11m 21s | Max: 11m 25s | Hits:  80%/1224  
      🟩 rtx2080            Pass: 100%/26  | Total:  6h 29m | Avg: 14m 59s | Max: 30m 47s | Hits:  74%/14706 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 22m | Avg: 15m 16s | Max: 30m 47s | Hits:  72%/14094 
      🟩 Test               Pass: 100%/3   | Total: 30m 38s | Avg: 10m 12s | Max: 11m 19s | Hits:  96%/1836  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 22m 43s | Avg: 11m 21s | Max: 11m 25s | Hits:  80%/1224  
      🟩 90;90a             Pass: 100%/2   | Total: 24m 31s | Avg: 12m 15s | Max: 13m 42s | Hits:  78%/923   
      🟩 100;120            Pass: 100%/2   | Total: 25m 56s | Avg: 12m 58s | Max: 14m 19s | Hits:  78%/923   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 57m 52s | Avg: 19m 17s | Max: 30m 47s | Hits:  69%/1834  
      🟩 20                 Pass: 100%/25  | Total:  5h 54m | Avg: 14m 11s | Max: 28m 53s | Hits:  75%/14096 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 13m 53s | Avg: 3m 28s | Max: 3m 42s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 13m 53s | Avg:  3m 28s | Max:  3m 42s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  6m 52s | Avg:  3m 26s | Max:  3m 27s
      🟩 12.9               Pass: 100%/2   | Total:  7m 01s | Avg:  3m 30s | Max:  3m 42s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  6m 52s | Avg:  3m 26s | Max:  3m 27s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 01s | Avg:  3m 30s | Max:  3m 42s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 13m 53s | Avg:  3m 28s | Max:  3m 42s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 27s | Avg:  3m 27s | Max:  3m 27s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 42s | Avg:  3m 42s | Max:  3m 42s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 25s | Avg:  3m 25s | Max:  3m 25s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 19s | Avg:  3m 19s | Max:  3m 19s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  7m 09s | Avg:  3m 34s | Max:  3m 42s
      🟩 GCC                Pass: 100%/2   | Total:  6m 44s | Avg:  3m 22s | Max:  3m 25s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 13m 53s | Avg:  3m 28s | Max:  3m 42s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 13m 53s | Avg:  3m 28s | Max:  3m 42s
    

@caugonnet
Contributor Author

/ok to test 4c8cfaf

@caugonnet caugonnet requested a review from davebayer July 23, 2025 15:22
Contributor

🟩 CI finished in 31m 42s: Pass: 100%/32 | Total: 7h 07m | Avg: 13m 22s | Max: 29m 09s | Hits: 75%/15930
  • 🟩 cudax: Pass: 100%/28 | Total: 6h 54m | Avg: 14m 47s | Max: 29m 09s | Hits: 75%/15930

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  5h 58m | Avg: 14m 55s | Max: 29m 09s | Hits:  76%/13482 
      🟩 arm64              Pass: 100%/4   | Total: 55m 47s | Avg: 13m 56s | Max: 16m 08s | Hits:  70%/2448  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 38m 13s | Avg: 12m 44s | Max: 14m 30s | Hits:  75%/1533  
      🟩 12.9               Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 29m 09s | Hits:  75%/14397 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 38m 13s | Avg: 12m 44s | Max: 14m 30s | Hits:  75%/1533  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  6h 15m | Avg: 15m 01s | Max: 29m 09s | Hits:  75%/14397 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  6h 54m | Avg: 14m 47s | Max: 29m 09s | Hits:  75%/15930 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total: 27m 00s | Avg: 13m 30s | Max: 14m 54s | Hits:  70%/1226  
      🟩 Clang15            Pass: 100%/1   | Total: 14m 38s | Avg: 14m 38s | Max: 14m 38s | Hits:  70%/612   
      🟩 Clang16            Pass: 100%/1   | Total: 15m 54s | Avg: 15m 54s | Max: 15m 54s | Hits:  70%/612   
      🟩 Clang17            Pass: 100%/1   | Total: 15m 04s | Avg: 15m 04s | Max: 15m 04s | Hits:  70%/612   
      🟩 Clang18            Pass: 100%/1   | Total: 14m 57s | Avg: 14m 57s | Max: 14m 57s | Hits:  70%/612   
      🟩 Clang19            Pass: 100%/4   | Total: 49m 03s | Avg: 12m 15s | Max: 14m 57s | Hits:  77%/2448  
      🟩 GCC10              Pass: 100%/2   | Total: 29m 37s | Avg: 14m 48s | Max: 15m 07s | Hits:  70%/1226  
      🟩 GCC11              Pass: 100%/1   | Total: 16m 54s | Avg: 16m 54s | Max: 16m 54s | Hits:  69%/612   
      🟩 GCC12              Pass: 100%/1   | Total: 19m 20s | Avg: 19m 20s | Max: 19m 20s | Hits:  69%/612   
      🟩 GCC13              Pass: 100%/8   | Total:  1h 48m | Avg: 13m 33s | Max: 16m 53s | Hits:  77%/4896  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 37s | Avg: 11m 37s | Max: 11m 37s | Hits:  96%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 34m 01s | Avg: 11m 20s | Max: 11m 32s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 57m 25s | Avg: 28m 42s | Max: 29m 09s | Hits:  67%/1220  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total:  2h 16m | Avg: 13m 39s | Max: 15m 54s | Hits:  73%/6122  
      🟩 GCC                Pass: 100%/12  | Total:  2h 54m | Avg: 14m 31s | Max: 19m 20s | Hits:  74%/7346  
      🟩 MSVC               Pass: 100%/4   | Total: 45m 38s | Avg: 11m 24s | Max: 11m 37s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 57m 25s | Avg: 28m 42s | Max: 29m 09s | Hits:  67%/1220  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 20m 31s | Avg: 10m 15s | Max: 11m 26s | Hits:  84%/1224  
      🟩 rtx2080            Pass: 100%/26  | Total:  6h 33m | Avg: 15m 08s | Max: 29m 09s | Hits:  74%/14706 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  6h 24m | Avg: 15m 22s | Max: 29m 09s | Hits:  72%/14094 
      🟩 Test               Pass: 100%/3   | Total: 29m 28s | Avg:  9m 49s | Max: 12m 18s | Hits:  99%/1836  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 20m 31s | Avg: 10m 15s | Max: 11m 26s | Hits:  84%/1224  
      🟩 90;90a             Pass: 100%/2   | Total: 25m 44s | Avg: 12m 52s | Max: 14m 37s | Hits:  78%/923   
      🟩 100;120            Pass: 100%/2   | Total: 25m 48s | Avg: 12m 54s | Max: 14m 26s | Hits:  78%/923   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 55m 37s | Avg: 18m 32s | Max: 29m 09s | Hits:  69%/1834  
      🟩 20                 Pass: 100%/25  | Total:  5h 58m | Avg: 14m 20s | Max: 28m 16s | Hits:  76%/14096 
    
  • 🟩 packaging: Pass: 100%/4 | Total: 13m 46s | Avg: 3m 26s | Max: 3m 35s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 35s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  6m 44s | Avg:  3m 22s | Max:  3m 28s
      🟩 12.9               Pass: 100%/2   | Total:  7m 02s | Avg:  3m 31s | Max:  3m 35s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  6m 44s | Avg:  3m 22s | Max:  3m 28s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  7m 02s | Avg:  3m 31s | Max:  3m 35s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 35s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  3m 28s | Avg:  3m 28s | Max:  3m 28s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 27s | Avg:  3m 27s | Max:  3m 27s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 16s | Avg:  3m 16s | Max:  3m 16s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 35s | Avg:  3m 35s | Max:  3m 35s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  6m 55s | Avg:  3m 27s | Max:  3m 28s
      🟩 GCC                Pass: 100%/2   | Total:  6m 51s | Avg:  3m 25s | Max:  3m 35s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 35s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 35s
    

{

template <typename T>
inline constexpr bool is_function_or_kernel_v = ::std::is_same_v<T, CUfunction> || ::std::is_same_v<T, CUkernel>;
Contributor

Q: wouldn't it make sense to call it is_cufunction_or_cukernel_v?
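
A minimal sketch of what the trait could look like under the suggested name. The handle types below are stand-ins declared locally so the snippet compiles without `<cuda.h>`, and `accepts_driver_handle` is a hypothetical helper added only to show how the predicate might be used; neither is taken from the PR.

```cpp
#include <type_traits>

// Stand-ins for the driver API handle types (assumption: declared here only so
// the sketch builds without CUDA; the real definitions come from <cuda.h>).
struct CUfunc_st;
struct CUkern_st;
using CUfunction = CUfunc_st*;
using CUkernel   = CUkern_st*;

// Same predicate as in the diff, under the name proposed in the review comment.
template <typename T>
inline constexpr bool is_cufunction_or_cukernel_v =
  ::std::is_same_v<T, CUfunction> || ::std::is_same_v<T, CUkernel>;

// Hypothetical helper: reports whether an argument is one of the two handle types.
// The parameter is taken by value, so top-level const is dropped during deduction.
template <typename Fun>
constexpr bool accepts_driver_handle(Fun)
{
  return is_cufunction_or_cukernel_v<Fun>;
}
```

Because the predicate is an exact `is_same_v` match, a cv-qualified type such as `const CUfunction` does not satisfy it unless the caller strips qualifiers first.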

Comment on lines +183 to +189
auto* ker_ptr = ::std::get_if<CUfunction>(&func_variant);
if (!ker_ptr)
{
  // If this is a CUkernel, the cast to a CUfunction is sufficient
  ker_ptr = reinterpret_cast<const CUfunction*>(::std::get_if<CUkernel>(&func_variant));
}
return cuda_try<cuFuncGetAttribute>(CU_FUNC_ATTRIBUTE_NUM_REGS, *ker_ptr);
Contributor

Honestly, I am not sure whether this is correct. The documentation explicitly says that cuLaunchKernel can be used with (CUfunction)cukernel, but there is no such note for cuFuncGetAttribute, especially since there is a cuKernelGetAttribute function that takes CUkernel + CUdevice arguments.

Contributor Author

I believe that a CUkernel is really converted into the same thing as a CUfunction, so you can call the CUfunction functions on it; the cast might do the work of getting the underlying current device.

Contributor

I have serious doubts here. Why would functions like cuKernelGetFunction or cuKernelSetAttribute exist then? They may well do exactly what you describe, but I believe the right thing is to use the cuKernelXxx functions for CUkernel and the cuFuncXxx functions for CUfunction.

If you don't want to use them, you should verify this internally :)
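
A rough sketch of the per-type dispatch suggested above, using `std::visit` instead of a `reinterpret_cast`. Everything here is a stand-in so the snippet runs without CUDA: the handle structs, `stub_func_num_regs`, `stub_kernel_num_regs`, `func_variant_t` and `get_num_regs` are hypothetical names, and the two stubs only mark where the cuFuncGetAttribute and cuKernelGetAttribute calls would go.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>

// Stand-in handle types (assumption: the real ones are opaque and come from <cuda.h>).
struct CUfunc_st { int num_regs; };
struct CUkern_st { int num_regs; };
using CUfunction = CUfunc_st*;
using CUkernel   = CUkern_st*;
using CUdevice   = int;

// Hypothetical stubs marking where the driver queries would be issued:
// the CUfunction path needs no device, the CUkernel path takes one explicitly.
inline int stub_func_num_regs(CUfunction f) { return f->num_regs; }
inline int stub_kernel_num_regs(CUkernel k, CUdevice /*dev*/) { return k->num_regs; }

using func_variant_t = ::std::variant<CUfunction, CUkernel>;

// Pick the query that matches the handle type actually stored in the variant.
inline int get_num_regs(const func_variant_t& fn, CUdevice dev)
{
  return ::std::visit(
    [dev](auto handle) -> int {
      assert(handle != nullptr);
      using handle_t = decltype(handle);
      if constexpr (::std::is_same_v<handle_t, CUkernel>)
      {
        return stub_kernel_num_regs(handle, dev);
      }
      else
      {
        static_assert(::std::is_same_v<handle_t, CUfunction>);
        return stub_func_num_regs(handle);
      }
    },
    fn);
}
```

With this shape the CUkernel path has to be handed a device explicitly, which is the point raised in the replies about not necessarily being in the right device context.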

Contributor Author

Because you are not necessarily in the context of the device for which you are trying to extract the CUfunction.

Contributor Author

@davebayer What I'll do then is add a new unit test, checking for example whether the driver API and the runtime API report the same number.

Contributor

That's all possible. My only concern is: are we sure this won't change in the future?

Labels
stf Sequential Task Flow programming model
Projects
Status: In Review
4 participants