# cudnn frontend v1.13.0 (#150)

cudnn frontend v1.13 is the preferred cudnn frontend version for [cudnn version 9.11.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-11-0) and above.

- Introduced a device descriptor, which allows device-less compilation of a cudnn graph for a target GPU. See the newly added [sample](samples/cpp/misc/deviceless_aot_compilation.cpp) and documentation.
- Introduced `generate_stats` as an alias for `is_inference`. `generate_stats` now controls the stats tensor dump; `is_inference` is deprecated. A hedged migration sketch follows these notes.
- Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.
- Improved error handling for large head dimension (d > 128) in sdpa bprop.
- Added support for fused Layernorm with Relu, along with a sample for the [Layernorm with relu bitmask dump](samples/cpp/norm/layernorm_bitmask_relu.cpp).
- Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.
- Enabled int4 weight-only quantization for matmul. See the [example](samples/cpp/int4_woq_matmul.cpp).
- Allowed block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.
- Allowed reductions to accept `deterministic` as an attribute.
- Added pybinds for block scale dequantize.
- Fixed the sliding window `attn_score_modifier` function so that it can set true negative infinity.
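The `generate_stats` rename is a one-line change at the call site. Below is a minimal sketch, assuming `SDPA_attributes` exposes a `set_generate_stats` setter that mirrors the existing `set_is_inference`; the setter name is inferred from the note above, not verified against the headers.

```
// Minimal sketch, assuming a set_generate_stats() setter mirroring the
// deprecated set_is_inference(); the setter name is an assumption.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void configure_sdpa(fe::graph::SDPA_attributes& attrs, bool is_training) {
    // Previously: attrs.set_is_inference(!is_training);
    // generate_stats == true requests the softmax stats tensor dump
    // needed by the backward pass; false skips it for inference.
    attrs.set_generate_stats(is_training);  // assumed setter name
}
```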
# Benchmark results for sdpa operation on Blackwell (#146)

The benchmarking script in this directory profiles scaled dot product attention (SDPA) from various backends. Here we benchmark attention layer dimensions inspired by Llama-3.1-405B, with sequence lengths ranging from 512 to 131,072.
# v1.12.0 release (#141)

## cudnn frontend v1.12 release notes

cudnn frontend v1.12 is the preferred cudnn frontend version for cudnn version 9.9.0 and above, and is the minimum cudnn frontend version required to work with cuda 13.0 and above. Updated the dlpack version and raised the CMake minimum required version to 3.18.

## New API

- Allows compilation and loading of cudnn frontend with cudnn-jit packages.
- Introduces the Adaptive Layernorm (fprop and bprop) operation in cudnn:

```
std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm(std::shared_ptr<Tensor_attributes>& input,
             std::shared_ptr<Tensor_attributes>& scale,
             std::shared_ptr<Tensor_attributes>& bias,
             AdaLayernorm_attributes attributes);

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm_backward(std::shared_ptr<Tensor_attributes> dy,
                      std::shared_ptr<Tensor_attributes> x,
                      std::shared_ptr<Tensor_attributes> scale,
                      AdaLayernorm_backward_attributes options);
```

  Please refer to the [samples](samples/cpp/norm/adaptive_layernorm.cpp) for usage. A hedged graph-construction sketch also follows these notes.
- The cudnn frontend python API introduces two decorator functions, `cudnn.jit` and `cudnn.graph`, for simpler graph creation in python. Refer to the [matmul sample](samples/python/01_matmul_bias.ipynb) for usage.

## Improvements

### SDPA

- Allows large embedded dimension (d > 128) for fprop across Ampere, Hopper, and Blackwell architectures for bf16/fp16.
- Added better validation checks for sliding window attention for cudnn version 9.9.0 and below.
- Sliding window attention now supports cases where s_q > s_kv.
- The sdpa_fp8 operation now pads correctly with negative infinity on the masking operation rather than a large negative value. This improves the numerical stability of the sdpa operation with the fp8 data type.
- Paged attention now supports page tables in a packed format.

### Normalizations

- Allow zero-centered scale in layer norm. Refer to this [sample](samples/cpp/norm/norm_zero_centered_gamma.cpp) for usage.

### Others

- cudnn frontend now supports serialization of the dynamic kernel cache.

## Bug Fixes

- Fixed the dlopen of cudart.so to look for the binary with the version name.
- Correctly fail when SDPA bprop is called on Blackwell with embedded dimension (d) > 128.
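To make the adalayernorm signature above concrete, here is a minimal graph-construction sketch. It assumes the operation is exposed as a `Graph` method like the other normalization ops; the tensor shapes and the default-constructed `AdaLayernorm_attributes` are illustrative only, and the sample file referenced above is authoritative.

```
// Minimal sketch of building a graph with adalayernorm, under the
// assumptions stated above. Shapes and attribute defaults are illustrative.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void build_adalayernorm(fe::graph::Graph& graph) {
    auto X     = graph.tensor(fe::graph::Tensor_attributes()
                                  .set_name("X")
                                  .set_dim({8, 1024, 768})
                                  .set_stride({1024 * 768, 768, 1}));
    auto scale = graph.tensor(fe::graph::Tensor_attributes().set_name("scale"));
    auto bias  = graph.tensor(fe::graph::Tensor_attributes().set_name("bias"));

    // Per the signature above, a 3-element array is returned
    // (presumably {Y, mean, inv_variance}).
    auto outputs = graph.adalayernorm(X, scale, bias, fe::graph::AdaLayernorm_attributes());
    outputs[0]->set_output(true);
}
```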
# v1.11.0 release (#136)

## cudnn frontend v1.11 release notes

cudnn frontend v1.11 is the preferred cudnn frontend version for cudnn version 9.8.0 and above. With cudnn frontend v1.11, the minimum supported cudnn version is 9.0.0.

## New API

- cudnn frontend v1.11 adds a flexible score modifier to the python SDPA API. Samples showcasing soft capping of the attention scores and an arrow mask are available in the [cudnn_frontend/test/python/test_flexible_sdpa.py](https://github.com/NVIDIA/cuDNN-frontend/blob/main/cudnn_frontend/test/python/test_flexible_sdpa.py) file. A sample usage of the score modifier is shown below:

```
score_mod=partial(
    custom_mask,
    mod_tensor=mod_tensor,
    neg_inf=neg_inf_tensor,
    seq_len_q=seq_len_q,
    seq_len_kv=seq_len_kv,
)
```

- The Concatenate operation merges two or more tensors into one along the specified axis. The user may also specify an in-place merge. A hedged usage sketch follows these notes.

```
std::shared_ptr<Tensor_attributes>
concatenate(std::vector<std::shared_ptr<Tensor_attributes>>, Concatenate_attributes);
```

- pip wheels compatible with the Windows x86_64 architecture are now available on [pypi](https://pypi.org/project/nvidia-cudnn-frontend/).
- The sdpa paged attention API now supports a ragged Q tensor when used with cudnn version 9.7.0 and above.

## Improvements

- Users can now pass the CMake flag `-DCMAKE_CXX_FLAGS="-DNV_CUDNN_FRONTEND_DISABLE_LOGGING"` to disable logging in the cudnn frontend.
- Added a new sample to showcase native cuda graph creation from cudnn for the sdpa bprop operation, and fixed a bug when using the `update_cuda_graph` API to update the cuda graph for sdpa bprop.

## Bug Fixes

- Fixed a memory leak in the test harness for some legacy tests that use ragged tensors.
- Fixed a bug introduced in the benchmarking script that prevented the sdpa cudnn operation from being executed; the root cause, the `use_padding_mask` attribute having been made mandatory for the sdpa operation, has been fixed as well.
- Updated the paged attention sample to avoid illegal memory access when changing the dimensions of the tensors in the sample.
- Updated the DgradDReluBNBwdWeight sample to perform the correct operation for the dgrad + drelu fusion.
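A minimal usage sketch of `concatenate`, based only on the signature quoted above. The `Concatenate_attributes` setter shown (`set_axis`) is an assumption about the builder API, not a verified name.

```
// Minimal sketch: concatenate two tensors along axis 1, under the
// assumptions stated above. The setter name on Concatenate_attributes
// is assumed, not verified.
#include <cudnn_frontend.h>
#include <memory>

namespace fe = cudnn_frontend;
using TensorPtr = std::shared_ptr<fe::graph::Tensor_attributes>;

void build_concat(fe::graph::Graph& graph, TensorPtr a, TensorPtr b) {
    auto out = graph.concatenate({a, b},
                                 fe::graph::Concatenate_attributes()
                                     .set_axis(1));  // assumed setter
    out->set_output(true);
}
```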
# cudnn frontend v1.10 release notes (#126)

cudnn frontend v1.10 is the preferred cudnn frontend to be used with cudnn backend 9.7.0 and later, as it adds Blackwell-specific features.

## New API

- cudnn frontend v1.10 introduces two new operators, `block_scale_quantize` and `block_scale_dequantize`, to specify the scaling and de-scaling of the low-precision datatypes supported from Blackwell GPUs onwards.
- `create_execution_plan(int64_t const engine_id, std::unordered_map<KnobType_t, int64_t> const &knobs)` allows creation of a custom execution plan with a hardcoded engine and knobs. Added a sample in `samples/cpp/misc/custom_plan.cpp` to showcase how to work with different `Engine`s and `Knobs`. A hedged sketch follows these notes.

## Improvements

- Users can now query the behavior notes of a particular execution plan using the `get_behavior_notes(std::vector<BehaviorNote_t> &notes) const` and `get_behavior_notes_for_plan_at_index(int64_t const index, std::vector<BehaviorNote_t> &notes) const` functions.
- SDPA operations now accept both a left and a right window size with respect to the diagonal. See Attention.md for more details.
- SDPA operations now accept a diagonal alignment for the attention score matrix, used to describe the window above. When `s_q != s_kv` and the causal mask is on, this can be used to specify whether the diagonal is top-left or bottom-right aligned.
- Bottom-right causal masking can now be enabled on the sdpa_fp8 operation.

## Bug fixes

- Fixed a regression in cudnn frontend v1.9.0 where the softmax node would override user-set dims and strides for softmax_stats and m_zinv. This also affected the sdpa_forward and sdpa_fp8_forward nodes.

## New samples

- Added an example to showcase how native cuda graphs can be constructed from the SDPA operation graph.
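A combined sketch of the custom-plan and behavior-note APIs above, assuming both are methods on the frontend `Graph` object and that `error_t` exposes `is_bad()`; the engine id and the (empty) knob map are placeholders, and `samples/cpp/misc/custom_plan.cpp` remains the authoritative reference.

```
// Sketch under the assumptions stated above: pick engine 0 with no
// knob overrides, then query the plan's behavior notes.
#include <cudnn_frontend.h>
#include <unordered_map>
#include <vector>

namespace fe = cudnn_frontend;

fe::error_t build_custom_plan(fe::graph::Graph& graph) {
    std::unordered_map<fe::KnobType_t, int64_t> knobs;  // knob choices are engine-specific
    auto status = graph.create_execution_plan(/*engine_id=*/0, knobs);
    if (status.is_bad()) return status;

    std::vector<fe::BehaviorNote_t> notes;
    return graph.get_behavior_notes(notes);  // e.g. check for runtime-compilation notes
}
```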
# cudnn frontend v1.9 release notes (#123)

## New API

### cudnn Flex Attention

`SDPA_attributes` and `SDPA_bprop_attributes` now accept a score_mod function through the `set_score_mod` and `set_score_mod_bprop` APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors, such as the causal mask, sliding window mask, and soft capping, have been added to the headers as reference. More examples of usage have been added in the samples for [fprop](fp16_fwd_with_flexible_graphs.cpp) and [bprop](fp16_bwd_with_flexible_graphs.cpp). A hedged sketch of a soft-capping modifier follows these notes.

### Improvements

- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added new parameters, `set_max_total_seq_len_q`/`set_max_total_seq_len_kv`, on the sdpa bprop node. These help reduce the workspace size required when running with the THD format.
- Allow creation of serialized json for dgrad, wgrad, and resample operations.
- Added a more diagnostic message when the compiled version of cudnn does not match the run-time version of cudnn.

### Bug fixes

- Fixed an issue where log messages contained unparseable data at the end.
- Fixed an issue where building the python pip wheel would hang.
- Fixed natively creating cuda graphs for SDPA with alibi masks.

### New samples

- Added a new sample for Layernorm with dynamic shapes and a kernel cache, to showcase the reduced plan build time when using the kernel cache.
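A sketch of what a soft-capping score modifier might look like, expressed as the chain of pointwise operations the note above describes. The callback signature (graph plus score tensor in, modified tensor out) and the pass-by-value scalar setup are assumptions; the fp16_fwd_with_flexible_graphs.cpp sample shows the canonical form.

```
// Sketch under the assumptions stated above: tanh soft capping,
// score -> cap * tanh(score / cap), built from pointwise nodes.
#include <cudnn_frontend.h>
#include <memory>

namespace fe = cudnn_frontend;
using TensorPtr = std::shared_ptr<fe::graph::Tensor_attributes>;

TensorPtr soft_cap_score_mod(std::shared_ptr<fe::graph::Graph> graph, TensorPtr score) {
    auto cap = graph->tensor(fe::graph::Tensor_attributes()
                                 .set_name("cap")
                                 .set_dim({1, 1, 1, 1})
                                 .set_stride({1, 1, 1, 1})
                                 .set_is_pass_by_value(true)  // host scalar input
                                 .set_data_type(fe::DataType_t::FLOAT));
    auto scaled = graph->pointwise(score, cap,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::DIV));
    auto capped = graph->pointwise(scaled,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::TANH_FWD));
    return graph->pointwise(capped, cap,
                            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
}

// Attached via: sdpa_attributes.set_score_mod(soft_cap_score_mod);
```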
# cudnn frontend v1.8 release notes (#118)

## New API

### Paged Attention API

The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page-table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to input these descriptors. Please refer to the samples for usage: [cpp samples](samples/cpp/sdpa/fp16_fwd_with_paged_caches.cpp), [python samples](samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb). See the [docs](docs/operations/Attention.md) for more API details.

### cuda Graph API

The cudnn graph now allows users to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty cuda_graph provided as input.
- `update_cuda_graph`: updates the populated cuda graph with the necessary data pointers.

See the [docs](docs/cuda_graphs.md) and [backend documentation](https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnbackendpopulatecudagraph) for more details. A hedged sketch of the populate/update flow follows these notes.

### Enhancements

- The kernel cache for dynamic shapes is now supported in python. Added a [sample](test/python/test_kernel_cache.py) to showcase usage.
- `graph.deselect_engines(str: )` now has a python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to the int32_t, float, and fp16 data types.)
- fp8 sdpa attention now allows dropout and padding masks. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcasted tensor can now be either at IN_0 or IN_1.
- The SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.

### Bug fixes

- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.
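The populate/update split above maps naturally onto a build-once, update-per-iteration loop. Below is a minimal sketch, assuming both APIs are `Graph` methods taking a handle, a uid-to-pointer variant pack, a workspace, and the `cudaGraph_t`; the exact parameter lists are not verified, and docs/cuda_graphs.md is authoritative.

```
// Sketch under the assumptions stated above: populate an empty
// cudaGraph_t once, then swap in fresh data pointers on later runs.
#include <cuda_runtime.h>
#include <cudnn_frontend.h>
#include <unordered_map>

namespace fe = cudnn_frontend;

fe::error_t run_with_cuda_graph(fe::graph::Graph& graph,
                                cudnnHandle_t handle,
                                std::unordered_map<int64_t, void*>& variant_pack,  // uid -> device ptr
                                void* workspace) {
    cudaGraph_t cuda_graph;
    cudaGraphCreate(&cuda_graph, 0);

    // One-time: add the cudnn nodes to the empty cuda graph.
    auto status = graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);
    if (status.is_bad()) return status;

    // Per-iteration: update data pointers without rebuilding the graph.
    return graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);
}
```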
# cudnn FE 1.7.0 release notes (#111)

## New API

- Kernel cache support for dynamic graphs. Added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to the [documentation](docs/dynamic_kernel_cache.md) for API details. Added the examples `Convolution fprop dynamic shape`, `CSBR Graph dynamic shape`, `Matmul dynamic shape`, and `Bias + Matmul dynamic shape` to showcase the use of dynamic shapes and the kernel cache.
- Two new APIs that describe the plan in the form of its engine number and knobs are introduced:

```
error_t get_plan_name(std::string &name) const;
error_t get_plan_name_at_index(int64_t plan_index, std::string &name) const;
```

  Note: this name can be used later with `deselect_plan_by_name` if you run into any errors with a particular plan.
- Added an API to query tensor attributes from a tensor's UID in a graph: `query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;`. A hedged sketch follows these notes.

## Improvements

- The sdpa fp16 bprop node can now compute dbias when the padding mask is enabled.
- The sdpa fp8 (forward and bprop) nodes now support optional bias, dropout, and padding masks.
- The matmul fp8 node can now accept M, N, K overrides.
- Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cudnn.
- Updated the [benchmark numbers](benchmark) with cudnn 9.4.0 for the fp16 and fp8 datatypes.
- Fixed compilation issues when `NV_CUDNN_DISABLE_EXCEPTION` is enabled.

## Bug fixes

- Fixed a crash when the output dimension of the dgrad node is not specified. This now returns an error message instead.
- Fixed incorrect SDPA stats stride inferencing.
- Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key length (s_kv). This case is now not supported.
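A short sketch combining the plan-name and UID-query APIs quoted above, assuming both live on the frontend `Graph` object; the deselect-by-name flow is taken from the note in the release text.

```
// Sketch under the assumptions stated above: fetch the selected plan's
// name, optionally deselect it by name later, and look up a tensor by UID.
#include <cudnn_frontend.h>
#include <string>

namespace fe = cudnn_frontend;

fe::error_t inspect_graph(fe::graph::Graph& graph, int64_t uid) {
    std::string name;
    auto status = graph.get_plan_name(name);  // name of the currently selected plan
    if (status.is_bad()) return status;

    // If this plan later misbehaves, it can be skipped by name:
    // graph.deselect_plan_by_name(name);

    fe::graph::Tensor_attributes tensor;
    return graph.query_tensor_with_uid(uid, tensor);  // tensor attributes for this UID
}
```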
# cudnn FE 1.6.1 release (#99)

## Bug fixes

- Fixed an issue where a custom dropout mask was not correctly applied.
- Added `-fvisibility=hidden` for the generated pip wheels to avoid symbol conflicts with other modules that use cudnn frontend.
- Fixed an issue in sdpa kernels that could lead to numerical mismatches.
- Fixed an issue in sdpa fp8 fprop kernels (in inference mode).

## Samples

- Added a new sample to showcase how a custom dropout mask can be applied to an sdpa operation.
- Added a sample to showcase convolutions on large (`c * d * h * w > 2 ** 31`) tensors.
# v1.6.0 release

## New API

- Graph Slice Operation: Introduced the `graph.slice` operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the `set_sm_count(int32_t type)` graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED. A hedged sketch follows these notes.

## Bug Fixes

- Convolution Mode Attribute: Added the missing `set_convolution_mode` attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

## Enhancements

- Graph Execution Overhead: Reduced the overhead of `graph.execute()` by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of `graph.validate()` by deferring graph expansion to a later stage (`build_operation_graph`).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cudnn backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to `create_execution_plans` if it is called without a preceding `build_operation_graph`.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster `graph.build()` calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

## Samples

- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
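Since `set_sm_count(int32_t)` is quoted above, the sketch below only has to pick an illustrative value; 16 is a placeholder, and engines that do not support SM_COUNT will return NOT_SUPPORTED as noted.

```
// Minimal sketch: request SM carveout for the kernels built from this
// graph, using the set_sm_count(int32_t) property quoted above.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void carve_out_sms(fe::graph::Graph& graph) {
    // Limit this graph's kernels to 16 SMs (illustrative value), leaving
    // the remaining SMs free for concurrent work on Ampere/Hopper GPUs.
    graph.set_sm_count(16);
}
```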