# cudnn frontend v1.13.0 (#150)

cudnn frontend v1.13 is the preferred cudnn frontend version for [cudnn version 9.11.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-11-0) and above.

- Introduced a device descriptor, which allows device-less compilation of a cudnn graph for a target GPU. See the newly added [sample](samples/cpp/misc/deviceless_aot_compilation.cpp) and documentation.
- Introduced `generate_stats` as an alias for `is_inference`. `generate_stats` now controls the stats tensor dump; `is_inference` is deprecated. A hedged migration sketch follows these notes.
- Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.
- Improved error handling for large head dimension (d > 128) in sdpa bprop.
- Added support for fused Layernorm with Relu, along with a sample for the [Layernorm with relu bitmask dump](samples/cpp/norm/layernorm_bitmask_relu.cpp).
- Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.
- Enabled int4 weight-only quantization for matmul. See the [example](samples/cpp/int4_woq_matmul.cpp).
- Allowed block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.
- Allowed reductions to accept `deterministic` as an attribute.
- Added pybinds for block scale dequantize.
- Fixed the sliding window `attn_score_modifier` function so that it can set true negative infinity.
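The `generate_stats` rename is a one-line change at the call site. Below is a minimal sketch, assuming `SDPA_attributes` exposes a `set_generate_stats` setter that mirrors the existing `set_is_inference`; the setter name is inferred from the note above, not verified against the headers.

```
// Minimal sketch, assuming a set_generate_stats() setter mirroring the
// deprecated set_is_inference(); the setter name is an assumption.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void configure_sdpa(fe::graph::SDPA_attributes& attrs, bool is_training) {
    // Previously: attrs.set_is_inference(!is_training);
    // generate_stats == true requests the softmax stats tensor dump
    // needed by the backward pass; false skips it for inference.
    attrs.set_generate_stats(is_training);  // assumed setter name
}
```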
# Benchmark results for sdpa operation on Blackwell (#146)

The benchmarking script in this directory profiles scaled dot product attention (SDPA) from various backends. Here we benchmark attention layer dimensions inspired by Llama-3.1-405B, with sequence lengths ranging from 512 to 131,072.
# v1.12.0 release (#141)

## cudnn frontend v1.12 release notes

cudnn frontend v1.12 is the preferred cudnn frontend version for cudnn version 9.9.0 and above, and is the minimum cudnn frontend version required to work with cuda 13.0 and above. Updated the dlpack version and raised the CMake minimum required version to 3.18.

## New API

- Allows compilation and loading of cudnn frontend with cudnn-jit packages.
- Introduces the Adaptive Layernorm (fprop and bprop) operation in cudnn:

```
std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm(std::shared_ptr<Tensor_attributes>& input,
             std::shared_ptr<Tensor_attributes>& scale,
             std::shared_ptr<Tensor_attributes>& bias,
             AdaLayernorm_attributes attributes);

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm_backward(std::shared_ptr<Tensor_attributes> dy,
                      std::shared_ptr<Tensor_attributes> x,
                      std::shared_ptr<Tensor_attributes> scale,
                      AdaLayernorm_backward_attributes options);
```

  Please refer to the [samples](samples/cpp/norm/adaptive_layernorm.cpp) for usage. A hedged graph-construction sketch also follows these notes.
- The cudnn frontend python API introduces two decorator functions, `cudnn.jit` and `cudnn.graph`, for simpler graph creation in python. Refer to the [matmul sample](samples/python/01_matmul_bias.ipynb) for usage.

## Improvements

### SDPA

- Allows large embedded dimension (d > 128) for fprop across Ampere, Hopper, and Blackwell architectures for bf16/fp16.
- Added better validation checks for sliding window attention for cudnn version 9.9.0 and below.
- Sliding window attention now supports cases where s_q > s_kv.
- The sdpa_fp8 operation now pads correctly with negative infinity on the masking operation rather than a large negative value. This improves the numerical stability of the sdpa operation with the fp8 data type.
- Paged attention now supports page tables in a packed format.

### Normalizations

- Allow zero-centered scale in layer norm. Refer to this [sample](samples/cpp/norm/norm_zero_centered_gamma.cpp) for usage.

### Others

- cudnn frontend now supports serialization of the dynamic kernel cache.

## Bug Fixes

- Fixed the dlopen of cudart.so to look for the binary with the version name.
- Correctly fail when SDPA bprop is called on Blackwell with embedded dimension (d) > 128.
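To make the adalayernorm signature above concrete, here is a minimal graph-construction sketch. It assumes the operation is exposed as a `Graph` method like the other normalization ops; the tensor shapes and the default-constructed `AdaLayernorm_attributes` are illustrative only, and the sample file referenced above is authoritative.

```
// Minimal sketch of building a graph with adalayernorm, under the
// assumptions stated above. Shapes and attribute defaults are illustrative.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void build_adalayernorm(fe::graph::Graph& graph) {
    auto X     = graph.tensor(fe::graph::Tensor_attributes()
                                  .set_name("X")
                                  .set_dim({8, 1024, 768})
                                  .set_stride({1024 * 768, 768, 1}));
    auto scale = graph.tensor(fe::graph::Tensor_attributes().set_name("scale"));
    auto bias  = graph.tensor(fe::graph::Tensor_attributes().set_name("bias"));

    // Per the signature above, a 3-element array is returned
    // (presumably {Y, mean, inv_variance}).
    auto outputs = graph.adalayernorm(X, scale, bias, fe::graph::AdaLayernorm_attributes());
    outputs[0]->set_output(true);
}
```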
# v1.11.0 release (#136)

## cudnn frontend v1.11 release notes

cudnn frontend v1.11 is the preferred cudnn frontend version for cudnn version 9.8.0 and above. With cudnn frontend v1.11, the minimum supported cudnn version is 9.0.0.

## New API

- cudnn frontend v1.11 adds a flexible score modifier to the python SDPA API. Samples showcasing soft capping of the attention scores and an arrow mask are available in the [cudnn_frontend/test/python/test_flexible_sdpa.py](https://github.com/NVIDIA/cuDNN-frontend/blob/main/cudnn_frontend/test/python/test_flexible_sdpa.py) file. A sample usage of the score modifier is shown below:

```
score_mod=partial(
    custom_mask,
    mod_tensor=mod_tensor,
    neg_inf=neg_inf_tensor,
    seq_len_q=seq_len_q,
    seq_len_kv=seq_len_kv,
)
```

- The Concatenate operation merges two or more tensors into one along the specified axis. The user may also specify an in-place merge. A hedged usage sketch follows these notes.

```
std::shared_ptr<Tensor_attributes>
concatenate(std::vector<std::shared_ptr<Tensor_attributes>>, Concatenate_attributes);
```

- pip wheels compatible with the Windows x86_64 architecture are now available on [pypi](https://pypi.org/project/nvidia-cudnn-frontend/).
- The sdpa paged attention API now supports a ragged Q tensor when used with cudnn version 9.7.0 and above.

## Improvements

- Users can now pass the CMake flag `-DCMAKE_CXX_FLAGS="-DNV_CUDNN_FRONTEND_DISABLE_LOGGING"` to disable logging in the cudnn frontend.
- Added a new sample to showcase native cuda graph creation from cudnn for the sdpa bprop operation, and fixed a bug when using the `update_cuda_graph` API to update the cuda graph for sdpa bprop.

## Bug Fixes

- Fixed a memory leak in the test harness for some legacy tests that use ragged tensors.
- Fixed a bug introduced in the benchmarking script that prevented the sdpa cudnn operation from being executed; the root cause, the `use_padding_mask` attribute having been made mandatory for the sdpa operation, has been fixed as well.
- Updated the paged attention sample to avoid illegal memory access when changing the dimensions of the tensors in the sample.
- Updated the DgradDReluBNBwdWeight sample to perform the correct operation for the dgrad + drelu fusion.
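A minimal usage sketch of `concatenate`, based only on the signature quoted above. The `Concatenate_attributes` setter shown (`set_axis`) is an assumption about the builder API, not a verified name.

```
// Minimal sketch: concatenate two tensors along axis 1, under the
// assumptions stated above. The setter name on Concatenate_attributes
// is assumed, not verified.
#include <cudnn_frontend.h>
#include <memory>

namespace fe = cudnn_frontend;
using TensorPtr = std::shared_ptr<fe::graph::Tensor_attributes>;

void build_concat(fe::graph::Graph& graph, TensorPtr a, TensorPtr b) {
    auto out = graph.concatenate({a, b},
                                 fe::graph::Concatenate_attributes()
                                     .set_axis(1));  // assumed setter
    out->set_output(true);
}
```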
# cudnn frontend v1.10 release notes (#126)

cudnn frontend v1.10 is the preferred cudnn frontend to be used with cudnn backend 9.7.0 and later, as it adds Blackwell-specific features.

## New API

- cudnn frontend v1.10 introduces two new operators, `block_scale_quantize` and `block_scale_dequantize`, to specify the scaling and de-scaling of the low-precision datatypes supported from Blackwell GPUs onwards.
- `create_execution_plan(int64_t const engine_id, std::unordered_map<KnobType_t, int64_t> const &knobs)` allows creation of a custom execution plan with a hardcoded engine and knobs. Added a sample in `samples/cpp/misc/custom_plan.cpp` to showcase how to work with different `Engine`s and `Knobs`. A hedged sketch follows these notes.

## Improvements

- Users can now query the behavior notes of a particular execution plan using the `get_behavior_notes(std::vector<BehaviorNote_t> &notes) const` and `get_behavior_notes_for_plan_at_index(int64_t const index, std::vector<BehaviorNote_t> &notes) const` functions.
- SDPA operations now accept both a left and a right window size with respect to the diagonal. See Attention.md for more details.
- SDPA operations now accept a diagonal alignment for the attention score matrix, used to describe the window above. When `s_q != s_kv` and the causal mask is on, this can be used to specify whether the diagonal is top-left or bottom-right aligned.
- Bottom-right causal masking can now be enabled on the sdpa_fp8 operation.

## Bug fixes

- Fixed a regression in cudnn frontend v1.9.0 where the softmax node would override user-set dims and strides for softmax_stats and m_zinv. This also affected the sdpa_forward and sdpa_fp8_forward nodes.

## New samples

- Added an example to showcase how native cuda graphs can be constructed from the SDPA operation graph.
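A combined sketch of the custom-plan and behavior-note APIs above, assuming both are methods on the frontend `Graph` object and that `error_t` exposes `is_bad()`; the engine id and the (empty) knob map are placeholders, and `samples/cpp/misc/custom_plan.cpp` remains the authoritative reference.

```
// Sketch under the assumptions stated above: pick engine 0 with no
// knob overrides, then query the plan's behavior notes.
#include <cudnn_frontend.h>
#include <unordered_map>
#include <vector>

namespace fe = cudnn_frontend;

fe::error_t build_custom_plan(fe::graph::Graph& graph) {
    std::unordered_map<fe::KnobType_t, int64_t> knobs;  // knob choices are engine-specific
    auto status = graph.create_execution_plan(/*engine_id=*/0, knobs);
    if (status.is_bad()) return status;

    std::vector<fe::BehaviorNote_t> notes;
    return graph.get_behavior_notes(notes);  // e.g. check for runtime-compilation notes
}
```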
# cudnn frontend v1.9 release notes (#123)

## New API

### cudnn Flex Attention

`SDPA_attributes` and `SDPA_bprop_attributes` now accept a score_mod function through the `set_score_mod` and `set_score_mod_bprop` APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors, such as the causal mask, sliding window mask, and soft capping, have been added to the headers as reference. More examples of usage have been added in the samples for [fprop](fp16_fwd_with_flexible_graphs.cpp) and [bprop](fp16_bwd_with_flexible_graphs.cpp). A hedged sketch of a soft-capping modifier follows these notes.

### Improvements

- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added new parameters, `set_max_total_seq_len_q`/`set_max_total_seq_len_kv`, on the sdpa bprop node. These help reduce the workspace size required when running with the THD format.
- Allow creation of serialized json for dgrad, wgrad, and resample operations.
- Added a more diagnostic message when the compiled version of cudnn does not match the run-time version of cudnn.

### Bug fixes

- Fixed an issue where log messages contained unparseable data at the end.
- Fixed an issue where building the python pip wheel would hang.
- Fixed natively creating cuda graphs for SDPA with alibi masks.

### New samples

- Added a new sample for Layernorm with dynamic shapes and a kernel cache, to showcase the reduced plan build time when using the kernel cache.
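A sketch of what a soft-capping score modifier might look like, expressed as the chain of pointwise operations the note above describes. The callback signature (graph plus score tensor in, modified tensor out) and the pass-by-value scalar setup are assumptions; the fp16_fwd_with_flexible_graphs.cpp sample shows the canonical form.

```
// Sketch under the assumptions stated above: tanh soft capping,
// score -> cap * tanh(score / cap), built from pointwise nodes.
#include <cudnn_frontend.h>
#include <memory>

namespace fe = cudnn_frontend;
using TensorPtr = std::shared_ptr<fe::graph::Tensor_attributes>;

TensorPtr soft_cap_score_mod(std::shared_ptr<fe::graph::Graph> graph, TensorPtr score) {
    auto cap = graph->tensor(fe::graph::Tensor_attributes()
                                 .set_name("cap")
                                 .set_dim({1, 1, 1, 1})
                                 .set_stride({1, 1, 1, 1})
                                 .set_is_pass_by_value(true)  // host scalar input
                                 .set_data_type(fe::DataType_t::FLOAT));
    auto scaled = graph->pointwise(score, cap,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::DIV));
    auto capped = graph->pointwise(scaled,
                                   fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::TANH_FWD));
    return graph->pointwise(capped, cap,
                            fe::graph::Pointwise_attributes().set_mode(fe::PointwiseMode_t::MUL));
}

// Attached via: sdpa_attributes.set_score_mod(soft_cap_score_mod);
```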
# cudnn frontend v1.8 release notes (#118)

## New API

### Paged Attention API

The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page-table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to input these descriptors. Please refer to the samples for usage: [cpp samples](samples/cpp/sdpa/fp16_fwd_with_paged_caches.cpp), [python samples](samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb). See the [docs](docs/operations/Attention.md) for more API details.

### cuda Graph API

The cudnn graph now allows users to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty cuda_graph provided as input.
- `update_cuda_graph`: updates the populated cuda graph with the necessary data pointers.

See the [docs](docs/cuda_graphs.md) and [backend documentation](https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnbackendpopulatecudagraph) for more details. A hedged sketch of the populate/update flow follows these notes.

### Enhancements

- The kernel cache for dynamic shapes is now supported in python. Added a [sample](test/python/test_kernel_cache.py) to showcase usage.
- `graph.deselect_engines(str: )` now has a python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to the int32_t, float, and fp16 data types.)
- fp8 sdpa attention now allows dropout and padding masks. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcasted tensor can now be either at IN_0 or IN_1.
- The SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.

### Bug fixes

- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.
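The populate/update split above maps naturally onto a build-once, update-per-iteration loop. Below is a minimal sketch, assuming both APIs are `Graph` methods taking a handle, a uid-to-pointer variant pack, a workspace, and the `cudaGraph_t`; the exact parameter lists are not verified, and docs/cuda_graphs.md is authoritative.

```
// Sketch under the assumptions stated above: populate an empty
// cudaGraph_t once, then swap in fresh data pointers on later runs.
#include <cuda_runtime.h>
#include <cudnn_frontend.h>
#include <unordered_map>

namespace fe = cudnn_frontend;

fe::error_t run_with_cuda_graph(fe::graph::Graph& graph,
                                cudnnHandle_t handle,
                                std::unordered_map<int64_t, void*>& variant_pack,  // uid -> device ptr
                                void* workspace) {
    cudaGraph_t cuda_graph;
    cudaGraphCreate(&cuda_graph, 0);

    // One-time: add the cudnn nodes to the empty cuda graph.
    auto status = graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);
    if (status.is_bad()) return status;

    // Per-iteration: update data pointers without rebuilding the graph.
    return graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);
}
```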
# cudnn FE 1.7.0 release notes (#111)

## New API

- Kernel cache support for dynamic graphs. Added new APIs to enable kernel cache support for graphs with dynamic shapes. Please refer to the [documentation](docs/dynamic_kernel_cache.md) for API details. Added the examples `Convolution fprop dynamic shape`, `CSBR Graph dynamic shape`, `Matmul dynamic shape`, and `Bias + Matmul dynamic shape` to showcase the use of dynamic shapes and the kernel cache.
- Two new APIs that describe the plan in the form of its engine number and knobs are introduced:

```
error_t get_plan_name(std::string &name) const;
error_t get_plan_name_at_index(int64_t plan_index, std::string &name) const;
```

  Note: this name can be used later with `deselect_plan_by_name` if you run into any errors with a particular plan.
- Added an API to query tensor attributes from a tensor's UID in a graph: `query_tensor_with_uid(int64_t const uid, Tensor_attributes &tensor) const;`. A hedged sketch follows these notes.

## Improvements

- The sdpa fp16 bprop node can now compute dbias when the padding mask is enabled.
- The sdpa fp8 (forward and bprop) nodes now support optional bias, dropout, and padding masks.
- The matmul fp8 node can now accept M, N, K overrides.
- Added new python notebooks for implementing BatchNorm and BatchNorm bprop using cudnn.
- Updated the [benchmark numbers](benchmark) with cudnn 9.4.0 for the fp16 and fp8 datatypes.
- Fixed compilation issues when `NV_CUDNN_DISABLE_EXCEPTION` is enabled.

## Bug fixes

- Fixed a crash when the output dimension of the dgrad node is not specified. This now returns an error message instead.
- Fixed incorrect SDPA stats stride inferencing.
- Fixed a bug in the sdpa test when sliding window attention is enabled and the query sequence length (s_q) is greater than the key length (s_kv). This case is now not supported.
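A short sketch combining the plan-name and UID-query APIs quoted above, assuming both live on the frontend `Graph` object; the deselect-by-name flow is taken from the note in the release text.

```
// Sketch under the assumptions stated above: fetch the selected plan's
// name, optionally deselect it by name later, and look up a tensor by UID.
#include <cudnn_frontend.h>
#include <string>

namespace fe = cudnn_frontend;

fe::error_t inspect_graph(fe::graph::Graph& graph, int64_t uid) {
    std::string name;
    auto status = graph.get_plan_name(name);  // name of the currently selected plan
    if (status.is_bad()) return status;

    // If this plan later misbehaves, it can be skipped by name:
    // graph.deselect_plan_by_name(name);

    fe::graph::Tensor_attributes tensor;
    return graph.query_tensor_with_uid(uid, tensor);  // tensor attributes for this UID
}
```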
# cudnn FE 1.6.1 release (#99)

## Bug fixes

- Fixed an issue where a custom dropout mask was not correctly applied.
- Added `-fvisibility=hidden` for the generated pip wheels to avoid symbol conflicts with other modules that use cudnn frontend.
- Fixed an issue in sdpa kernels that could lead to numerical mismatches.
- Fixed an issue in sdpa fp8 fprop kernels (in inference mode).

## Samples

- Added a new sample to showcase how a custom dropout mask can be applied to an sdpa operation.
- Added a sample to showcase convolutions on large (`c * d * h * w > 2 ** 31`) tensors.
# v1.6.0 release

## New API

- Graph Slice Operation: Introduced the `graph.slice` operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the `set_sm_count(int32_t type)` graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED. A hedged sketch follows these notes.

## Bug Fixes

- Convolution Mode Attribute: Added the missing `set_convolution_mode` attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.

## Enhancements

- Graph Execution Overhead: Reduced the overhead of `graph.execute()` by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of `graph.validate()` by deferring graph expansion to a later stage (`build_operation_graph`).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cudnn backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to `create_execution_plans` if it is called without a preceding `build_operation_graph`.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster `graph.build()` calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.

## Samples

- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
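Since `set_sm_count(int32_t)` is quoted above, the sketch below only has to pick an illustrative value; 16 is a placeholder, and engines that do not support SM_COUNT will return NOT_SUPPORTED as noted.

```
// Minimal sketch: request SM carveout for the kernels built from this
// graph, using the set_sm_count(int32_t) property quoted above.
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

void carve_out_sms(fe::graph::Graph& graph) {
    // Limit this graph's kernels to 16 SMs (illustrative value), leaving
    // the remaining SMs free for concurrent work on Ampere/Hopper GPUs.
    graph.set_sm_count(16);
}
```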