v3.8.0 update #2082

Merged · 2 commits · Feb 7, 2025
CHANGELOG.md — 8 changes: 5 additions & 3 deletions

@@ -14,7 +14,7 @@
- [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp).
- [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp).
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./cutlass/media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
* Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
@@ -32,6 +32,7 @@
* CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes, parsed from profiler command-line arguments, to set dynamic cluster shapes.
- Support for dynamic data types, parsed from profiler command-line arguments, to set the dynamic data-type setting in TCGen05 MMA instruction descriptors.
* New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
* Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 architecture:
- [Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples](./examples/70_blackwell_gemm/), demonstrating ease of migration for off-the-shelf kernels using the 3.x collective builder API.
- GEMM with [opt-in collective builder schedules showcasing available recipes](./examples/71_blackwell_gemm_with_collective_builder/71_blackwell_gemm_with_collective_builder.cu) for Blackwell.
@@ -46,14 +47,15 @@
- [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
* Documentation updates:
- [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/narrow_and_mixed_precision_gemms.md)
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
- A new [functionality documentation](./media/docs/functionality.md) specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc., for 3.x-supported architectures.
- Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).
- Support grouped GEMM in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details).

## [3.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.7.0) (2025-01-11)
- [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) uses a 2D scaling tensor, assigning one value per threadblock. This allows finer-grained scaling to be applied for each output tile per gemm-k iteration. The operands and scaling tensors are loaded from global memory to shared memory using TMA and cp_async, respectively. The scaling is applied inside the mainloop. Details with figures are [here](https://github.com/NVIDIA/cutlass/pull/1932#issue-2645398439).
- [Distributed GEMM](./examples/65_distributed_gemm/65_distributed_gemm.cu) is a new (experimental) API that can turn existing CUTLASS GEMM kernels into pipelined Tensor Parallel GEMMs that run efficiently on an NVLink-based network of GPUs. Its pipelining schedules can hide most of the communication behind computation, and it relies on point-to-point communication, which can simply use the CUDA runtime's peer device access feature. It also utilizes remote TMA loads and memcopies with CUDA graphs to handle communication primarily through the Copy Engine, leaving all SMs free for Hopper's persistent kernels. For more details, refer to the [DistGEMM blog post](https://blog.shi-labs.com/distributed-gemm-88be6a481e2b).
- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
- Potential API breaking changes:
+ Fix `cute::UniversalCopy` for type safety.
CMakeLists.txt — 101 changes: 95 additions & 6 deletions

@@ -114,6 +114,13 @@ set(CUTLASS_TEST_LEVEL "0" CACHE STRING "Level of tests to compile.")
find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)

################################################################################


include(customConfigs.cmake)

################################################################################


set(CUTLASS_ENABLE_HEADERS_ONLY OFF CACHE BOOL "Enable only the header library")

if(CUTLASS_ENABLE_HEADERS_ONLY)
@@ -395,12 +402,6 @@ endif()
#
###################################################################################################

if (CUDA_VERSION VERSION_GREATER_EQUAL 12.8)
list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUDA_BLACKWELL_TMA_SWIZZLE_ENABLED=1)

list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUDA_ENABLE_PREFERRED_CLUSTER=1)
endif()



# Warnings-as-error exceptions and warning suppressions for Clang builds
@@ -978,6 +979,94 @@ function(cutlass_add_executable_tests NAME TARGET)

endfunction()



function(cutlass_generate_profiler_tests NAME)

set(options)
set(oneValueArgs)
set(multiValueArgs DEPENDS DEPENDEES CUTLASS_PROFILER_EXTRA_OPTIONS)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
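# Note: with "_" as the prefix argument, cmake_parse_arguments() stores the
# parsed values in variables such as __DEPENDS, __DEPENDEES, and
# __CUTLASS_PROFILER_EXTRA_OPTIONS, which are referenced below.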

if (NOT CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS AND NOT CUTLASS_BUILD_FOR_PROFILER_PERFORMANCE_REGRESSIONS)
return()
endif()

install(
FILES ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}
DESTINATION ${CMAKE_INSTALL_INFODIR}/cutlass/
RENAME profiler_regressions.csv
)

# Generate cmake test targets for each entry in the testlist csv

if (NOT EXISTS "${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
message(SEND_ERROR "Profiler unit tests list path is invalid: CUTLASS_PROFILER_REGRESSION_LIST_FILE = ${CUTLASS_PROFILER_REGRESSION_LIST_FILE}")
else()
message(STATUS "Using ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} to generate profiler-based tests.")
endif()

file(STRINGS ${CUTLASS_PROFILER_REGRESSION_LIST_FILE} TEST_LIST)

foreach(TEST IN LISTS TEST_LIST)

if ("${TEST}" MATCHES " *cutlass_profiler.*")

# Generate a flattened name for the test from the test command line.
string(REPLACE "," ";" TEST_NAME_LIST ${TEST})
list(GET TEST_NAME_LIST 0 TEST)
string(REGEX MATCHALL "[a-zA-Z0-9_=]+" TEST_NAME "${TEST}")
list(FILTER TEST_NAME EXCLUDE REGEX "cutlass_profiler|mode=trace|providers=cutlass")
list(JOIN TEST_NAME "_" TEST_NAME)
string(REGEX REPLACE "_verification_required=(true|false)" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_verification_providers=device" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "batch_count=" "batch" TEST_NAME "${TEST_NAME}")
string(REPLACE "cluster_m=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_cluster_n=" "x" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "_cluster_k=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "cluster_m_fallback=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_cluster_n_fallback=" "x" TEST_NAME "${TEST_NAME}")
string(REGEX REPLACE "_cluster_k_fallback=[0-9]+" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "runtime_input_datatype_a=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "runtime_input_datatype_b=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "=" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_error_on_no_match" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "_error_if_nothing_is_profiled" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "kernels" "" TEST_NAME "${TEST_NAME}")
string(REPLACE "operation" "" TEST_NAME "${TEST_NAME}")

if (__DO_NOT_LOWERCASE_TEST_NAME)
set(TEST_NAME_LOWER "${TEST_NAME}")
else()
string(TOLOWER "${TEST_NAME}" TEST_NAME_LOWER)
endif()

# Munge the test command
string(REPLACE "cutlass_profiler" "" TEST "${TEST}")
set(TEST "${TEST}" ${__CUTLASS_PROFILER_EXTRA_OPTIONS} "--junit-output=${TEST_NAME_LOWER}")
set(TEST_COMMAND_${TEST_NAME_LOWER} "${TEST}")
list(APPEND TEST_COMMAND_VARS ${TEST_NAME_LOWER})

endif()

endforeach()

cutlass_add_executable_tests(
${NAME} cutlass_profiler
DEPENDS ${__DEPENDS}
DEPENDEES ${__DEPENDEES}
TEST_COMMAND_OPTIONS ${TEST_COMMAND_VARS}
TEST_COMMAND_OPTIONS_PREFIX TEST_COMMAND_
DISABLE_EXECUTABLE_INSTALL_RULE
# Uncomment the following line when alloc/dealloc tracking
# is fixed for all configurations.
# TEST_SETS_SUPPORTED tmem_alloc_tracking
)

endfunction()
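# A minimal usage sketch (the target name and extra option below are assumptions
# for illustration, not part of this change): with CUTLASS_PROFILER_REGRESSION_LIST_FILE
# pointing at a regression CSV, a call like this would emit one ctest entry per
# cutlass_profiler row in the list.
#
#   cutlass_generate_profiler_tests(
#     cutlass_profiler_regressions
#     DEPENDS cutlass_profiler
#     CUTLASS_PROFILER_EXTRA_OPTIONS --profiling-iterations=1
#   )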



if (CUTLASS_ENABLE_TOOLS)
add_subdirectory(tools)
if (CUTLASS_ENABLE_PROFILER)
README.md — 6 changes: 3 additions & 3 deletions

@@ -87,11 +87,11 @@ For a background on Blackwell's new features, please consult the PTX documentation.
- [Fused multi-head attention fprop kernel](./examples/77_blackwell_fmha/77_blackwell_fmha.cu) supporting fp16/bf16/fp8 data types across head dims of 32, 64, and 128.
* Documentation updates:
- [Quickstart - instantiating a Blackwell block-scaled GEMM](./media/docs/quickstart.md#instantiating-a-blackwell-gemm-kernel).
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/narrow_and_mixed_precision_gemms.md)
- Detailed [Blackwell block-scaled GEMM functionality documentation](./media/docs/blackwell_functionality.md)
- A new [functionality documentation](./media/docs/functionality.md) specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, minimum CUDA toolkit support, etc., for 3.x-supported architectures.
- Updates to [compatibility](./README.md#compatibility) section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and [Target Architecture](./README.md#Target-Architecture).

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
CUTLASS team is working on a fix.

**See the [CHANGELOG](CHANGELOG.md) for details of all past releases and updates.**
@@ -162,7 +162,7 @@ We have tested the following environments.

Note: GCC 8.5.0 has known regressions regarding fold expressions and overloaded operators. Using GCC 7.5.0 or (preferred) GCC >= 9 is recommended.

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
CUTLASS team is working on a fix.

## Hardware
customConfigs.cmake — 92 changes: 92 additions & 0 deletions

@@ -0,0 +1,92 @@
# Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright notice, this
# list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
#
# 3. Neither the name of the copyright holder nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.





# Profiler based functional testing
set(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS OFF CACHE BOOL "Utilize profiler-based functional regressions")
set(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL ${CUTLASS_TEST_LEVEL} CACHE STRING "Profiler functional regression test level")

find_package(Python3 3.5 COMPONENTS Interpreter REQUIRED)

function(cutlass_generate_kernel_filter_and_testlists_files)

set(options)
set(oneValueArgs TEST_SET_NAME)
set(multiValueArgs)
cmake_parse_arguments(_ "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

execute_process(
COMMAND ${CMAKE_COMMAND} -E env PYTHONPATH=${CUTLASS_LIBRARY_PACKAGE_DIR}
${Python3_EXECUTABLE} ${CUTLASS_SOURCE_DIR}/python/cutlass_library/generator.py
--generator-target=${__TEST_SET_NAME}
--cuda-version=${CUTLASS_GENERATOR_CUDA_COMPILER_VERSION}
--architectures=${CUTLASS_NVCC_ARCHS}
--kernels=\*
--disable-cutlass-package-imports
WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
RESULT_VARIABLE cutlass_FILTER_GENERATION_RESULT
OUTPUT_VARIABLE cutlass_FILTER_GENERATION_OUTPUT
OUTPUT_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
ERROR_FILE ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log
)

if(NOT cutlass_FILTER_GENERATION_RESULT EQUAL 0)
message(FATAL_ERROR "Error generating kernel filters and testlists files. See ${CMAKE_CURRENT_BINARY_DIR}/library_filter_generation.log")
endif()
endfunction()

if(CUTLASS_BUILD_FOR_PROFILER_REGRESSIONS)

set(PROFILER_ARCH_LIST 100a)
foreach(ARCH IN LISTS CUTLASS_NVCC_ARCHS)
if(NOT (ARCH IN_LIST PROFILER_ARCH_LIST))
message(FATAL_ERROR "Only SM100a compute capability is supported with profiler-based unit tests")
endif()
endforeach()

if(CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 0)

message(STATUS "Building for L0 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l0)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L0_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")

elseif (CUTLASS_PROFILER_REGRESSION_TEST_LEVEL EQUAL 1)

message(STATUS "Building for L1 profiler-based functional regressions")
cutlass_generate_kernel_filter_and_testlists_files(TEST_SET_NAME kernel_testlist_l1)
set(KERNEL_FILTER_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm_kernel_filter.list CACHE STRING "Kernel set")
set(CUTLASS_PROFILER_REGRESSION_LIST_FILE ${CMAKE_CURRENT_BINARY_DIR}/FK_functional_L1_testlist_SM${CUTLASS_NVCC_ARCHS}_cutlass3x_gemm.csv CACHE STRING "Regression set")

endif()
endif()
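# Illustrative configure step (the exact command line is an assumption based on
# the cache variables defined above, not part of this change):
#
#   cmake -B build -DCUTLASS_NVCC_ARCHS=100a \
#         -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON \
#         -DCUTLASS_PROFILER_REGRESSION_TEST_LEVEL=0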


@@ -483,18 +483,13 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
return 0;
}

if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}




//
@@ -566,17 +566,13 @@ int main(int argc, char const **args) {
CUDA_CHECK(cudaGetDevice(&current_device_id));
CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
cudaError_t error = cudaGetDeviceProperties(&props, 0);
if (props.major < 9) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture or "
<< "later (compute capability 90 or greater).\n";
return 0;
}
else if (props.major != 9 || props.minor != 0) {
std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}

if (props.major != 9 || props.minor != 0) {
std::cerr
<< "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
return 0;
}

//