NVIDIA CUDA Toolkit 12.0

Started by Stefan, December 09, 2022, 12:08:14 AM




1.2. New Features

This section lists new general CUDA and CUDA compiler features.
1.2.1. General CUDA


        CUDA 12.0 exposes programmable functionality for many features of the Hopper and Ada Lovelace architectures:

            Many tensor operations now available via public PTX:

                TMA operations

                TMA bulk operations

                32x Ultra xMMA (including FP8/FP16)

            Membar domains in Hopper, controlled via launch parameters

            Smem sync unit PTX and C++ API support

            Introduced C intrinsics for Cooperative Grid Array (CGA) relaxed barrier support

            Programmatic L2 Cache to SM multicast (Hopper-only)

            Public PTX for SIMT collectives - elect_one

            Genomics/DPX instructions now available for Hopper GPUs to provide faster combined-math arithmetic operations (three-way max, fused add+max, etc.)

        Enhancements to the CUDA graphs API:

            You can now schedule graph launches from GPU device-side kernels by calling built-in functions. With this ability, user code in kernels can dynamically schedule graph launches, greatly increasing the flexibility of CUDA graphs.

            The cudaGraphInstantiate() API has been refactored to remove unused parameters.

        Added the ability to use virtual memory management (VMM) APIs such as cuMemCreate() with GPUs masked by CUDA_VISIBLE_DEVICES.

        Application and library developers can now programmatically update the priority of CUDA streams.

        CUDA 12.0 adds support for revamped CUDA Dynamic Parallelism APIs, offering substantial performance improvements vs. the legacy CUDA Dynamic Parallelism APIs.

        Added new APIs to obtain unique stream and context IDs from user-provided objects:

            cuStreamGetId(CUstream hStream, unsigned long long *streamId)

            cuCtxGetId(CUcontext ctx, unsigned long long *ctxId)

        Added support for read-only cuMemSetAccess() flag CU_MEM_ACCESS_FLAGS_PROT_READ.
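The device-side graph launch described above can be sketched as follows. This is an illustrative example, not code from the release notes: the graph construction is assumed to happen elsewhere, and names like `workGraph` and `launcherKernel` are hypothetical. A graph must be instantiated for device launch and uploaded to the device before a kernel can launch it.

```cuda
#include <cuda_runtime.h>

__global__ void launcherKernel(cudaGraphExec_t exec)
{
    // Fire-and-forget: the launched graph executes independently of
    // this kernel. cudaStreamGraphTailLaunch is the other option,
    // which runs the graph after the launching graph completes.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        cudaGraphLaunch(exec, cudaStreamGraphFireAndForget);
}

void hostSide(cudaGraph_t workGraph, cudaStream_t stream)
{
    cudaGraphExec_t exec;
    // Note the refactored three-parameter cudaGraphInstantiate():
    // the graph must be instantiated for device launch...
    cudaGraphInstantiate(&exec, workGraph,
                         cudaGraphInstantiateFlagDeviceLaunch);
    // ...and uploaded before any kernel may launch it.
    cudaGraphUpload(exec, stream);
    launcherKernel<<<1, 1, 0, stream>>>(exec);
}
// Device graph launch uses the device runtime, so compile with
// relocatable device code, e.g.: nvcc -rdc=true -arch=sm_90 ...
```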
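A minimal usage sketch for the new ID query APIs, using the signatures listed above (error handling abbreviated; assumes a driver API context and stream already exist):

```cuda
#include <cuda.h>
#include <stdio.h>

void printIds(CUstream stream, CUcontext ctx)
{
    unsigned long long streamId = 0, ctxId = 0;
    // Each call returns CUDA_SUCCESS on success and writes a
    // process-unique ID for the given object.
    if (cuStreamGetId(stream, &streamId) == CUDA_SUCCESS &&
        cuCtxGetId(ctx, &ctxId) == CUDA_SUCCESS)
        printf("stream %llu in context %llu\n", streamId, ctxId);
}
```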
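As a sketch of the new read-only access flag, the snippet below grants a device read-only access to a VMM-mapped range. This assumes the range at `ptr` was created and mapped with the VMM APIs (`cuMemCreate()`, `cuMemMap()`) beforehand; the function name is illustrative.

```cuda
#include <cuda.h>
#include <string.h>

void makeReadOnly(CUdeviceptr ptr, size_t size, int device)
{
    CUmemAccessDesc desc;
    memset(&desc, 0, sizeof(desc));
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id   = device;
    // New in CUDA 12.0: read-only protection for the mapping.
    desc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READ;
    cuMemSetAccess(ptr, size, &desc, 1);
}
```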

1.2.2. CUDA Compilers


    JIT LTO support is now officially part of the CUDA Toolkit through a separate nvJitLink library. A technical deep-dive blog post will go into more detail. Note that the earlier implementation of this feature has been deprecated; refer to the Deprecation/Dropped Features section below for details.

    New host compiler support:

        GCC 12.1 (Official) and 12.2.1 (Experimental)

        VS 2022 17.4 Preview 3 fixes compiler errors mentioning an internal function std::_Bit_cast by using CUDA's support for __builtin_bit_cast.

    NVCC and NVRTC now support the C++20 dialect. Most language features are available in both host and device code, though some, such as coroutines, are not supported in device code. Modules are not supported in either host or device code. Host compiler minimum versions: GCC 10, Clang 11, VS2022, Arm C/C++ 22.x. Refer to the individual host compiler documentation for other feature limitations. Note that a compilation issue in C++20 mode with the <complex> header mentioning an internal function std::_Bit_cast is resolved in VS2022 17.4.

    NVRTC default C++ dialect changed from C++14 to C++17. Refer to the ISO C++ standard for the feature set and compatibility differences between the dialects.

    NVVM IR Update: CUDA 12.0 introduces NVVM IR 2.0, which is incompatible with the NVVM IR 1.x accepted by the libNVVM compiler in prior CUDA Toolkit releases. Users of the libNVVM compiler in the CUDA 12.0 toolkit must generate NVVM IR 2.0.
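A hedged sketch of the nvJitLink workflow mentioned above: the snippet links previously generated LTO IR into a cubin at runtime. The buffer names and the `-arch` value are illustrative, and error handling is abbreviated; consult the nvJitLink documentation for the authoritative API.

```cuda
#include <nvJitLink.h>
#include <stdlib.h>

// ltoirData/ltoirSize are assumed to hold LTO IR produced earlier
// (e.g. by compiling with nvcc -dlto).
void *linkLtoIr(const void *ltoirData, size_t ltoirSize, size_t *cubinSize)
{
    nvJitLinkHandle handle;
    const char *options[] = { "-lto", "-arch=sm_90" };  // illustrative

    nvJitLinkCreate(&handle, 2, options);
    nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR,
                     ltoirData, ltoirSize, "module0");
    nvJitLinkComplete(handle);

    nvJitLinkGetLinkedCubinSize(handle, cubinSize);
    void *cubin = malloc(*cubinSize);
    nvJitLinkGetLinkedCubin(handle, cubin);

    nvJitLinkDestroy(&handle);
    return cubin;  // load with cuModuleLoadData(), free when done
}
```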
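As an illustration of the new C++20 support, the sketch below uses a C++20 concept to constrain a device kernel. This is a hypothetical example, not from the release notes; it assumes a supported host compiler (GCC 10+, Clang 11+, VS2022, or Arm C/C++ 22.x).

```cuda
#include <type_traits>

// C++20 concepts are usable in device code with a supported host compiler.
template <typename T>
concept Arithmetic = std::is_arithmetic_v<T>;

template <Arithmetic T>
__global__ void scale(T *data, T factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Compile with: nvcc -std=c++20 scale.cu
```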