Author Topic: NVIDIA toolkit 11.0.1_451.22 RC released  (Read 1857 times)


Stefan

  • Global Moderator
  • Hero Member
  • Posts: 4574
NVIDIA toolkit 11.0.1_451.22 RC released
« on: June 06, 2020, 05:46:22 AM »
  • CUDA 11.0 adds support for the NVIDIA Ampere GPU microarchitecture (compute_80 and sm_80).
  • CUDA 11.0 adds support for NVIDIA A100 GPUs and systems that are based on A100. The A100 GPU adds the following capabilities for compute via CUDA:
    • Alternate floating point data format Bfloat16 (__nv_bfloat16) and compute type TF32 (tf32)
    • Double precision matrix multiply accumulate through the DMMA instruction (see note on WMMA in CUDA C++ and mma in PTX)
    • Support for asynchronous copy instructions that allow copying of data asynchronously (LDGSTS instruction and the corresponding cp.async.* PTX instructions)
    • Cooperative groups improvements, which allow reduction operation across threads in a warp (using the redux.sync instruction)
    • Support for hardware partitioning via Multi-Instance GPU (MIG). See the driver release notes for more information on the corresponding NVML APIs and the nvidia-smi CLI tools for configuring MIG instances
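Two of the capabilities above surface directly in CUDA C++ through the cooperative groups library: `cg::memcpy_async` (which lowers to the LDGSTS / cp.async path on sm_80) and `cg::reduce` (which maps to redux.sync for integer types on A100). A minimal sketch, with a hypothetical kernel name and a fixed 256-thread block assumed:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Illustrative sketch: stage data into shared memory asynchronously
// (cp.async on sm_80), then reduce across each warp (redux.sync on
// sm_80). Assumes blockDim.x == 256 and at least 256 input elements.
__global__ void stage_and_reduce(const int* in, int* out)
{
    __shared__ int tile[256];

    cg::thread_block block = cg::this_thread_block();

    // Asynchronous global->shared copy; can overlap independent work.
    cg::memcpy_async(block, tile, in, sizeof(int) * 256);
    cg::wait(block);  // block until the staged data is visible

    // Warp-level reduction using the new cg::reduce API.
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int warp_sum = cg::reduce(warp, tile[threadIdx.x], cg::plus<int>());

    if (warp.thread_rank() == 0)
        atomicAdd(out, warp_sum);
}
```

On pre-Ampere architectures the same source still compiles; the library falls back to software implementations of both operations.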
  • Added the 7.0 version of the Parallel Thread Execution instruction set architecture (ISA). For details on new features (the sm_80 target, new instructions, the new floating point data types .bf16 and .tf32, and new mma shapes) and deprecated instructions, see this section in the PTX documentation.
  • CUDA 11.0 adds support for the Arm server platform (arm64 SBSA). Note that with this release, only the following platforms are supported with Tesla V100 GPU:
    • HPE Apollo 70 (using Marvell ThunderX2™ CN99XX)
    • Gigabyte R2851 (using Marvell ThunderX2™ CN99XX)
    • Huawei TaiShan 2280 V2 (using Huawei Kunpeng 920)
  • CUDA supports a wide range of Linux and Windows distributions. For a full list of supported operating systems, see the system requirements. The following new Linux distributions are supported in CUDA 11.0.
    For x86 (x86_64):
    • Red Hat Enterprise Linux (RHEL) 8.1
    • Ubuntu 18.04.4 LTS
    For Arm (arm64):
    • SUSE SLES 15.1
    For POWER (ppc64le):
    • Red Hat Enterprise Linux (RHEL) 8.1
  • CUDA C++ includes new data types for 16-bit bfloat16 floating point data (1 sign bit, 8-bit exponent, and 7-bit mantissa): __nv_bfloat16 and __nv_bfloat162. See include/cuda_bf16.hpp and the CUDA Math API for more information on the datatype definition and supported arithmetic operations.
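A minimal sketch of the new bfloat16 type in device code, assuming an sm_80 target for the native fused multiply-add intrinsic (the kernel name and buffers are illustrative, not from the notes):

```cuda
#include <cuda_bf16.h>

// Hypothetical AXPY kernel over the new __nv_bfloat16 type.
// __hfma on bfloat16 operands requires sm_80; on older targets,
// convert through float (e.g. __bfloat162float) and back instead.
__global__ void bf16_axpy(const __nv_bfloat16* x, __nv_bfloat16* y,
                          __nv_bfloat16 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma(a, x[i], y[i]);  // y[i] = a * x[i] + y[i]
}
```

Host code can prepare inputs with `__float2bfloat16`, which rounds a 32-bit float to the 8-exponent / 7-mantissa bfloat16 format.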
  • CUDA 11.0 adds the following support for WMMA:
    • Added support for double (FP64) to the list of available input/output types for 8x8x4 shapes (DMMA.884)
    • AND bitwise operation supported for BMMA
    • Added support for __nv_bfloat16 and tf32 precision formats for the HMMA 16x16x8 shape
  • Added support for cooperative kernels in CUDA graphs, including stream capture for cuLaunchCooperativeKernel.
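The note above means a cooperative launch can now be recorded during stream capture and replayed as a graph. A sketch using the runtime-API equivalent of cuLaunchCooperativeKernel, with error checking elided and a hypothetical kernel name:

```cuda
#include <cuda_runtime.h>

// Hypothetical cooperative kernel; a real one would use grid-wide
// synchronization, e.g. cooperative_groups::this_grid().sync().
__global__ void coop_kernel(int* data) { }

void capture_cooperative_launch(int* d_data)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Begin stream capture; the cooperative launch below is recorded
    // into a graph rather than executed immediately.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    void* args[] = { &d_data };
    cudaLaunchCooperativeKernel((void*)coop_kernel, dim3(32), dim3(256),
                                args, 0, stream);

    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then relaunch cheaply as many times as needed.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
}
```

As with any cooperative launch, the grid must fit co-resident on the device; that constraint is unchanged by graph capture.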
  • The CUDA_VISIBLE_DEVICES variable has been extended to support enumerating Multi-Instance GPU (MIG) devices on NVIDIA A100/GA100 GPUs.
  • Added support for PCIe Relaxed Ordering for GPU initiated writes. This is not enabled by default but can be enabled by setting the following module parameter on Linux x86_64: NVreg_EnablePCIERelaxedOrderingMode.
  • CUDA 11.0 adds a specification for inter-task memory ordering in the "API Synchronization" subsection of the PTX memory model and allows CUDA's implementation to be optimized consistent with this addition. In rare cases, code may have assumed a stronger ordering than required by the added specification and may notice a functional regression. The environment variable CUDA_FORCE_INTERTASK_SYSTEM_FENCE may be set to a value of "0" to disable post-10.2 inter-task fence optimizations, or "1" to enable them for 445 and newer drivers. If the variable is not set, code compiled entirely against CUDA 10.2 or older will disable the optimizations and code compiled against 11.0 or newer will enable them. Code with mixed versions may see a combination.