Author Topic: Programming Tesla V100 Tensor Cores in CUDA 9  (Read 1796 times)

0 Members and 1 Guest are viewing this topic.


  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1508
    • View Profile
Programming Tesla V100 Tensor Cores in CUDA 9
« on: October 18, 2017, 08:18:45 AM »
A defining feature of the new Volta GPU Architecture is its Tensor Cores, which give the Tesla V100 accelerator a peak throughput 12 times the 32-bit floating point throughput of the previous-generation Tesla P100.

Tensor Cores are already supported for Deep Learning training either in a main release or via pull requests in many Deep Learning frameworks (including Tensorflow, PyTorch, MXNet, and Caffe2). For more information about enabling Tensor Cores when using these frameworks, check out the Mixed-Precision Training Guide. For Deep Learning inference the recent TensorRT 3 release also supports Tensor Cores.

In this blog post we show you how you to use Tensor Cores in your own application using CUDA Libraries as well as how to program them directly in CUDA C++ device code.

esla V100’s Tensor Cores are programmable matrix-multiply-and-accumulate units that can deliver up to 125 Tensor TFLOPS for training and inference applications. The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM. Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.

What are Tensor Cores?
Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation D = A * B + C, where A, B, C and D are 4×4 matrices as Figure 1 shows. The matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D may be FP16 or FP32 matrices.

Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate, as Figure 2 shows) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU.

Complete article: