NVIDIA Fermi: the Nuclear GPU for Scientific Applications

NVIDIA Fermi GT300

Most of the information you can find on the Net comes from the NVIDIA Fermi Architecture Whitepaper (PDF). After a quick reading of this paper and some websites, here is a summary of NVIDIA’s new nuclear GPU (General Processing Unit 😉):

– GT200 codename: Tesla (Nikola Tesla)
GT300 codename: Fermi (Enrico Fermi)

– Fermi is planned for the end of 2009 (best case) or the beginning of 2010. Why is Fermi late? Because designing GPUs this big is “fucking hard”, said Ujesh Desai, NVIDIA’s VP of Product Marketing (ref).

NVIDIA GT300 – video card

NVIDIA GT300 – front face

NVIDIA GT300 – back face

But be warned: these cards and GPUs may be FAKE (see HERE). Actually, NVIDIA has recently clarified that the Fermi demo board was an “engineering” prototype, driven by a Fermi-based Tesla chip (see HERE).

Main features:

  • GPU: GT300 / 40nm (TSMC process) / 3 billion transistors
  • 16 SMs (Streaming Multiprocessors – each SM executes threads in groups of 32 called warps)
  • 32 CUDA cores per SM
  • Total 512 CUDA cores (240 in the GT200)
  • Memory: six 64-bit memory partitions, for a 384-bit memory interface (512-bit for GT200), supporting up to a total of 6 GB of GDDR5 DRAM
  • ~650/1700/4200MHz (base / hot (shader) / memory clocks)
  • 48 ROPs, 8Z/C clock
  • 128 TMUs (80 for the GT200)
  • TDP: just over 225W
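
The totals in this list follow from simple arithmetic, which the small Python sketch below reproduces. The peak figures are rough estimates only: they assume the rumored ~1700MHz hot clock and ~4200MHz effective memory data rate from the list, plus one fused multiply-add (counted as 2 floating-point operations) per CUDA core per hot clock. All the names are made up for the illustration.

```python
# Back-of-the-envelope numbers derived from the feature list above.
# The clocks are the rumored values from the list, not confirmed specs.

SM_COUNT = 16              # streaming multiprocessors
CORES_PER_SM = 32          # CUDA cores per SM
WARP_SIZE = 32             # threads per warp
PARTITIONS = 6             # memory partitions
PARTITION_WIDTH_BITS = 64  # bits per memory partition
HOT_CLOCK_MHZ = 1700       # assumed shader ("hot") clock
MEM_RATE_MHZ = 4200        # assumed effective GDDR5 data rate

total_cores = SM_COUNT * CORES_PER_SM
bus_width_bits = PARTITIONS * PARTITION_WIDTH_BITS

# Assumption: each core retires one fused multiply-add (2 flops) per hot clock.
peak_gflops = total_cores * 2 * HOT_CLOCK_MHZ / 1000

# Bytes per transfer times transfers per second, expressed in GB/s.
peak_bandwidth_gbs = (bus_width_bits // 8) * MEM_RATE_MHZ / 1000


def warps_for(threads: int) -> int:
    """Number of 32-thread warps needed to hold `threads` threads."""
    return (threads + WARP_SIZE - 1) // WARP_SIZE


print(total_cores, "CUDA cores")
print(bus_width_bits, "bit memory interface")
print(round(peak_gflops, 1), "GFLOPS single precision (estimate)")
print(round(peak_bandwidth_gbs, 1), "GB/s memory bandwidth (estimate)")
print(warps_for(100), "warps for a 100-thread block")
```

Under these assumptions the script gives 512 cores and a 384-bit interface, matching the list, and a 100-thread block needs 4 warps (the last one only partly filled).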

GT300 die:

NVIDIA GT300 - Die

NVIDIA GT300 - Streaming multiprocessor

NVIDIA GT300 - Architecture

NVIDIA GT200 - Architecture

Fermi is designed to be a general-purpose compute machine, the next engine of science, and aims to cover the entire spectrum of scientific applications. ATI’s Cypress is specifically targeted at running the GPU’s killer app today: 3D games.

– Fermi is the first architecture to support the new Parallel Thread eXecution (PTX) 2.0 instruction set. PTX is a low-level virtual machine and ISA designed to support the operations of a parallel thread processor.

– With PTX 2.0, a unified address space merges all three address spaces into a single, continuous address space. The three address spaces that are now unified are (see the CUDA developer’s guide for more details):

  • local (thread private)
  • shared (for each thread block)
  • global memory (device and system-wide)

With a unified address space, indirection for data structures is possible. NVIDIA now supports pointers and object references, which are necessary for C++ and most other high-level languages that pass by reference.
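
To picture what the unification means, here is a toy Python model (not CUDA, and not a description of how the hardware is actually built): one flat address range is split into three windows, and a single `load` routine picks the memory space from the address alone. The window boundaries and all the names are invented for the illustration.

```python
# Toy model of a unified address space: one flat address range, three windows.
# The window sizes are hypothetical, chosen only to keep the example small.

WINDOWS = [
    # (name, first address, size in words)
    ("local", 0, 16),      # thread private
    ("shared", 16, 48),    # per thread block
    ("global", 64, 1024),  # device and system-wide
]

# One backing store per memory space.
memory = {name: [0] * size for name, _, size in WINDOWS}


def translate(address: int) -> tuple:
    """Map a flat address to (memory space, offset inside that space)."""
    for name, base, size in WINDOWS:
        if base <= address < base + size:
            return name, address - base
    raise ValueError(f"address {address} is outside every window")


def store(address: int, value: int) -> None:
    space, offset = translate(address)
    memory[space][offset] = value


def load(address: int) -> int:
    # The caller never says which space it wants: the address decides.
    space, offset = translate(address)
    return memory[space][offset]


store(3, 11)     # lands in local memory
store(20, 22)    # lands in shared memory
store(100, 33)   # lands in global memory

for pointer in (3, 20, 100):
    print(pointer, translate(pointer), load(pointer))
```

The point of the sketch is that code holding a "pointer" uses the same `load` and `store` whatever memory the data lives in, which is what makes passing objects by reference practical.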

From NVIDIA Fermi architecture PDF:

The implementation of a unified address space enables Fermi to support true C++ programs. In C++, all variables and functions reside in objects which are passed via pointers. PTX 2.0 makes it possible to use unified pointers to pass objects in any memory space, and Fermi’s hardware address translation unit automatically maps pointer references to the correct memory space. Fermi and the PTX 2.0 ISA also add support for C++ virtual functions, function pointers, and new and delete operators for dynamic object allocation and de-allocation. C++ exception handling operations ‘try’ and ‘catch’ are also supported.

Cool! We have more or less an answer to the question I asked HERE.

– Fermi is optimized for OpenCL and DirectCompute (Micro$oft).

Concurrent Kernel Execution:
Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors. On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources.
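
A rough way to see the benefit, assuming (as in the PhysX example) two kernels that each keep only half of the 16 SMs busy: this Python sketch compares wall time and utilization for back-to-back versus concurrent execution. The kernel names, SM counts and durations are hypothetical, and scheduling overhead is ignored entirely.

```python
# Toy comparison of sequential vs concurrent kernel execution.
# Durations and SM counts are made-up values for illustration.

TOTAL_SMS = 16

# (kernel name, SMs it can keep busy, run time in ms)
kernels = [
    ("fluids_solver", 8, 4.0),
    ("rigid_body_solver", 8, 4.0),
]

# Work actually done, measured in SM-milliseconds.
busy_sm_ms = sum(sms * ms for _, sms, ms in kernels)

# Sequential: kernels run one after another, so their times add up.
sequential_ms = sum(ms for _, _, ms in kernels)

# Concurrent: the kernels fit side by side (8 + 8 <= 16 SMs),
# so the wall time is that of the longest kernel.
assert sum(sms for _, sms, _ in kernels) <= TOTAL_SMS
concurrent_ms = max(ms for _, _, ms in kernels)


def utilization(wall_ms: float) -> float:
    """Fraction of the available SM time that did useful work."""
    return busy_sm_ms / (TOTAL_SMS * wall_ms)


print("sequential:", sequential_ms, "ms,", utilization(sequential_ms))
print("concurrent:", concurrent_ms, "ms,", utilization(concurrent_ms))
```

With these made-up numbers, running the two solvers side by side halves the wall time and takes utilization from one half to full, which is the effect the paragraph describes.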

NVIDIA GT300 - PhysX fluid simulation performance

More readings: