[Test] OpenGL Geometry Instancing: GeForce GTX 480 vs Radeon HD 5870

Asteroid belt and Geometry Instancing: up to 180 million polygons rendered in real time



I took the time to update the OpenGL Geometry Instancing demo pack I released more than 2 years ago on my infamous lab. The first version of the demo pack was developed and tested with a GeForce 8800 GTX and a Radeon HD 3870, and only the 8800 GTX had HW GI (HardWare Geometry Instancing) support. That’s why the demos didn’t start if the card was not an 8800 GTX…

Today, all recent NVIDIA and AMD graphics cards support HW GI, so a serious update of the pack was necessary. It’s done. And of course it’s not a simple recompilation minus the 8800 GTX check: the demos have been updated with many more polygons and a new GI technique based on uniform buffers. And this time, all GI techniques work with both GeForce and Radeon. Thanks to the NV and ATI driver teams!



You can grab the demo pack here:
Download OpenGL Geometry Instancing DemoPack Version 2010.06.29

You can steer the camera with the mouse and move it with the WASD keys.
The skybox used in the demo comes from this page.



In short, geometry instancing makes it possible to render several instances of the same mesh at once. Geometry instancing techniques aim to minimize the number of draw calls (and/or to speed them up) required to render all instances. The ideal scenario is a single render call for all instances: one render call to rule them all!

This GI demo pack includes 5 GI techniques. Each GI technique can be enabled with the F2 to F6 keys. And to illustrate the GI techniques, the demo renders an asteroid belt with asteroids, lots of asteroids…


The demo is available in 10 versions (sorry, I was too lazy to code a GUI – maybe in a real benchmark version, who knows…):

  • 20,000 asteroids, 18 triangles per asteroid: 360,000 tri.
  • 20,000 asteroids, 72 triangles per asteroid: 1,440,000 tri.
  • 20,000 asteroids, 450 triangles per asteroid: 9,000,000 tri.
  • 20,000 asteroids, 800 triangles per asteroid: 16,000,000 tri.
  • 20,000 asteroids, 1800 triangles per asteroid: 36,000,000 tri.

and

  • 100,000 asteroids, 18 triangles per asteroid: 1,800,000 tri.
  • 100,000 asteroids, 72 triangles per asteroid: 7,200,000 tri.
  • 100,000 asteroids, 450 triangles per asteroid: 45,000,000 tri.
  • 100,000 asteroids, 800 triangles per asteroid: 80,000,000 tri.
  • 100,000 asteroids, 1800 triangles per asteroid: 180,000,000 tri.
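
The totals in the two lists above are simply the asteroid count multiplied by the triangles per asteroid. A minimal sketch that reproduces them (the function and constant names are made up for this illustration, they are not taken from the demo):

```python
# Reproduce the triangle totals of the ten demo versions listed above.
INSTANCE_COUNTS = (20_000, 100_000)               # asteroids per belt
TRIANGLES_PER_ASTEROID = (18, 72, 450, 800, 1800)


def total_triangles(instances: int, tris_per_instance: int) -> int:
    """Triangles submitted per frame when every asteroid is drawn."""
    return instances * tris_per_instance


def demo_versions():
    """Return (instances, tris_per_instance, total) for all ten versions."""
    return [
        (n, t, total_triangles(n, t))
        for n in INSTANCE_COUNTS
        for t in TRIANGLES_PER_ASTEROID
    ]


if __name__ == "__main__":
    for n, t, total in demo_versions():
        print(f"{n:>7,} asteroids x {t:>4} tri = {total:>11,} tri")
```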



Here is the description of each geometry instancing technique I used:

  • F2 key: simple instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. The transformation matrix of each instance is calculated on the CPU. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. The simplest and least efficient geometry instancing technique…

  • F3 key: slow pseudo-instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. Now the transformation matrix is computed on the GPU and per-instance data is passed via uniform variables (a vec4 for the position and a vec4 for the orientation). OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. This technique is faster than F2 (simple instancing).

  • F4 key: Pseudo-Instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. The transformation matrix is computed on the GPU and per-instance data is passed via persistent vertex attributes. Persistent vertex attributes are for example the normal, the texture coordinates or the color (respectively set with glNormal(), glMultiTexCoord() and glColor()). This technique has been shown by NVIDIA in the following whitepaper: GLSL Pseudo-Instancing. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. Pseudo-Instancing is extremely efficient on NVIDIA hardware. See results below…

  • F5 key: geometry instancing: it’s the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done by batches of 64 instances per draw call. Actually on NVIDIA hardware, 400 instances can be rendered with one draw call but that does not work on ATI due to the limit on the number of vertex uniforms. 64 instances per batch work fine on both ATI and NVIDIA. The transformation matrix is computed on the GPU and per-batch data is passed via uniform arrays: there is a uniform array of vec4 for positions and another vec4 array for rotations. OpenGL rendering uses the glDrawElementsInstancedARB() function. The GL_ARB_draw_instanced extension is required. HW GI drastically reduces the number of draw calls: for the 20,000-asteroid belt, we need 20000/64 = 312.5, rounded up to 313 draw calls instead of 20,000.

  • F6 key: geometry instancing with uniform buffer: it’s still the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done by batches of… 1000 instances per draw call. A 1000-instance batch works fine on the GTX 480 and the HD 5870. The transformation matrix is computed on the GPU and per-batch data is passed via a big buffer of uniforms: a uniform buffer object or UBO. This technique requires the support of the GL_ARB_uniform_buffer_object extension. UBO allows a huge reduction of the number of draw calls: for 20,000 instances and 1000 instances per draw call, the complete asteroid belt requires only 20 draw calls! Like the previous GI technique, OpenGL rendering uses the glDrawElementsInstancedARB() function.
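
To put numbers on the five techniques, here is a small sketch that computes the draw calls per frame and the amount of per-batch uniform data for each batch size mentioned above. It only does the arithmetic and does not touch OpenGL. The helper names are hypothetical, and the two-vec4-per-instance layout is an assumption for F6 (the text only states it explicitly for F3 and F5).

```python
import math

VEC4_BYTES = 16        # 4 floats x 4 bytes
VEC4_PER_INSTANCE = 2  # position + orientation; assumed identical for F6


def draw_calls(instances: int, batch_size: int = 1) -> int:
    """Draw calls per frame; the last batch may be only partially filled."""
    return math.ceil(instances / batch_size)


def batch_uniform_bytes(batch_size: int) -> int:
    """Bytes of per-instance uniform data uploaded for one batch."""
    return batch_size * VEC4_PER_INSTANCE * VEC4_BYTES


if __name__ == "__main__":
    # Batch sizes taken from the descriptions above.
    techniques = (("F2/F3/F4", 1), ("F5", 64), ("F6", 1000))
    for instances in (20_000, 100_000):
        for name, batch in techniques:
            print(f"{instances:>7,} instances, {name:<8}: "
                  f"{draw_calls(instances, batch):>7,} draw calls, "
                  f"{batch_uniform_bytes(batch):>6,} bytes per batch")
```

With these assumptions a 64-instance batch carries 2,048 bytes of uniforms, the 400-instance batch that only worked on NVIDIA carries 12,800 bytes, and a 1000-instance UBO batch carries 32,000 bytes.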



Testbed:
- CPU: Intel Core i7 960 @ 3GHz
- RAM: 4GB DDR3 Corsair Dominator 1600MHz
- Mobo: Gigabyte GA-X58A-UD5
- PSU: Antec TPQ 850W
- OS: Windows 7 64-bit
- GeForce driver: R257.21
- Radeon driver: Catalyst 10.6

Graphics cards:
- EVGA GTX 480
- Radeon HD 5870 reference board

For each test, I read the GPU usage with EVGA Precision and the CPU usage with Windows task manager:

12% or 13% means one logical CPU core is fully used.
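
The 12% or 13% figure comes from the Core i7 960 exposing 8 logical cores (4 physical cores with Hyper-Threading), so one saturated logical core shows up as 100 / 8 = 12.5% in the task manager. A tiny sketch of that conversion, with hypothetical helper names:

```python
LOGICAL_CORES = 8  # Core i7 960: 4 physical cores, 2 threads each


def one_core_percent(logical_cores: int = LOGICAL_CORES) -> float:
    """Overall CPU usage shown when exactly one logical core is saturated."""
    return 100.0 / logical_cores


def busy_cores(task_manager_percent: float,
               logical_cores: int = LOGICAL_CORES) -> float:
    """Convert an overall CPU percentage into 'number of busy logical cores'."""
    return task_manager_percent / one_core_percent(logical_cores)


if __name__ == "__main__":
    print(one_core_percent())  # share of one logical core
    for reading in (12, 8, 4):
        print(f"CPU={reading}% is about {busy_cores(reading):.2f} logical core(s)")
```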



Here are the results (default resolution: 1024×600 and default camera position):

20,000 instances, 18 tri/instance = 360,000 tri

Radeon HD 5870:
- F2: FPS=43, GPU=68%, CPU=12%
- F3: FPS=55, GPU=88%, CPU=12%
- F4: FPS=45, GPU=25%, CPU=12%
- F5: FPS=134, GPU=22%, CPU=12%
- F6: FPS=139, GPU=24%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=36%, CPU=12%
- F3: FPS=48, GPU=56%, CPU=12%
- F4: FPS=117, GPU=22%, CPU=12%
- F5: FPS=150, GPU=34%, CPU=12%
- F6: FPS=164, GPU=40%, CPU=12%

20,000 instances, 72 tri/instance = 1,440,000 tri

Radeon HD 5870:
- F2: FPS=42, GPU=67%, CPU=12%
- F3: FPS=55, GPU=87%, CPU=12%
- F4: FPS=45, GPU=26%, CPU=12%
- F5: FPS=133, GPU=37%, CPU=12%
- F6: FPS=139, GPU=41%, CPU=12%
GeForce GTX 480:
- F2: FPS=32, GPU=21%, CPU=12%
- F3: FPS=48, GPU=33%, CPU=12%
- F4: FPS=117, GPU=22%, CPU=12%
- F5: FPS=150, GPU=37%, CPU=12%
- F6: FPS=163, GPU=44%, CPU=12%

20,000 instances, 450 tri/instance = 9,000,000 tri

Radeon HD 5870:
- F2: FPS=43, GPU=69%, CPU=12%
- F3: FPS=54, GPU=87%, CPU=12%
- F4: FPS=45, GPU=64%, CPU=12%
- F5: FPS=79, GPU=99%, CPU=12%
- F6: FPS=73, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=53%, CPU=12%
- F3: FPS=44, GPU=99%, CPU=12%
- F4: FPS=113, GPU=99%, CPU=12%
- F5: FPS=114, GPU=99%, CPU=12%
- F6: FPS=112, GPU=99%, CPU=12%

20,000 instances, 800 tri/instance = 16,000,000 tri

Radeon HD 5870:
- F2: FPS=42, GPU=99%, CPU=12%
- F3: FPS=43, GPU=99%, CPU=12%
- F4: FPS=43, GPU=99%, CPU=12%
- F5: FPS=48, GPU=99%, CPU=12%
- F6: FPS=46, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=56%, CPU=12%
- F3: FPS=42, GPU=99%, CPU=12%
- F4: FPS=66, GPU=99%, CPU=12%
- F5: FPS=67, GPU=99%, CPU=12%
- F6: FPS=66, GPU=99%, CPU=12%

20,000 instances, 1800 tri/instance = 36,000,000 tri

Radeon HD 5870:
- F2: FPS=22, GPU=99%, CPU=12%
- F3: FPS=22, GPU=99%, CPU=12%
- F4: FPS=22, GPU=99%, CPU=12%
- F5: FPS=22, GPU=99%, CPU=12%
- F6: FPS=22, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=30, GPU=99%, CPU=12%
- F3: FPS=30, GPU=99%, CPU=9%
- F4: FPS=31, GPU=99%, CPU=12%
- F5: FPS=32, GPU=99%, CPU=12%
- F6: FPS=32, GPU=99%, CPU=12%

100,000 instances, 18 tri/instance = 1,800,000 tri

Radeon HD 5870:
- F2: FPS=9, GPU=67%, CPU=12%
- F3: FPS=12, GPU=85%, CPU=12%
- F4: FPS=10, GPU=23%, CPU=12%
- F5: FPS=33, GPU=21%, CPU=12%
- F6: FPS=37, GPU=20%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=34%, CPU=12%
- F3: FPS=10, GPU=54%, CPU=12%
- F4: FPS=25, GPU=14%, CPU=12%
- F5: FPS=32, GPU=25%, CPU=12%
- F6: FPS=35, GPU=30%, CPU=12%

100,000 instances, 72 tri/instance = 7,200,000 tri

Radeon HD 5870:
- F2: FPS=9, GPU=67%, CPU=12%
- F3: FPS=12, GPU=85%, CPU=12%
- F4: FPS=10, GPU=24%, CPU=12%
- F5: FPS=33, GPU=35%, CPU=12%
- F6: FPS=37, GPU=41%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=36%, CPU=12%
- F3: FPS=10, GPU=57%, CPU=12%
- F4: FPS=25, GPU=38%, CPU=12%
- F5: FPS=32, GPU=36%, CPU=12%
- F6: FPS=35, GPU=45%, CPU=12%

100,000 instances, 450 tri/instance = 45,000,000 tri

Radeon HD 5870:
- F2: FPS=9, GPU=69%, CPU=12%
- F3: FPS=12, GPU=90%, CPU=12%
- F4: FPS=10, GPU=65%, CPU=12%
- F5: FPS=17, GPU=99%, CPU=12%
- F6: FPS=16, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=53%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=23, GPU=99%, CPU=12%
- F5: FPS=24, GPU=99%, CPU=10%
- F6: FPS=23, GPU=99%, CPU=8%

100,000 instances, 800 tri/instance = 80,000,000 tri

Radeon HD 5870:
- F2: FPS=9, GPU=99%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=9, GPU=99%, CPU=12%
- F5: FPS=10, GPU=99%, CPU=12%
- F6: FPS=10, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=56%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=14, GPU=99%, CPU=7%
- F5: FPS=14, GPU=99%, CPU=7%
- F6: FPS=14, GPU=99%, CPU=6%

100,000 instances, 1800 tri/instance = 180,000,000 tri

Radeon HD 5870:
- F2: FPS=5, GPU=99%, CPU=8%
- F3: FPS=5, GPU=99%, CPU=8%
- F4: FPS=5, GPU=99%, CPU=8%
- F5: FPS=5, GPU=99%, CPU=12%
- F6: FPS=5, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=6, GPU=99%, CPU=12%
- F3: FPS=6, GPU=99%, CPU=9%
- F4: FPS=7, GPU=99%, CPU=4%
- F5: FPS=7, GPU=99%, CPU=3%
- F6: FPS=7, GPU=99%, CPU=4%
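
One way to read these tables is to convert the F2 frame rates into draw calls per second, since F2 issues one draw call per instance. The sketch below does that for the 18 tri/instance runs and also computes the F6-over-F2 speedup. The FPS values are copied from the results above; the helper names are made up for this example, and the FPS readings are whole numbers, so the low values (7 or 9 FPS) are coarse.

```python
# FPS values copied from the 18 tri/instance results above.
# Keys: (card, instance count).
F2_FPS = {
    ("HD 5870", 20_000): 43, ("HD 5870", 100_000): 9,
    ("GTX 480", 20_000): 31, ("GTX 480", 100_000): 7,
}
F6_FPS = {
    ("HD 5870", 20_000): 139, ("HD 5870", 100_000): 37,
    ("GTX 480", 20_000): 164, ("GTX 480", 100_000): 35,
}


def draw_calls_per_second(instances: int, fps: int) -> int:
    """F2 issues one draw call per instance, so calls/s = instances x FPS."""
    return instances * fps


def speedup(fast_fps: float, slow_fps: float) -> float:
    """How many times faster the first frame rate is than the second."""
    return fast_fps / slow_fps


if __name__ == "__main__":
    for (card, instances), fps in F2_FPS.items():
        calls = draw_calls_per_second(instances, fps)
        gain = speedup(F6_FPS[(card, instances)], fps)
        print(f"{card}, {instances:>7,} instances: "
              f"{calls:>9,} draw calls/s in F2, F6 is {gain:.2f}x faster")
```

For F2, the HD 5870 lands at 860,000 and 900,000 draw calls per second for the two belt sizes, and the GTX 480 at 620,000 and 700,000. The rate barely changes when the instance count is multiplied by five, which suggests that F2 is limited by the cost of issuing draw calls rather than by the GPU.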



Quick analysis:

  • We understand why NVIDIA has called the technique using persistent vertex attributes Pseudo-Instancing (key F4). The OpenGL glDrawElements() function is extremely fast and persistent vertex attributes require less overhead than uniforms to be passed to the vertex shader. Both coupled together give this performance boost. Very often, Pseudo-Instancing is almost as fast as HW GI. But Pseudo-Instancing is only fast on NVIDIA hardware.
  • The benefit of hardware geometry instancing is mostly visible with few triangles per instance: around 500 triangles per instance seems to be the limit. With many triangles per instance, the GPU is fully loaded (GPU=99%) and all techniques perform similarly.
  • With many triangles per instance (1800 tri) and with many instances (100,000), the NVIDIA driver requires less CPU time than the ATI driver (CPU=3% to 4% versus 8% to 12% for F4 to F6).
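
When all techniques converge, the frame rate is set by how many triangles the GPU can process. As a rough illustration, the sketch below turns the 1800 tri/instance F5 results into submitted triangles per second. The FPS values come from the tables above, the names are hypothetical, and "submitted" means sent to the GPU, not necessarily visible on screen.

```python
TRIS_PER_INSTANCE = 1800

# (card, instances, FPS) copied from the F5 rows of the 1800 tri/instance runs.
RUNS = [
    ("HD 5870", 20_000, 22), ("GTX 480", 20_000, 32),
    ("HD 5870", 100_000, 5), ("GTX 480", 100_000, 7),
]


def triangles_per_second(instances: int, tris_per_instance: int, fps: int) -> int:
    """Triangles submitted to the GPU per second at a given frame rate."""
    return instances * tris_per_instance * fps


if __name__ == "__main__":
    for card, instances, fps in RUNS:
        rate = triangles_per_second(instances, TRIS_PER_INSTANCE, fps)
        print(f"{card}, {instances:>7,} instances: {rate / 1e6:,.0f} M tri/s")
```

With these numbers the HD 5870 sits at roughly 0.8 to 0.9 billion submitted triangles per second and the GTX 480 at roughly 1.15 to 1.26 billion, whatever the instancing technique.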



Do not hesitate to comment on this article, and don’t panic if your comments are not approved quickly. I’m moving and I won’t be connected during the next few days. I’ll approve all comments as soon as possible!

See you later my friends!

Geometry instancing could help me for my move :D


