[Test] OpenGL Geometry Instancing: GeForce GTX 480 vs Radeon HD 5870


Asteroid belt and Geometry Instancing: up to 180 million polygons rendered in real time
I took the time to update the OpenGL Geometry Instancing demo pack I released more than 2 years ago on my infamous lab. The first version of the demo pack was developed and tested with a GeForce 8800 GTX and a Radeon HD 3870, and only the 8800 GTX had HW GI (HardWare Geometry Instancing) support. That’s why the demos didn’t start if the card was not an 8800 GTX…
Today, all recent NVIDIA and AMD graphics cards support HW GI, so a serious update of the pack was necessary. It’s done. And of course it’s not a simple recompilation with the 8800 GTX test removed: the demos have been updated with many more polygons and a new GI technique based on a uniform buffer. And this time, all GI techniques work on both GeForce and Radeon. Thanks to the NV and ATI driver teams!
You can grab the demo pack here:

You can control the camera with the mouse and move it with the WASD keys.
The skybox used in the demo comes from this page.
In short, geometry instancing makes it possible to render several instances of the same mesh at once. Geometry instancing techniques aim to minimize the number of draw calls required to render all instances (and/or to speed them up). The ideal scenario is a single render call for all instances, one render call to rule them all!
This GI demo pack includes 5 GI techniques. Each GI technique can be enabled with the F2 to F6 keys. And to illustrate the GI techniques, the demo renders an asteroid belt with asteroids, lots of asteroids…

The demo is available in 10 versions (sorry, I was too lazy to code a GUI – maybe in a real benchmark version, who knows…):
- 20,000 asteroids, 18 triangles per asteroid: 360,000 tri.
- 20,000 asteroids, 72 triangles per asteroid: 1,440,000 tri.
- 20,000 asteroids, 450 triangles per asteroid: 9,000,000 tri.
- 20,000 asteroids, 800 triangles per asteroid: 16,000,000 tri.
- 20,000 asteroids, 1800 triangles per asteroid: 36,000,000 tri.
and
- 100,000 asteroids, 18 triangles per asteroid: 1,800,000 tri.
- 100,000 asteroids, 72 triangles per asteroid: 7,200,000 tri.
- 100,000 asteroids, 450 triangles per asteroid: 45,000,000 tri.
- 100,000 asteroids, 800 triangles per asteroid: 80,000,000 tri.
- 100,000 asteroids, 1800 triangles per asteroid: 180,000,000 tri.
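As a quick sanity check on the totals above, here is a small Python sketch that rebuilds the 10 versions from the two asteroid counts and the five per-asteroid triangle counts. It is not part of the demo pack, and the names (`total_triangles`, `demo_versions`) are made up for illustration.

```python
# Asteroid counts and triangles per asteroid, taken from the list above.
ASTEROID_COUNTS = (20_000, 100_000)
TRIANGLES_PER_ASTEROID = (18, 72, 450, 800, 1800)


def total_triangles(asteroids: int, tris_per_asteroid: int) -> int:
    """Triangles submitted per frame for the asteroids alone."""
    return asteroids * tris_per_asteroid


def demo_versions():
    """Return (asteroids, tris_per_asteroid, total) for every version of the demo."""
    return [
        (n, t, total_triangles(n, t))
        for n in ASTEROID_COUNTS
        for t in TRIANGLES_PER_ASTEROID
    ]


if __name__ == "__main__":
    for n, t, total in demo_versions():
        print(f"{n:>7,} asteroids x {t:>4} tri = {total:>11,} tri")
```

The totals only count asteroid triangles; anything else in the scene is ignored here.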
Here is the description of each geometry instancing technique I used:
- F2 key: simple instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. The transformation matrix of each instance is calculated on the CPU. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. This is the simplest and most inefficient geometry instancing technique…
- F3 key: slow pseudo-instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. Now the transformation matrix is computed on the GPU and per-instance data is passed via uniform variables (a vec4 for the position and a vec4 for the orientation). OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. This technique is faster than F2 (simple instancing).
- F4 key: pseudo-instancing: there is one source for geometry (a mesh) and this geometry is rendered for each instance. The transformation matrix is computed on the GPU and per-instance data is passed via persistent vertex attributes. Persistent vertex attributes are, for example, the normal, the texture coordinates or the color (set with glNormal(), glMultiTexCoord() and glColor() respectively). This technique was presented by NVIDIA in the following whitepaper: GLSL Pseudo-Instancing. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. Pseudo-instancing is extremely efficient on NVIDIA hardware. See results below…
- F5 key: geometry instancing: it’s the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done in batches of 64 instances per draw call. Actually, on NVIDIA hardware 400 instances can be rendered with one draw call, but that does not work on ATI because of the limit on the number of vertex uniforms. 64 instances per batch work fine on both ATI and NVIDIA. The transformation matrix is computed on the GPU and per-batch data is passed via uniform arrays: there is a uniform array of vec4 for positions and another vec4 array for rotations. OpenGL rendering uses the glDrawElementsInstancedARB() function. The GL_ARB_draw_instanced extension is required. HW GI drastically reduces the number of draw calls: for the 20,000-asteroid belt, we have 20,000/64 = 312.5, rounded up to 313 draw calls instead of 20,000.
- F6 key: geometry instancing with uniform buffer: it’s still the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done in batches of… 1000 instances per draw call. A 1000-instance batch works fine on the GTX 480 and the HD 5870. The transformation matrix is computed on the GPU and per-batch data is passed via a big buffer of uniforms: a uniform buffer object or UBO. This technique requires support for the GL_ARB_uniform_buffer_object extension. UBOs allow a huge reduction in the number of draw calls: for 20,000 instances and 1000 instances per draw call, the complete asteroid belt requires 20 draw calls! Like the previous GI technique, OpenGL rendering uses the glDrawElementsInstancedARB() function.
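The draw-call counts quoted for the five techniques can be reproduced with a bit of arithmetic. The Python sketch below is only an illustration, not the demo's code: the function names are hypothetical, and the only inputs are the batch sizes given above (1 instance per call for F2 to F4, 64 for F5, 1000 for F6) and the two vec4 per instance used by F5.

```python
# Instances submitted per draw call, as described for each technique above.
INSTANCES_PER_DRAW_CALL = {"F2": 1, "F3": 1, "F4": 1, "F5": 64, "F6": 1000}

# F5 passes one vec4 position and one vec4 rotation per instance.
VEC4_PER_INSTANCE = 2


def draw_calls(instances: int, technique: str) -> int:
    """Draw calls needed to render `instances` asteroids (rounded up)."""
    per_call = INSTANCES_PER_DRAW_CALL[technique]
    return (instances + per_call - 1) // per_call


def batches(instances: int, batch_size: int):
    """Yield (first_instance, instance_count) per draw call; the last batch may be partial."""
    for first in range(0, instances, batch_size):
        yield first, min(batch_size, instances - first)


def uniform_vec4_per_batch(batch_size: int) -> int:
    """vec4 uniform slots taken by the position and rotation arrays of one batch."""
    return batch_size * VEC4_PER_INSTANCE


if __name__ == "__main__":
    for key in INSTANCES_PER_DRAW_CALL:
        print(key, draw_calls(20_000, key), draw_calls(100_000, key))
    print("last F5 batch:", list(batches(20_000, 64))[-1])
    print("vec4 slots:", uniform_vec4_per_batch(64), uniform_vec4_per_batch(400))
```

With 64 instances per batch the two arrays take 128 vec4 slots, against 800 for a 400-instance batch, which gives an idea of why the larger batch can run into a vertex uniform limit. The actual limit of each driver is not stated here and would have to be queried at run time.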
Testbed:
- CPU: Intel Core i7 960 @ 3GHz
- RAM: 4GB DDR3 Corsair Dominator 1600MHz
- Mobo: Gigabyte GA-X58A-UD5
- PSU: Antec TPQ 850W
- OS: Windows 7 64-bit
- GeForce driver: R257.21
- Radeon driver: Catalyst 10.6
Graphics cards:
- EVGA GTX 480
- Radeon HD 5870 reference board
For each test, I read the GPU usage with EVGA Precision and the CPU usage with Windows task manager:


12% or 13% means one logical CPU core is fully used.
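The 12% figure follows from the number of logical cores. The short Python sketch below assumes the Core i7 960 of the testbed exposes 8 logical cores (4 physical cores with Hyper-Threading enabled); that count is my assumption, and the function names are made up for illustration.

```python
# Assumption: 4 physical cores with Hyper-Threading enabled = 8 logical cores.
LOGICAL_CORES = 8


def one_core_share(logical_cores: int = LOGICAL_CORES) -> float:
    """Total CPU usage (in %) shown when exactly one logical core is fully busy."""
    return 100.0 / logical_cores


def busy_cores(cpu_percent: float, logical_cores: int = LOGICAL_CORES) -> float:
    """Convert a total CPU usage reading into a number of fully busy logical cores."""
    return round(cpu_percent / one_core_share(logical_cores), 2)


if __name__ == "__main__":
    print(one_core_share())   # share of one logical core
    print(busy_cores(12))     # a 12% reading, close to one full core
    print(busy_cores(4))      # a 4% reading, well under one core
```

A reading of 12% or 13% is therefore just 12.5% rounded down or up by the monitoring tool.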
Here are the results (default resolution: 1024×600 and default camera position):
20,000 instances x 18 tri/instance = 360,000 tri
Radeon HD 5870:
- F2: FPS=43, GPU=68%, CPU=12%
- F3: FPS=55, GPU=88%, CPU=12%
- F4: FPS=45, GPU=25%, CPU=12%
- F5: FPS=134, GPU=22%, CPU=12%
- F6: FPS=139, GPU=24%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=36%, CPU=12%
- F3: FPS=48, GPU=56%, CPU=12%
- F4: FPS=117, GPU=22%, CPU=12%
- F5: FPS=150, GPU=34%, CPU=12%
- F6: FPS=164, GPU=40%, CPU=12%
20,000 instances, 72 tri/instance = 1,440,000 tri
Radeon HD 5870:
- F2: FPS=42, GPU=67%, CPU=12%
- F3: FPS=55, GPU=87%, CPU=12%
- F4: FPS=45, GPU=26%, CPU=12%
- F5: FPS=133, GPU=37%, CPU=12%
- F6: FPS=139, GPU=41%, CPU=12%
GeForce GTX 480:
- F2: FPS=32, GPU=21%, CPU=12%
- F3: FPS=48, GPU=33%, CPU=12%
- F4: FPS=117, GPU=22%, CPU=12%
- F5: FPS=150, GPU=37%, CPU=12%
- F6: FPS=163, GPU=44%, CPU=12%
20,000 instances, 450 tri/instance = 9,000,000 tri
Radeon HD 5870:
- F2: FPS=43, GPU=69%, CPU=12%
- F3: FPS=54, GPU=87%, CPU=12%
- F4: FPS=45, GPU=64%, CPU=12%
- F5: FPS=79, GPU=99%, CPU=12%
- F6: FPS=73, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=53%, CPU=12%
- F3: FPS=44, GPU=99%, CPU=12%
- F4: FPS=113, GPU=99%, CPU=12%
- F5: FPS=114, GPU=99%, CPU=12%
- F6: FPS=112, GPU=99%, CPU=12%
20,000 instances, 800 tri/instance = 16,000,000 tri
Radeon HD 5870:
- F2: FPS=42, GPU=99%, CPU=12%
- F3: FPS=43, GPU=99%, CPU=12%
- F4: FPS=43, GPU=99%, CPU=12%
- F5: FPS=48, GPU=99%, CPU=12%
- F6: FPS=46, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=31, GPU=56%, CPU=12%
- F3: FPS=42, GPU=99%, CPU=12%
- F4: FPS=66, GPU=99%, CPU=12%
- F5: FPS=67, GPU=99%, CPU=12%
- F6: FPS=66, GPU=99%, CPU=12%
20,000 instances, 1800 tri/instance = 36,000,000 tri
Radeon HD 5870:
- F2: FPS=22, GPU=99%, CPU=12%
- F3: FPS=22, GPU=99%, CPU=12%
- F4: FPS=22, GPU=99%, CPU=12%
- F5: FPS=22, GPU=99%, CPU=12%
- F6: FPS=22, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=30, GPU=99%, CPU=12%
- F3: FPS=30, GPU=99%, CPU=9%
- F4: FPS=31, GPU=99%, CPU=12%
- F5: FPS=32, GPU=99%, CPU=12%
- F6: FPS=32, GPU=99%, CPU=12%
100,000 instances, 18 tri/instance = 1,800,000 tri
Radeon HD 5870:
- F2: FPS=9, GPU=67%, CPU=12%
- F3: FPS=12, GPU=85%, CPU=12%
- F4: FPS=10, GPU=23%, CPU=12%
- F5: FPS=33, GPU=21%, CPU=12%
- F6: FPS=37, GPU=20%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=34%, CPU=12%
- F3: FPS=10, GPU=54%, CPU=12%
- F4: FPS=25, GPU=14%, CPU=12%
- F5: FPS=32, GPU=25%, CPU=12%
- F6: FPS=35, GPU=30%, CPU=12%
100,000 instances, 72 tri/instance = 7,200,000 tri
Radeon HD 5870:
- F2: FPS=9, GPU=67%, CPU=12%
- F3: FPS=12, GPU=85%, CPU=12%
- F4: FPS=10, GPU=24%, CPU=12%
- F5: FPS=33, GPU=35%, CPU=12%
- F6: FPS=37, GPU=41%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=36%, CPU=12%
- F3: FPS=10, GPU=57%, CPU=12%
- F4: FPS=25, GPU=38%, CPU=12%
- F5: FPS=32, GPU=36%, CPU=12%
- F6: FPS=35, GPU=45%, CPU=12%
100,000 instances, 450 tri/instance = 45,000,000 tri
Radeon HD 5870:
- F2: FPS=9, GPU=69%, CPU=12%
- F3: FPS=12, GPU=90%, CPU=12%
- F4: FPS=10, GPU=65%, CPU=12%
- F5: FPS=17, GPU=99%, CPU=12%
- F6: FPS=16, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=53%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=23, GPU=99%, CPU=12%
- F5: FPS=24, GPU=99%, CPU=10%
- F6: FPS=23, GPU=99%, CPU=8%
100,000 instances, 800 tri/instance = 80,000,000 tri
Radeon HD 5870:
- F2: FPS=9, GPU=99%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=9, GPU=99%, CPU=12%
- F5: FPS=10, GPU=99%, CPU=12%
- F6: FPS=10, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=7, GPU=56%, CPU=12%
- F3: FPS=9, GPU=99%, CPU=12%
- F4: FPS=14, GPU=99%, CPU=7%
- F5: FPS=14, GPU=99%, CPU=7%
- F6: FPS=14, GPU=99%, CPU=6%
100,000 instances, 1800 tri/instance = 180,000,000 tri
Radeon HD 5870:
- F2: FPS=5, GPU=99%, CPU=8%
- F3: FPS=5, GPU=99%, CPU=8%
- F4: FPS=5, GPU=99%, CPU=8%
- F5: FPS=5, GPU=99%, CPU=12%
- F6: FPS=5, GPU=99%, CPU=12%
GeForce GTX 480:
- F2: FPS=6, GPU=99%, CPU=12%
- F3: FPS=6, GPU=99%, CPU=9%
- F4: FPS=7, GPU=99%, CPU=4%
- F5: FPS=7, GPU=99%, CPU=3%
- F6: FPS=7, GPU=99%, CPU=4%
Quick analysis:
- We understand why NVIDIA has called the technique using persistent vertex attributes Pseudo-Instancing (key F4). The OpenGL glDrawElements() function is extremely fast, and persistent vertex attributes require less overhead than uniforms to be passed to the vertex shader. Both coupled together give this performance boost. Very often, Pseudo-Instancing is almost as fast as HW GI. But Pseudo-Instancing is only fast on NVIDIA hardware.
- The benefit of hardware geometry instancing is mostly visible with few triangles per instance: around 500 triangles per instance seems to be the limit. With many triangles per instance, all techniques perform similarly.
- With many triangles per instance (1800 tri) and many instances (100,000), NVIDIA drivers require much less CPU power than ATI drivers.
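To put numbers on the second point, the Python sketch below computes the speedup of F6 (HW GI with UBO) over F2 (simple instancing) for the 20,000-asteroid runs. The FPS values are copied from the results above; the helper names are made up for illustration.

```python
# (F2 FPS, F6 FPS) per triangles-per-asteroid, 20,000-asteroid runs, from the results above.
FPS_20K = {
    "Radeon HD 5870": {18: (43, 139), 72: (42, 139), 450: (43, 73), 800: (42, 46), 1800: (22, 22)},
    "GeForce GTX 480": {18: (31, 164), 72: (32, 163), 450: (31, 112), 800: (31, 66), 1800: (30, 32)},
}


def speedup(baseline_fps: float, fps: float) -> float:
    """How many times faster `fps` is than `baseline_fps`."""
    return fps / baseline_fps


def speedups(card: str) -> dict:
    """F6-over-F2 speedup for each triangle count, rounded to 2 decimals."""
    return {tris: round(speedup(f2, f6), 2) for tris, (f2, f6) in FPS_20K[card].items()}


if __name__ == "__main__":
    for card in FPS_20K:
        print(card, speedups(card))
```

On both cards the ratio is well above 3 at 18 and 72 triangles per asteroid and falls to about 1 at 1800 triangles, where the GPU is already at 99% whatever the technique.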
Do not hesitate to comment on this article, and don’t panic if your comments are not approved quickly. I’m moving and I won’t be connected for the next few days. I’ll approve all comments as soon as possible!
See you later my friends!

Geometry instancing could help me with my move


















It would be interesting to see how the test behaves with different instancing methods: instanced arrays, texture buffers, etc. Also, 1800 triangles isn’t so much for the maximum.
Nice work anyway!
Thanks Mr Groove!
I’ll update this demopack with new techniques next time (at least with instanced array). And I’ll increase the number of polygons per instance
FWIW, note that with more triangles, you’ll hit the triangle setup bottleneck of one triangle per clock cycle harder. I don’t know about the NVIDIA GTX 480, but this limit applies to the ATI R5xxx; I don’t think that going beyond 180M triangles will bring you anything good on this generation of boards.
Nice summary of instancing techniques and perfs though. Did you try to get finer-grained timings with the ARB_timer_query extension?
Cheers
Yes, keep these demos and tutorials coming. I enjoy reading them as it keeps me up to date on the newest features I can use with OpenGL. Plus it’s nice to see the old vs. new method of doing the same thing, so one can make his own decision on what to do.
Thanks!!! keep up the good work!
20,000 instances x 18 tri/instance = 360,000 tri
Geforce GTX 470@ 700 core /1800 mem
- F2: FPS=18, GPU=20%, CPU=30%
- F3: FPS=40, GPU=21%, CPU=35%
- F4: FPS=68, GPU=15%, CPU=35%
- F5: FPS=96, GPU=19%, CPU=35%
- F6: FPS=101, GPU=24%, CPU=30%
100,000 instances, 1800 tri/instance = 180,000,000 tri
Geforce GTX 470@ 700 core /1800 mem
- F2: FPS=4, GPU=67%, CPU=30%
- F3: FPS=6, GPU=99%, CPU=24%
- F4: FPS=6, GPU=99%, CPU=14%
- F5: FPS=6, GPU=99%, CPU=12%
- F6: FPS=6, GPU=99%, CPU=10%
interesting…CPU usage goes down as GPU goes up..i thought the cpu would be less stressed with the lower geometry count…
I tested with a Radeon HD 2400, Catalyst 10.6.
The F6 technique half-failed: the asteroids are there, rotating, but all shading on them is turned off (all black). But the middle planet is shaded.
I suppose this was not intended.
The driver exposes all the 3 required extensions.
20,000 instances x 18 tri/instance = 360,000 tri
ATI HD4770 @ 940 core / 4800 mem
- F2: FPS=44, GPU=50%, CPU=28%
- F3: FPS=57, GPU=58%, CPU=30%
- F4: FPS=54, GPU=40%, CPU=30%
- F5: FPS=140, GPU=42%, CPU=34%
- F6: FPS=152, GPU=34%, CPU=30% (no shading)
100,000 instances, 1800 tri/instance = 180,000,000 tri
ATI HD4770 @ 940 core / 4800 mem
- F2: FPS=5, GPU=99%, CPU=18%
- F3: FPS=5, GPU=99%, CPU=16%
- F4: FPS=5, GPU=99%, CPU=16%
- F5: FPS=4-8, GPU=99%, CPU=12-28%
- F6: FPS=1-6, GPU=99%, CPU=6-28% (no shading)
Same thing as Matumbo here. But i have Radeon 4850 on Win7 and Cat. 10.5
F6 – asteroids are all black with all triangle variations (different EXEs).
Could you share your source code please (both C++ and GLSL), in order to learn advanced techniques and programming in OpenGL?
Omg, make a GRAPH, not text :p
Completely off topic but I think our friend JegX should make a post about this.
http://www.techreport.com/discussions.x/19216
hay JeGX R u died or what?????
Not publishing any new article from 3 to 4 days???
Fight with GirlFriend?????????
r u alive????????
where can I find the source of this demo ? or a similar source ? I get very bad performance with glDrawElements() (1000 cube instances max ) and glDrawElementsInstanced gives the same result
Yet I get a decent FPS when I run your demo (I’m on ATI 4850, OpenGL 2.1)