(Tested) OpenGL Geometry Instancing: GeForce GTX 480 vs Radeon HD 5870

2010/06/29 JeGX


3 – Geometry Instancing Techniques

Here is the description of each geometry instancing technique I used:

F2 key: simple instancing: there is one source for geometry (a mesh) and this geometry is rendered once per instance. The transformation matrix of each instance is calculated on the CPU. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. This is the simplest and least efficient geometry instancing technique…

F3 key: slow pseudo-instancing: there is one source for geometry (a mesh) and this geometry is rendered once per instance. Now the transformation matrix is computed on the GPU and per-instance data is passed via uniform variables (a vec4 for the position and a vec4 for the orientation). OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. This technique is faster than F2 (simple instancing).
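To make the F2/F3 difference concrete, here is a minimal Python sketch of building a per-instance matrix from a position vec4 and an orientation vec4. It assumes the orientation is a quaternion (x, y, z, w); the article does not say how the orientation is encoded, and all function names are made up for illustration. F2 would do this kind of work on the CPU for every instance, while F3 would do the equivalent in the vertex shader from the two uniforms.

```python
import math


def quat_to_matrix(q):
    """3x3 rotation matrix (row-major) from a quaternion (x, y, z, w).

    Assumption: the orientation vec4 is a quaternion; it is normalized here.
    """
    x, y, z, w = q
    n = math.sqrt(x * x + y * y + z * z + w * w)
    x, y, z, w = x / n, y / n, z / n, w / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def instance_matrix(position, orientation):
    """4x4 row-major model matrix: rotate first, then translate."""
    r = quat_to_matrix(orientation)
    px, py, pz, _ = position  # the w component of the position is ignored
    return [r[0] + [px], r[1] + [py], r[2] + [pz], [0.0, 0.0, 0.0, 1.0]]


def transform_point(m, p):
    """Apply a 4x4 row-major matrix to a 3D point (implicit w = 1)."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(3))


# Hypothetical instance: 90 degrees around Z, placed at x = 10.
half = math.pi / 4
m = instance_matrix((10.0, 0.0, 0.0, 1.0), (0.0, 0.0, math.sin(half), math.cos(half)))
print(transform_point(m, (1.0, 0.0, 0.0)))
```

In F2 the resulting matrix would be uploaded before each glDrawElements() call; in F3 only the two vec4 values change per call.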

F4 key: pseudo-instancing: there is one source for geometry (a mesh) and this geometry is rendered once per instance. The transformation matrix is computed on the GPU and per-instance data is passed via persistent vertex attributes. Persistent vertex attributes are, for example, the normal, the texture coordinates or the color (respectively set with glNormal(), glMultiTexCoord() and glColor()). This technique was presented by NVIDIA in the following whitepaper: GLSL Pseudo-Instancing. OpenGL rendering uses the glDrawElements() function. The number of draw calls is equal to the number of instances. Pseudo-instancing is extremely efficient on NVIDIA hardware. See results below…

F5 key: geometry instancing: it’s the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done in batches of 64 instances per draw call. Actually, on NVIDIA hardware, 400 instances can be rendered with one draw call, but that does not work on ATI because of the limit on the number of vertex uniforms. 64 instances per batch work fine on both ATI and NVIDIA. The transformation matrix is computed on the GPU and per-batch data is passed via uniform arrays: there is a uniform array of vec4 for positions and another vec4 array for rotations. OpenGL rendering uses the glDrawElementsInstancedARB() function. The GL_ARB_draw_instanced extension is required. HW GI drastically reduces the number of draw calls: for the 20,000-asteroid belt, we have 20000/64 = 312.5, rounded up to 313 draw calls, instead of 20,000.
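The batch arithmetic above can be sketched in a few lines of Python. The helper names are hypothetical and no OpenGL call is made: each (first, count) pair stands for one glDrawElementsInstancedARB() call, and the vec4 count is what would have to fit under the driver's vertex uniform limit (GL_MAX_VERTEX_UNIFORM_COMPONENTS, which is counted in float components, so four per vec4).

```python
def plan_batches(instance_count, batch_size):
    """Return one (first_instance, count) pair per instanced draw call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (first, min(batch_size, instance_count - first))
        for first in range(0, instance_count, batch_size)
    ]


def uniform_vec4_count(batch_size, vec4_per_instance=2):
    """vec4 uniforms per batch: one position and one rotation per instance."""
    return batch_size * vec4_per_instance


batches = plan_batches(20000, 64)
print(len(batches), batches[-1])   # number of draw calls, last (partial) batch
print(uniform_vec4_count(64))      # vec4 uniforms needed for a 64-instance batch
print(uniform_vec4_count(400))     # vec4 uniforms needed for a 400-instance batch
```

For 20,000 instances and 64 per batch this gives 313 draw calls, the last one covering only 32 instances. A 64-instance batch needs 128 vec4 uniforms and a 400-instance batch needs 800, which is why the larger batch depends on a more generous uniform limit.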

F6 key: geometry instancing with uniform buffer: it’s still the real hardware instancing (HW GI). There is one source for geometry (a mesh) and rendering is done in batches of… 1000 instances per draw call. A 1000-instance batch works fine on the GTX 480 and the HD 5870. The transformation matrix is computed on the GPU and per-batch data is passed via a big buffer of uniforms: a uniform buffer object or UBO. This technique requires support for the GL_ARB_uniform_buffer_object extension. UBO allows a huge reduction in the number of draw calls: for 20,000 instances and 1000 instances per draw call, the complete asteroid belt requires 20 draw calls! Like the previous GI technique, OpenGL rendering uses the glDrawElementsInstancedARB() function.
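As a rough idea of how much data one 1000-instance batch carries, the Python sketch below packs per-batch data into a byte layout that a uniform block could use. It assumes two arrays of vec4 (all positions, then all rotations) stored as 32-bit floats with a 16-byte stride, as in the std140 layout, on a little-endian host. The demo's actual block layout is not given in the article, and pack_instance_block is a hypothetical helper.

```python
import struct

VEC4_BYTES = 16  # four 32-bit floats


def pack_instance_block(positions, rotations):
    """Pack positions[] followed by rotations[] as tightly packed vec4 arrays.

    Assumed layout (not taken from the demo): vec4 positions[N]; vec4 rotations[N];
    """
    if len(positions) != len(rotations):
        raise ValueError("need one rotation per position")
    data = bytearray()
    for v in positions:
        data += struct.pack("<4f", *v)
    for v in rotations:
        data += struct.pack("<4f", *v)
    return bytes(data)


# Placeholder data for one batch of 1000 instances.
n = 1000
block = pack_instance_block([(0.0, 0.0, 0.0, 1.0)] * n, [(0.0, 0.0, 0.0, 1.0)] * n)
print(len(block))  # bytes to upload for one batch

# Offset of the first rotation: it starts right after the positions array.
print(struct.unpack_from("<4f", block, n * VEC4_BYTES))
```

Under these assumptions one batch is 32,000 bytes, which would have to fit within the block size the driver reports for GL_MAX_UNIFORM_BLOCK_SIZE.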


19 thoughts on “(Tested) OpenGL Geometry Instancing: GeForce GTX 480 vs Radeon HD 5870”

Groovounet 2010/06/29 at 23:50

It would be interesting to see how the test behaves with different instancing methods: instanced array, texture buffer, etc. Also, 1800 triangles isn’t that much for the maximum.
Groovounet 2010/06/29 at 23:50

Nice work anyway!
JeGX (Post Author) 2010/06/30 at 00:02

Thanks Mr Groove!
I’ll update this demopack with new techniques next time (at least with instanced array). And I’ll increase the number of polygons per instance 😉
nicolas 2010/06/30 at 00:31

FWIW, note that with more triangles, you’ll be hitting the triangle setup bottleneck of one triangle per clock cycle harder. I don’t know about the NVIDIA GTX 480, but this limit applies to the ATI R5xxx; I don’t think that going beyond 180M triangles will bring you anything good for this generation of boards.

Nice summary of instancing techniques and performance though. Did you try to get finer-grained timings with the ARB_timer_query extension?

Cheers
Mars_999 2010/06/30 at 01:52

Yes, keep these demos and tutorials coming. I enjoy reading them as it keeps me up to date on the newest features I can use with OpenGL. Plus it’s nice to see the old vs. new method of doing the same thing so one can make his own decision on what to do.

Thanks!!! keep up the good work!
WacKEDmaN 2010/06/30 at 05:56

20,000 instances x 18 tri/instance = 360,000 tri
Geforce GTX 470@ 700 core /1800 mem
– F2: FPS=18, GPU=20%, CPU=30%
– F3: FPS=40, GPU=21%, CPU=35%
– F4: FPS=68, GPU=15%, CPU=35%
– F5: FPS=96, GPU=19%, CPU=35%
– F6: FPS=101, GPU=24%, CPU=30%

100,000 instances, 1800 tri/instance = 180,000,000 tri
Geforce GTX 470@ 700 core /1800 mem
– F2: FPS=4, GPU=67%, CPU=30%
– F3: FPS=6, GPU=99%, CPU=24%
– F4: FPS=6, GPU=99%, CPU=14%
– F5: FPS=6, GPU=99%, CPU=12%
– F6: FPS=6, GPU=99%, CPU=10%

Interesting… CPU usage goes down as GPU usage goes up. I thought the CPU would be less stressed with the lower geometry count…
Matumbo 2010/06/30 at 10:07

I tested with a Radeon HD 2400, Catalyst 10.6.
The F6 technique half-failed: the asteroids are there, rotating, but all shading on them is turned off (all black). But the middle planet is shaded.
I suppose this was not intended.
The driver exposes all the 3 required extensions.
IVXXX 2010/06/30 at 11:10

20,000 instances x 18 tri/instance = 360,000 tri
ATI HD4770 @ 940 core / 4800 mem
– F2: FPS=44, GPU=50%, CPU=28%
– F3: FPS=57, GPU=58%, CPU=30%
– F4: FPS=54, GPU=40%, CPU=30%
– F5: FPS=140, GPU=42%, CPU=34%
– F6: FPS=152, GPU=34%, CPU=30% (no shading)

100,000 instances, 1800 tri/instance = 180,000,000 tri
ATI HD4770 @ 940 core / 4800 mem
– F2: FPS=5, GPU=99%, CPU=18%
– F3: FPS=5, GPU=99%, CPU=16%
– F4: FPS=5, GPU=99%, CPU=16%
– F5: FPS=4-8, GPU=99%, CPU=12-28%
– F6: FPS=1-6, GPU=99%, CPU=6-28% (no shading)
ca$per 2010/06/30 at 11:38

Same thing as Matumbo here, but I have a Radeon 4850 on Win7 with Cat. 10.5.
F6 – asteroids are all black with all triangle variations (different EXEs).
TopLess3D 2010/06/30 at 16:32

Could you share your source code please (both C++ and GLSL), in order to learn advanced techniques and programming in OpenGL?
ddd 2010/06/30 at 19:23

Omg, make a GRAPH, not text :p
Psolord 2010/07/08 at 13:47

Completely off topic, but I think our friend JeGX should make a post about this.
http://www.techreport.com/discussions.x/19216
krishx007 2010/07/08 at 17:15

hay JeGX R u died or what?????

Not publishing any new article from 3 to 4 days???

Fight with GirlFriend?????????
krishx007 2010/07/08 at 17:16

r u alive????????
codablank 2010/08/08 at 19:42

Where can I find the source of this demo, or a similar source? I get very bad performance with glDrawElements() (1000 cube instances max) and glDrawElementsInstanced gives the same result.

Yet I get a decent FPS when I run your demo (I’m on ATI 4850, OpenGL 2.1)
Ryan 2010/11/09 at 03:27

Yeah, the source for this would be really handy!
DarkUltra 2011/01/04 at 15:50

180 million eh? According to 3dmark 01, my gtx 470 does 400 million.

http://jooh.no/web/GeForce_8600_GT_vs_GTX_470_polygon_performance.png

Instancing as in the article however, does not use as much cpu power, but is also less useful than “real” cpu aware polygons.

The 470 (Fermi) has several setup engines running in parallel instead of just one as in all the previous GPUs (ATI and NVIDIA).

please delete my previous post
OnlyLinuxLovesUBack 2011/08/13 at 20:42

where do i get the source code?

Thanks!
