« on: March 18, 2015, 02:55:59 PM »
Unlike the competition, Intel’s shader hardware has a full set of registers dedicated to each hardware thread. The red and green team each lose thread occupancy if a shader has a lot of register pressure, but not the blue team, they just exploit their ridiculous process advantage and pack the little suckers in, and then stop worrying about it. Our shader has quite a bit of register pressure in it, but that doesn’t hurt Intel’s concurrency one bit. Their enormous register file functions as a big on-chip buffer.
Even though it is possible to implement geometry shaders efficiently, the fact that two of the three vendors don’t do it that way means that the GS is not a practical choice for production use. It should be avoided wherever possible.
It is flawed, in that it injects a serialized, high bandwidth operation into an already serialized part of the pipeline. It requires a lot of per-thread storage. It is clearly a very unnatural fit for wide SIMD machines. However, this little exercise has made me wonder if it can’t be redeemed by spreading a single instance across multiple warps/wavefronts, squeezing ILP out of a DLP architecture. Perhaps I’ll try and write a compute shader that does this.
Complete story: http://www.joshbarczak.com/blog/?p=667