Link: http://users.softlab.ece.ntua.gr/~ttsiod/mandelSSE.html

Last weekend, I got to play with an NVIDIA GT240 (around $100). Having read a lot of blogs about GPU programming, I downloaded the CUDA SDK and started reading some of the samples.
In less than one hour, I went from my rather complex SSE inline assembly to a simple, clear Mandelbrot implementation... that ran... 15 times faster!
Let me say this again: roughly 1500% of the original speed. Jaw-dropping. Or, put a different way: I went from 147fps at 320x240... to 210fps... at 1024x768!
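For anyone who wants to check the "15 times" figure, here is the arithmetic behind it, using only the frame rates and resolutions quoted above and assuming that pixels per second is a fair measure of throughput:

```latex
\begin{aligned}
\text{SSE:}  \quad & 147 \times 320 \times 240  = 11{,}289{,}600 \ \text{pixels/s} \\
\text{CUDA:} \quad & 210 \times 1024 \times 768 = 165{,}150{,}720 \ \text{pixels/s} \\
\text{ratio:} \quad & \frac{165{,}150{,}720}{11{,}289{,}600} \approx 14.6
\end{aligned}
```

So "15 times" is a slight round-up of about 14.6.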
I only have one comment for my fellow developers: It is clear that I was lucky - the algorithm in question was perfect for a CUDA implementation. You won't always get this kind of speedup (while at the same time doing it with clearer and significantly less code).
But what I am saying is that you must start looking into these things: CUDA, OpenCL, etc.
__global__ void CoreLoop( int *p,
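                          float xld, float yld,  /* Left-Down coordinates */
                          float xru, float yru,  /* Right-Up coordinates */
                          int MAXX, int MAXY)    /* Window size */
{
    /* ITERA (iteration limit) and lookup (palette) are defined outside this snippet */
    float re,im,rez,imz;
    float t1, t2, o1, o2;
    int k;
    unsigned result = 0;

    /* One thread per pixel: flat thread index -> (x, y) */
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    /* Guard: skip surplus threads if the grid is larger than the image */
    if (idx >= (unsigned)(MAXX*MAXY))
        return;
    int y = idx / MAXX;
    int x = idx % MAXX;

    /* Map the pixel to the point c = re + i*im of the complex plane */
    re = (float) xld + (xru-xld)*x/MAXX;
    im = (float) yld + (yru-yld)*y/MAXY;

    /* Iterate z = z*z + c, starting from z = 0 */
    rez = 0.0f;
    imz = 0.0f;
    k = 0;
    while (k < ITERA)
    {
        o1 = rez * rez;
        o2 = imz * imz;
        t2 = 2 * rez * imz;
        t1 = o1 - o2;
        rez = t1 + re;
        imz = t2 + im;
        if (o1 + o2 > 4)    /* |z|^2 > 4: the point escapes */
        {
            result = k;
            break;
        }
        k++;
    }
    /* result stays 0 for points that never escape */
    p[y*MAXX + x] = lookup[result]; // Palettized lookup
}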
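
The post only shows the kernel, so here is a minimal sketch of what the host side might look like. Everything outside CoreLoop is my assumption, not the original program: the value of ITERA, the choice of __constant__ memory for lookup, the grayscale palette, the 256-thread block size, the viewing window and the CHECK macro are all illustrative. The kernel is restated in compact form (with a bounds guard for the last block) only so that the file compiles on its own.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed values: the post does not show how ITERA or lookup are defined.
#define ITERA 240
__constant__ int lookup[ITERA];   // palette read by the kernel

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Compact restatement of the kernel, one thread per pixel.
__global__ void CoreLoop(int *p, float xld, float yld, float xru, float yru,
                         int MAXX, int MAXY)
{
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= (unsigned)(MAXX * MAXY)) return;   // surplus threads do nothing
    int y = idx / MAXX, x = idx % MAXX;
    float re = xld + (xru - xld) * x / MAXX;
    float im = yld + (yru - yld) * y / MAXY;
    float rez = 0.0f, imz = 0.0f;
    unsigned result = 0;
    for (int k = 0; k < ITERA; k++) {
        float o1 = rez * rez, o2 = imz * imz;
        float t2 = 2 * rez * imz;
        rez = o1 - o2 + re;
        imz = t2 + im;
        if (o1 + o2 > 4) { result = k; break; }
    }
    p[idx] = lookup[result];
}

int main()
{
    const int MAXX = 1024, MAXY = 768, N = MAXX * MAXY;

    // Illustrative grayscale palette: palette value equals escape iteration.
    int palette[ITERA];
    for (int i = 0; i < ITERA; i++) palette[i] = i;
    CHECK(cudaMemcpyToSymbol(lookup, palette, sizeof(palette)));

    int *d_p = NULL;
    CHECK(cudaMalloc((void **)&d_p, N * sizeof(int)));

    // Enough 256-thread blocks to cover every pixel (rounded up).
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    CoreLoop<<<blocks, threads>>>(d_p, -2.0f, -1.25f, 1.0f, 1.25f, MAXX, MAXY);
    CHECK(cudaGetLastError());

    // Copy the frame back; this memcpy also waits for the kernel to finish.
    int *h_p = (int *)malloc(N * sizeof(int));
    CHECK(cudaMemcpy(h_p, d_p, N * sizeof(int), cudaMemcpyDeviceToHost));
    printf("palette value at the center pixel: %d\n",
           h_p[(MAXY / 2) * MAXX + MAXX / 2]);

    free(h_p);
    CHECK(cudaFree(d_p));
    return 0;
}
```

Assuming the file is saved as mandel.cu, something like nvcc -o mandel mandel.cu should build it. A real-time viewer would more likely render into a buffer shared with the display (for example through OpenGL interop) than copy each frame back to the host, but that is beyond this sketch.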