This article discusses performance optimizations for AMD GPUs and CPUs, using a simple yet widely used, computationally intensive kernel as a case study: diagonal sparse matrix-vector multiplication. We look at several topics that come up during OpenCL™ performance optimization and apply them to the case study:

1. Translating C code to OpenCL™

2. Choosing data structures for dense, aligned memory accesses

3. Using local, on-chip memory

4. Vectorizing the computation for higher efficiency

5. Using OpenCL™ images to improve effective memory bandwidth

6. Parallelism for multicore processors

At the end of our journey, we'll have a high-performance kernel for both the AMD Radeon™ HD 5870 GPU and the AMD Phenom™ II X4 965 CPU.
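For readers who have not met the kernel before, here is a minimal C sketch of diagonal-format sparse matrix-vector multiplication. It is only an illustration, not necessarily the code the article optimizes: the function name `spmv_dia`, the parameter names, and the storage layout (one array of length `n` per stored diagonal, indexed by row) are all assumptions made for this example.

```c
/* Minimal sketch of y = A * x for an n-by-n matrix stored by diagonals.
 * Assumed (hypothetical) layout, chosen for illustration only:
 *   offsets[d]        column offset of stored diagonal d
 *                     (0 = main, +k = k-th superdiagonal, -k = k-th subdiagonal)
 *   diag[d * n + row] the value A[row][row + offsets[d]]
 * Slots whose column would fall outside the matrix are never read. */
void spmv_dia(int n, int ndiags, const int *offsets,
              const float *diag, const float *x, float *y)
{
    for (int row = 0; row < n; row++) {
        float sum = 0.0f;
        for (int d = 0; d < ndiags; d++) {
            int col = row + offsets[d];
            /* Skip the parts of a diagonal that hang off the matrix edge. */
            if (col >= 0 && col < n)
                sum += diag[d * n + row] * x[col];
        }
        y[row] = sum;
    }
}
```

Each output row depends only on the stored diagonals and a few entries of `x`, so rows can be computed independently. That independence is what makes the kernel a natural fit for the data-parallel optimizations listed above.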