discusses performance optimizations for AMD GPUs and CPUs, using a simple yet widely used, computationally intensive kernel as a case study: diagonal sparse matrix-vector multiplication. We look at several topics that come up during OpenCL™ performance optimization and apply them to our case study:
1. Translating C code to OpenCL™
2. Choosing data structures for dense, aligned memory accesses
3. Using local, on-chip memory
4. Vectorizing the computation for higher efficiency
5. Using OpenCL™ images to improve effective memory bandwidth
6. Parallelism for multicore processors
At the end of our journey, we'll have a high-performance kernel for both the AMD Radeon™ HD 5870 GPU and the AMD Phenom™ II X4 965 CPU.
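To make the case study concrete, here is a minimal plain-C sketch of a diagonal sparse matrix-vector multiply. It is not the article's own code. The function name `dia_spmv`, the `offsets` array, and the `diags[d * n + row]` layout are assumptions chosen for illustration. Each stored diagonal is padded to the full matrix dimension, so consecutive rows read consecutive memory locations.

```c
#include <stddef.h>

/* Illustrative sketch only: names and storage layout are assumptions.
 *
 * Computes y = A * x for an n x n matrix A stored by diagonals.
 *   offsets[d]         column offset of diagonal d
 *                      (0 = main, +1 = first superdiagonal, -1 = first subdiagonal)
 *   diags[d * n + row] the entry A[row][row + offsets[d]]; entries whose
 *                      column falls outside the matrix are padding and ignored
 */
void dia_spmv(size_t n, size_t ndiags, const int *offsets,
              const float *diags, const float *x, float *y)
{
    for (size_t row = 0; row < n; ++row) {
        float sum = 0.0f;
        for (size_t d = 0; d < ndiags; ++d) {
            long col = (long)row + offsets[d];
            /* Skip padded entries that lie outside the matrix. */
            if (col >= 0 && col < (long)n)
                sum += diags[d * n + row] * x[col];
        }
        y[row] = sum;
    }
}
```

Each iteration of the outer loop is independent of the others, so one plausible starting point for an OpenCL™ translation is to assign one row to each work-item.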