Author Topic: Optimization - SIMD (parts 1 and 2)  (Read 3854 times)


« on: May 18, 2020, 05:57:28 PM »
This is the third article in the series of posts on Optimization, which accompany online lectures for the IGAD program of the Breda University of Applied Sciences. You can find the first post (on profiling) here. The second post, on low-level optimization, introduced the Rules of Engagement, which can help when approaching the bottlenecks found using profiling. The last Rule, #7, is: do things simultaneously. This rule was not discussed in detail there.

Doing things simultaneously means: using multiple cores, or multiple processors, or a CPU and a GPU. However, before we start spawning threads, we can do things in parallel in a single thread, using a concept that is called instruction level parallelism (ILP).

ILP happens naturally in a superscalar CPU pipeline, where multiple instructions are fetched, decoded and executed in each cycle, at least under the right conditions. A second form of ILP is provided by certain complex CPU instructions that perform multiple operations. An example is fused multiply and add (FMA), which performs a calculation like a = b·c + d using a single assembler instruction. A particularly interesting class of complex instructions are the vector operations, which apply a single operation to multiple data. When applied correctly, single instruction multiple data (SIMD) can yield great speedups, of four or eight times, and sometimes more. This doesn't hinder multi-core processing either: whatever gains we obtain in a single thread using SIMD code scale up even further when we finally do spawn those extra threads.