[Test] Simple x87 vs SSE2 Performance Test With Matrix Multiplication



SSE/SSE2 instruction set



In a previous news item, we learnt that the current version of the PhysX engine is compiled for the x87 instruction set and does not use more modern sets such as SSE2. I found a simple, ready-to-use matrix multiplication code sample that lets us see the speed difference between x87 and SSE/SSE2.

I compiled the following code sample with Visual C++ 2005:

#include <windows.h>  // DWORD, timeGetTime()
#include <stdio.h>    // printf()
#pragma comment(lib, "winmm.lib")

#define DIM 4

void mul_mat(double **mat1, double **mat2, int sz)
{
  int ii, jj, kk;
  for (ii = 0; ii < sz; ii++)
  {
    double temp[DIM];
    for (jj = 0; jj < sz; jj++)
    {
      temp[jj] = 0.;
      for (kk = 0; kk < sz; kk++)
        temp[jj] += mat1[ii][kk]*mat2[kk][jj];
    }
    for (jj = 0; jj < sz; jj++)
      mat1[ii][jj] = temp[jj];
  }
}

int main()
{
  const int sz = 4;
  const int num = 1000000;
  int ii, jj;

  double **mat1 = new double*[sz];
  for (ii = 0; ii < sz; ii++)
    mat1[ii] = new double[sz];
  double **mat2 = new double*[sz];
  for (ii = 0; ii < sz; ii++)
    mat2[ii] = new double[sz];

  for (ii = 0; ii < sz; ii++)
    for (jj = 0; jj < sz; jj++)
    {
      mat1[ii][jj] = double(ii)/sz*double(jj)/sz;
      mat2[ii][jj] = double(ii)/sz*double(jj)/sz;
    }

  printf("\nStarting matrix multiplication loop...");
  DWORD start = timeGetTime();
    
  //
  // Main matrix loop:
  //
  for (ii = 0; ii < num; ii++)
    mul_mat(mat1, mat2, sz);
  
  DWORD end = timeGetTime();
  printf("\nElapsed time: %lu ms\n", end-start);  // DWORD is an unsigned long
	
  return 0;
}

As the author of this code says, it’s a very artificial scenario, but it shows the difference between the math instruction sets well.

I limited the main loop to 1 million matrix multiplications, which is enough to get a measurable run time.
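One thing worth keeping in mind with this loop: mul_mat overwrites mat1 at every call, so the values do not stay constant over the run. With the initial values used in main(), each call multiplies mat1 by 14/16 = 0.875 in exact arithmetic, so the entries shrink toward zero as the loop goes on. The sketch below is only an illustration of that behaviour. The names kDim, mul_in_place and corner_after are hypothetical and not part of the test pack, and the matrices are stored in flat arrays to keep the sketch self-contained.

```cpp
#include <vector>

// Hypothetical helpers, not part of the original test pack.
const int kDim = 4;

// Same arithmetic as mul_mat, on flat row-major arrays: a = a * b, in place.
void mul_in_place(std::vector<double>& a, const std::vector<double>& b)
{
  for (int i = 0; i < kDim; i++)
  {
    double temp[kDim];
    for (int j = 0; j < kDim; j++)
    {
      temp[j] = 0.0;
      for (int k = 0; k < kDim; k++)
        temp[j] += a[i * kDim + k] * b[k * kDim + j];
    }
    for (int j = 0; j < kDim; j++)
      a[i * kDim + j] = temp[j];
  }
}

// Starts from the same initial values as main() in the test, runs the given
// number of in-place multiplications and returns the bottom-right element.
double corner_after(int iterations)
{
  std::vector<double> a(kDim * kDim), b(kDim * kDim);
  for (int i = 0; i < kDim; i++)
    for (int j = 0; j < kDim; j++)
      a[i * kDim + j] = b[i * kDim + j] = double(i) / kDim * double(j) / kDim;

  for (int n = 0; n < iterations; n++)
    mul_in_place(a, b);

  return a[(kDim - 1) * kDim + (kDim - 1)];
}
```

corner_after(0) is 0.5625 and corner_after(1) is 0.4921875, which is 0.5625 * 0.875. Printing corner_after(1000000) on a given build shows where the value ends up after the full run; how much this affects the timings was not measured here.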

In the VS2005 project properties, I set Optimization to Minimize Size (/O1) and changed the Enable Enhanced Instruction Set option to test the different sets.
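The Enable Enhanced Instruction Set option corresponds to the /arch:SSE and /arch:SSE2 compiler switches. To double-check which setting a given executable was really built with, a small helper based on predefined macros can be handy. This is only a sketch: the function name active_fp_set is hypothetical, the Visual C++ branch relies on the _M_IX86_FP macro (0 = no /arch option, 1 = /arch:SSE, 2 = /arch:SSE2), and the other branches are an assumption added for 64-bit builds and GCC-style compilers.

```cpp
#include <string>

// Hypothetical helper: reports the floating-point instruction set the
// compiler was allowed to use for this translation unit.
std::string active_fp_set()
{
#if defined(_M_IX86_FP)
  // Visual C++, 32-bit x86 target.
  #if _M_IX86_FP >= 2
  return "SSE2";
  #elif _M_IX86_FP == 1
  return "SSE";
  #else
  return "x87";
  #endif
#elif defined(__SSE2__) || defined(_M_X64)
  // GCC/Clang with SSE2 enabled, or a 64-bit Visual C++ build.
  return "SSE2";
#elif defined(__SSE__)
  return "SSE";
#else
  return "x87 or unknown";
#endif
}
```

Adding printf("%s\n", active_fp_set().c_str()); at the start of main() would then print the set next to the elapsed time.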

You can download the test pack here:
Download x87 / SSE2 test Version 2010.07.11


There are three executables in the pack: x87_test.exe, sse_test.exe and sse2_test.exe. I added a batch file for each one so that the console window pauses at the end.


Test 1 – Instruction Set: No Set or x87
– Elapsed time: 2373 ms

Test 2 – Instruction Set: SSE
– Elapsed time: 2368 ms

Test 3 – Instruction Set: SSE2
– Elapsed time: 1112 ms

In this test, SSE2 is clearly the way to get fast math: 1112 ms against 2373 ms, which is about 2.1 times faster than x87, while the SSE build brings no gain at all. Recompiling the PhysX engine with the SSE2 instruction set would be very nice. But as I said in that news item, a simple recompilation might lead to some incorrect calculation results, so NVIDIA will have to test such a build before releasing it, which may take some time…

Here is the assembly output for the core of the matrix multiplication with each instruction set:

temp[jj] += mat1[ii][kk]*mat2[kk][jj];

Set: No Set

00431A89  mov         eax,dword ptr [ii] 
00431A8C  mov         ecx,dword ptr [mat1] 
00431A8F  mov         edx,dword ptr [ecx+eax*4] 
00431A92  mov         eax,dword ptr [kk] 
00431A95  mov         ecx,dword ptr [mat2] 
00431A98  mov         eax,dword ptr [ecx+eax*4] 
00431A9B  mov         ecx,dword ptr [kk] 
00431A9E  mov         esi,dword ptr [jj] 
00431AA1  fld         qword ptr [edx+ecx*8] 
00431AA4  fmul        qword ptr [eax+esi*8] 
00431AA7  mov         edx,dword ptr [jj] 
00431AAA  fadd        qword ptr temp[edx*8] 
00431AAE  mov         eax,dword ptr [jj] 
00431AB1  fstp        qword ptr temp[eax*8] 
00431AB5  jmp         mul_mat+68h (431A78h) 

Set: SSE Set

00431A89  mov         eax,dword ptr [ii] 
00431A8C  mov         ecx,dword ptr [mat1] 
00431A8F  mov         edx,dword ptr [ecx+eax*4] 
00431A92  mov         eax,dword ptr [kk] 
00431A95  mov         ecx,dword ptr [mat2] 
00431A98  mov         eax,dword ptr [ecx+eax*4] 
00431A9B  mov         ecx,dword ptr [kk] 
00431A9E  mov         esi,dword ptr [jj] 
00431AA1  fld         qword ptr [edx+ecx*8] 
00431AA4  fmul        qword ptr [eax+esi*8] 
00431AA7  mov         edx,dword ptr [jj] 
00431AAA  fadd        qword ptr temp[edx*8] 
00431AAE  mov         eax,dword ptr [jj] 
00431AB1  fstp        qword ptr temp[eax*8] 
00431AB5  jmp         mul_mat+68h (431A78h) 

Set: SSE2 Set

00431A91  mov         eax,dword ptr [ii] 
00431A94  mov         ecx,dword ptr [mat1] 
00431A97  mov         edx,dword ptr [ecx+eax*4] 
00431A9A  mov         eax,dword ptr [kk] 
00431A9D  mov         ecx,dword ptr [mat2] 
00431AA0  mov         eax,dword ptr [ecx+eax*4] 
00431AA3  mov         ecx,dword ptr [kk] 
00431AA6  mov         esi,dword ptr [jj] 
00431AA9  movsd       xmm0,mmword ptr [edx+ecx*8] 
00431AAE  mulsd       xmm0,mmword ptr [eax+esi*8] 
00431AB3  mov         edx,dword ptr [jj] 
00431AB6  addsd       xmm0,mmword ptr temp[edx*8] 
00431ABC  mov         eax,dword ptr [jj] 
00431ABF  movsd       mmword ptr temp[eax*8],xmm0 
00431AC5  jmp         mul_mat+70h (431A80h) 

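No Set and SSE produce exactly the same code, made of x87 instructions (fld, fmul, fadd, fstp). That is not so surprising: SSE only provides single-precision floating-point operations and this test works on doubles, so the SSE setting gives the compiler nothing to use here. The SSE2 build really takes another code path, with scalar double-precision instructions (movsd, mulsd, addsd) working on SSE registers such as xmm0.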
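Since SSE works on single-precision values, a float version of the kernel is the kind of code where the SSE setting could make a difference. Here is a minimal sketch of such a variant. The name mul_mat_f is hypothetical, this function is not in the test pack and was not timed, and whether Visual C++ 2005 really emits SSE instructions for it would have to be checked in the disassembly.

```cpp
#define DIM 4

// Hypothetical single-precision variant of mul_mat (not in the test pack).
// Computes mat1 = mat1 * mat2 in place; sz must not exceed DIM.
void mul_mat_f(float **mat1, float **mat2, int sz)
{
  int ii, jj, kk;
  for (ii = 0; ii < sz; ii++)
  {
    float temp[DIM];
    for (jj = 0; jj < sz; jj++)
    {
      temp[jj] = 0.f;
      for (kk = 0; kk < sz; kk++)
        temp[jj] += mat1[ii][kk]*mat2[kk][jj];
    }
    // Copy the finished row back only once it is fully computed.
    for (jj = 0; jj < sz; jj++)
      mat1[ii][jj] = temp[jj];
  }
}
```

To try it, the double arrays in main() would have to be switched to float as well, and the three builds compared again.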




Geeks3D.com
