So far I’ve covered how SlimGen works and the difficulties involved in doing what it does, including the calling convention issues you need to be aware of when writing replacement methods for use with SlimGen.

So the next question is: just how much of a difference can using SlimGen make? A lot of that will depend on the developer and their skill level, but we were pretty curious ourselves, so we slapped together a test sample that runs through a series of matrix multiplications and times them. It uses three arrays: two of them contain 100,000 randomly generated matrices each, and the third is used as the destination for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, so they are overlap safe.

The timing results will vary from machine to machine, of course, depending on the processor, how much RAM you have, and what else you’re doing at the time. Running the test on my Phenom 9850 I get:

```
Total Matrix Count Per Run: 100,000
Multiply Total Ticks: 2,001,059
SlimGenMultiply Total Ticks: 1,269,200
Improvement: 36.57 %
```

While on my T8300 Core 2 Duo laptop I get:

```
Total Matrix Count Per Run: 100,000
Multiply Total Ticks: 2,175,380
SlimGenMultiply Total Ticks: 1,621,830
Improvement: 25.45 %
```

Still, a 25-35% improvement over the FPU-based multiply is quite significant. Since x64 support hasn’t been fully hammered out (it “works” but hasn’t been sufficiently verified as working), those numbers are unavailable at the moment. They should be available in the near future, though, once we finalize error handling and ensure that there are no bugs in the x64 assembly handling.

So why the great difference in performance? Part of it is method size: the .Net method is 566 bytes of pure code. That’s over half a kilobyte the processor has to walk through, code which needs to be brought into the instruction cache on the CPU and executed. The SSE2 method, meanwhile, is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer misses you take and the more likely your code is to actually be IN the I-cache. Then there are the instructions themselves: SSE2 has been around for a while, so CPU manufacturers have had plenty of time to wrangle with it and ensure it performs optimally. Finally there’s the memory access issue: after the initial reads, the SSE2-based code hits memory only a minimal number of times, which reduces the chances of cache misses, barring a few cases.

Finally, there’s how each version handles storage of temporary results. The .Net FPU-based version allocates a Matrix on the stack, calls the constructor (which zero-initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, copying the temporary matrix over the result matrix. The SSE2 version, however, doesn’t bother initializing its stack space and stores only three of the four rows there, opting to write the final row directly to the destination. The other three rows are then moved back into XMM registers and back out to the destination.

Below is the SSE2 source code, followed by the .Net source code; note that the two are functionally equivalent:

```
start: mov eax, [esp + 4]
movups xmm4, [edx]
movups xmm5, [edx + 0x10]
movups xmm6, [edx + 0x20]
movups xmm7, [edx + 0x30]
movups xmm0, [ecx]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1
movups [esp - 0x20], xmm0 ; store row 0 of new matrix
movups xmm0, [ecx + 0x10]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1
movups [esp - 0x30], xmm0 ; store row 1 of new matrix
movups xmm0, [ecx + 0x20]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1
movups [esp - 0x40], xmm0 ; store row 2 of new matrix
movups xmm0, [ecx + 0x30]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0
shufps xmm0, xmm0, 0x00
shufps xmm1, xmm1, 0x55
shufps xmm2, xmm2, 0xAA
shufps xmm3, xmm3, 0xFF
mulps xmm0, xmm4
mulps xmm1, xmm5
mulps xmm2, xmm6
mulps xmm3, xmm7
addps xmm0, xmm2
addps xmm1, xmm3
addps xmm0, xmm1
movups [eax + 0x30], xmm0 ; store row 3 of new matrix
movups xmm0, [esp - 0x40]
movups [eax + 0x20], xmm0
movups xmm0, [esp - 0x30]
movups [eax + 0x10], xmm0
movups xmm0, [esp - 0x20]
movups [eax], xmm0
ret 4
```

The .Net matrix multiplication source code:

```
public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {
Matrix r;
r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
result = r;
}
```
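For readers more comfortable with intrinsics than raw assembly, the SSE2 routine above maps fairly directly onto C compiler intrinsics. The sketch below is my own translation, not SlimGen output: the compiler manages the temporaries instead of the explicit stack slots, but the per-row broadcast-multiply-accumulate pattern (the `shufps` masks `0x00`/`0x55`/`0xAA`/`0xFF` followed by `mulps`/`addps`) is the same, and like the assembly it computes every row before storing anything, so the destination may alias a source.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_shuffle_ps, ... */

/* out = left * right, for row-major 4x4 float matrices.
   All four result rows are computed before any stores happen,
   so out may alias left or right (overlap safe). */
static void multiply_sse(const float *left, const float *right, float *out) {
    /* the four rows of the right-hand matrix (xmm4-xmm7 in the assembly) */
    __m128 r0 = _mm_loadu_ps(right);
    __m128 r1 = _mm_loadu_ps(right + 4);
    __m128 r2 = _mm_loadu_ps(right + 8);
    __m128 r3 = _mm_loadu_ps(right + 12);

    __m128 rows[4];
    for (int i = 0; i < 4; ++i) {
        __m128 row = _mm_loadu_ps(left + i * 4);
        /* broadcast each element of the row: shufps 0x00/0x55/0xAA/0xFF */
        __m128 x = _mm_shuffle_ps(row, row, 0x00);
        __m128 y = _mm_shuffle_ps(row, row, 0x55);
        __m128 z = _mm_shuffle_ps(row, row, 0xAA);
        __m128 w = _mm_shuffle_ps(row, row, 0xFF);
        /* result row i = x*r0 + y*r1 + z*r2 + w*r3 (mulps/addps chain) */
        rows[i] = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, r0), _mm_mul_ps(y, r1)),
                             _mm_add_ps(_mm_mul_ps(z, r2), _mm_mul_ps(w, r3)));
    }
    for (int i = 0; i < 4; ++i)
        _mm_storeu_ps(out + i * 4, rows[i]);
}
```

Note that four broadcasts, four multiplies, and three adds replace the sixteen multiplies and twelve adds per row that the scalar FPU code performs one element at a time.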