SlimGen and You, Part ADD EAX, [EAX] of N

So far I’ve covered how SlimGen works and the difficulties in doing what it does, including the calling convention issues one must be aware of when writing replacement methods for use with SlimGen.

So the next question is: just how much of a difference can using SlimGen make? A lot of that will depend on the developer and their skill level, but we were pretty curious ourselves, so we slapped together a test sample that runs through a series of matrix multiplications and times them. It uses three arrays to perform the multiplications: two of them contain 100,000 randomly generated matrices each, and the third is used as the destination for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, so they are overlap safe.
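
The timing harness itself is nothing fancy. Here is a minimal sketch of what it looks like; this is an illustration rather than the actual test code, it assumes the Matrix struct and the static Multiply method shown at the end of this post, and RandomMatrix is a hypothetical helper that simply fills all sixteen elements with random values:

// Rough sketch of the benchmark loop: fill two arrays with random matrices,
// then time 100,000 multiplications into the destination array with a
// System.Diagnostics.Stopwatch. The same loop is then repeated for SlimGenMultiply.
static void RunBenchmark() {
    const int count = 100000;
    var left = new Matrix[count];
    var right = new Matrix[count];
    var results = new Matrix[count];

    var random = new Random();
    for (int i = 0; i < count; i++) {
        left[i] = RandomMatrix(random);   // hypothetical helper, not shown here
        right[i] = RandomMatrix(random);
    }

    var watch = Stopwatch.StartNew();
    for (int i = 0; i < count; i++)
        Multiply(ref left[i], ref right[i], out results[i]);
    watch.Stop();

    Console.WriteLine("Multiply        Total Ticks: {0:N0}", watch.ElapsedTicks);
}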

The timing results will, of course, vary from machine to machine depending on the processor, how much RAM you have, and what else you’re doing at the time. Running the test on my Phenom 9850 I get:

Total Matrix Count Per Run:  100,000  
Multiply        Total Ticks: 2,001,059  
SlimGenMultiply Total Ticks: 1,269,200  
Improvement:                 36.57 %  

When I run it on my Core 2 Duo T8300 laptop I get:

Total Matrix Count Per Run:  100,000  
Multiply        Total Ticks: 2,175,380  
SlimGenMultiply Total Ticks: 1,621,830  
Improvement:                 25.45 %  

Still, a 25-37% improvement over the FPU-based multiply is quite significant. x64 support hasn’t been fully hammered out (it “works” but hasn’t been sufficiently verified as working), so those numbers are unavailable at the moment. They should be available in the near future, though, as we finalize error handling and make sure there are no bugs in the x64 assembly handling.

So why the big difference in performance? Part of it is method size: the .Net method is 566 bytes of pure code. That’s over half a kilobyte of code the processor has to walk through, code that needs to be brought into the instruction cache on the CPU before it can execute, while the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer misses you take and the more likely your code is to actually be IN the I-cache when you need it. Then there are the instructions themselves: SSE2 has been around for a while, so CPU manufacturers have had plenty of time to wrangle it into optimal performance. There’s also the memory traffic: apart from a few cases, the SSE2-based code touches memory a minimal number of times after the first read/write, which reduces the chances of cache misses.

Finally, there’s how each version stores its temporary results. The .Net FPU-based version allocates a Matrix on the stack, calls the constructor (which zero-initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, copying the temporary matrix over the result matrix. The SSE2 version, however, doesn’t bother initializing the stack and only stores three of the result rows there, writing the final row directly to the destination. The three stashed rows are then moved back into XMM registers and back out to the destination.
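
Before reading the assembly, it helps to keep the shape of the computation in mind: each row of the result is a linear combination of the four rows of the right matrix, weighted by the elements of the corresponding row of the left matrix. The shufps instructions broadcast each weight across a register, mulps scales the right-hand rows, and addps sums them. In scalar form (an illustrative sketch only, with a row represented as a plain float[4]), one row works out to:

// One result row = l0*r0 + l1*r1 + l2*r2 + l3*r3, where l0..l3 are the four
// elements of a row of the left matrix and r0..r3 are the rows of the right.
static float[] CombineRow(float l0, float l1, float l2, float l3,
                          float[] r0, float[] r1, float[] r2, float[] r3) {
    var row = new float[4];
    for (int j = 0; j < 4; j++)
        row[j] = (l0 * r0[j]) + (l1 * r1[j]) + (l2 * r2[j]) + (l3 * r3[j]);
    return row;
}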

Here is the SSE2 source code, followed by the .Net source code; note that the two are functionally equivalent:

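; Per the calling convention covered in the earlier parts, the first two
; arguments arrive in registers and the third is passed on the stack:
;   ecx       = ref Matrix left
;   edx       = ref Matrix right
;   [esp + 4] = out Matrix result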
start:      mov     eax, [esp + 4] ; eax = pointer to the result matrix
            movups  xmm4, [edx] ; load the four rows of the right matrix
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]

            movups  xmm0, [ecx] ; load row 0 of the left matrix
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00 ; broadcast element 0 of the row
            shufps  xmm1, xmm1, 0x55 ; broadcast element 1
            shufps  xmm2, xmm2, 0xAA ; broadcast element 2
            shufps  xmm3, xmm3, 0xFF ; broadcast element 3

            mulps   xmm0, xmm4 ; element 0 * right row 0
            mulps   xmm1, xmm5 ; element 1 * right row 1
            mulps   xmm2, xmm6 ; element 2 * right row 2
            mulps   xmm3, xmm7 ; element 3 * right row 3
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1 ; sum of the four scaled rows = row 0 of the result

            movups  [esp - 0x20], xmm0 ; store row 0 of new matrix

            movups  xmm0, [ecx + 0x10] ; row 1 of the left matrix, same pattern as above
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [esp - 0x30], xmm0 ; store row 1 of new matrix

            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [esp - 0x40], xmm0 ; store row 2 of new matrix

            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix
            movups  xmm0, [esp - 0x40] ; copy the three stashed rows out to the destination
            movups  [eax + 0x20], xmm0
            movups  xmm0, [esp - 0x30]
            movups  [eax + 0x10], xmm0
            movups  xmm0, [esp - 0x20]
            movups  [eax], xmm0
            ret     4 ; callee pops the single stack argument

The .Net matrix multiplication source code:

public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {  
    Matrix r;
    r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
    r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
    r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
    r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
    r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
    r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
    r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
    r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
    r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
    r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
    r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
    r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
    r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
    r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
    r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
    r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
    result = r;
}