SlimGen and You, Part ADD [EAX], AL of N

Imagine you could have the safety of managed code, and the speed of SIMD all in one? Sounds like one of those weird dreams Trent has, or perhaps you are already thinking of using C++/CLI to wrap SIMD methods to help reduce the unmanaged transition overhead. You might also be thinking about pinvoking DLL methods such as those used in the D3DX framework to take advantage of its SIMD capabilities.

While all of those are quite possible, and for sufficiently large problems quite efficient too, they also have a relatively high cost of invocation. Managed to unmanaged transitions, even in the best of cases, costs a pretty penny. Registers have to be saved, marshalling of non-fundamental types has to be performed, and in many cases an interop thunk has to be created/jitted. This is a case where the best option is to do as much work as you can in one area before transitioning to the next.

But you can’t always do tons of work at once, a prime example is that of managing your game state. You’ll have discrete transformations of objects, but batching up those transformations to perform them all at once because a management nightmare. You have to craft special data-structures to avoid marshalling, use pinned arrays, and in general you end up doing a lot of work maintaining the two, will spend plenty of time debugging your interface, and may actually not gain anything speed wise still.

If you’re wondering just how bad the interop transition is, you can take a look at my previous entries, where I explored the topic in some detail.

In the .Net framework, most code runs almost as fast, as fast, or faster than the comparable native counterparts. There are cases where the framework is significantly faster, and cases where it loses out at about 10% in the worst case. 10% isn’t a horrible loss, and it’s not a consistent loss either. The cost will vary depending on factors such as: is JITing required, is memory allocation performed, are you doing FPU math that would be vectorized in native code?

In fact, that 10% figure isn’t accurate either: If a method requires JITting the first time it is called, which could cost you 10% on the first invocation, future invocations will not need JITing and so the cost may end up being the same as its native counterpart henceforth. If the method is called a thousand times, then that’s only an additional .01% cost over the entire set of invocations.

The only real area that the .Net framework seriously loses out to unmanaged code is in the math department. The inability to use vectorization can significantly increase the cost of managed math over that of unamanged math code, that 10% figure rears its ugly head here. On the integer math side of things managed code is almost on equal footing with unmanaged code, although there are some vectorized operations you can perform that will enhance integer operations quite significantly, but in general the two add up to be about the same. However when it comes to floating point performance managed code loses out due to its dependency on the FPU or single float SSE instructions. The ability to vectorize large chunks of floating point math can work wonders for unmanaged code.

Well, all is not lost for those of us who love the managed world… SlimGen is here. Exactly what SlimGen is will be delved into later, but here’s a sample preview of what it can do:

SlimDX.Matrix.Multiply(SlimDX.Matrix ByRef, SlimDX.Matrix ByRef, SlimDX.Matrix ByRef)  
Begin 5a856e64, size 293  
5A856E64 8B442404         mov         eax,dword ptr [esp+4]  
5A856E68 0F1022           movups      xmm4,xmmword ptr [edx]  
5A856E6B 0F106A10         movups      xmm5,xmmword ptr [edx+10h]  
5A856E6F 0F107220         movups      xmm6,xmmword ptr [edx+20h]  
5A856E73 0F107A30         movups      xmm7,xmmword ptr [edx+30h]  
5A856E77 0F1001           movups      xmm0,xmmword ptr [ecx]  
5A856E7A 0F28C8           movaps      xmm1,xmm0