Playing With The .NET JIT Part 1

Introduction

.NET has been getting some interesting press recently, even to the point where an article in Game Developer Magazine advocated the use of managed code for rapid development of components. However, I raised some issues with the author regarding the performance metric he used. So I have decided to cover some issues with .NET performance, some future benefits, and hopefully even a few solutions to the problems I'll be posing.

Ultimately, the performance of your application will be determined by the algorithms and data structures you use. No amount of micro-optimization can account for the huge performance differences that can crop up between different choices of algorithm. Thus the most important tool in your arsenal is a decent profiler, and thankfully there are many good profilers available for the .NET platform. Some profiling tools are specific to certain areas of managed coding, such as the CLR Profiler, which is useful for examining the allocation patterns of a managed application. Others, like DevPartner, can profile the entire application, identifying performance bottlenecks in both managed and unmanaged code. Finally, there are the low-level profiling tools, such as the SOS Debugging Tools; these give you extremely detailed information about the behavior of your systems, but are hard to use.
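Even without a dedicated profiler, coarse timings can demonstrate how much the data-structure choice matters. Here is a minimal sketch using `System.Diagnostics.Stopwatch`; the collections and sizes are arbitrary examples, not from the article:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class TimingDemo
{
    // Compares a linear List<int> scan against a hashed Dictionary lookup;
    // the algorithmic difference (O(n) vs. roughly O(1)) dominates any
    // micro-optimization you could make to either loop body.
    static void Main()
    {
        List<int> list = new List<int>();
        Dictionary<int, bool> table = new Dictionary<int, bool>();
        for (int i = 0; i < 100000; i++) { list.Add(i); table[i] = true; }

        Stopwatch sw = Stopwatch.StartNew();
        bool foundList = list.Contains(99999);      // O(n) scan
        sw.Stop();
        long listTicks = sw.ElapsedTicks;

        sw.Reset();
        sw.Start();
        bool foundTable = table.ContainsKey(99999); // hashed lookup
        sw.Stop();

        Console.WriteLine("List: {0} ticks, Dictionary: {1} ticks",
                          listTicks, sw.ElapsedTicks);
        Console.WriteLine(foundList && foundTable);
    }
}
```

A one-off timing like this is no substitute for a real profiler, but it is often enough to compare two candidate algorithms.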

Applications designed and built for a managed platform tend to embody different design decisions than unmanaged applications. Even such fundamental things as memory allocation patterns are usually quite different. With object lifetimes being non-deterministic, one has to apply different patterns to ensure the timely release of resources. Allocation patterns also differ, partly due to the inability to allocate objects on the stack, but also due to the ease of allocation on the managed heap. Allocating on an unmanaged heap typically requires a heap walk to find a free block at least as large as the requested size. The managed allocator simply allocates at the end of the heap, resulting in significantly faster (essentially constant-time) allocations. These changes to the underlying assumptions that drive the system typically produce large, sweeping changes in its overall design.
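Since finalizers run at the garbage collector's discretion, deterministic cleanup in C# is usually expressed through `IDisposable` and the `using` statement; a minimal sketch (the file name here is arbitrary):

```csharp
using System;
using System.IO;

class DeterministicCleanup
{
    static void Main()
    {
        // The using block guarantees Dispose runs the moment the scope
        // exits, releasing the file handle immediately rather than
        // whenever the garbage collector happens to finalize the writer.
        using (StreamWriter writer = new StreamWriter("log.txt"))
        {
            writer.WriteLine("resource released deterministically");
        }
        // The handle is already closed here; the file can be reopened freely.
        Console.WriteLine(File.ReadAllText("log.txt").Trim());
    }
}
```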

Future Developments

Theoretically, a JIT compiler can outperform a standard compiler simply because it can target the host platform in ways that traditional compilation cannot. Traditionally, to target different instruction sets, you would have to compile a binary for each instruction set. For instance, targeting SSE2 would require you to build a binary separate from that of your non-SSE2 branch. You could, of course, do this through the use of DLLs, or by hand-writing your SSE2 code and using function pointers to dictate which branch to choose at run time.
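The function-pointer approach maps naturally onto delegates in managed code. The sketch below picks an implementation once at startup; the capability check and the "SSE2" routine are placeholders (a real version would query the CPU from unmanaged code and P/Invoke a hand-written SSE2 routine):

```csharp
using System;

static class DotProduct
{
    delegate float DotImpl(float[] a, float[] b);

    // Selected once at startup, so the per-call cost is a single
    // delegate invocation rather than a branch on every call.
    static readonly DotImpl Dot = SupportsSse2()
        ? new DotImpl(DotSse2)
        : new DotImpl(DotScalar);

    // Placeholder: a real check would come from an unmanaged helper.
    static bool SupportsSse2() { return false; }

    static float DotScalar(float[] a, float[] b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Stand-in for a hand-optimized SSE2 routine in an unmanaged DLL.
    static float DotSse2(float[] a, float[] b) { return DotScalar(a, b); }

    static void Main()
    {
        // 1*4 + 2*5 + 3*6 = 32
        Console.WriteLine(Dot(new float[] { 1, 2, 3 },
                              new float[] { 4, 5, 6 }));
    }
}
```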

Hand-written SIMD code is often faster than compiler-generated SIMD, because the programmer can manually vectorize the data layout, enabling true SIMD operations. Some compilers, like the Intel C++ Compiler, can perform automatic vectorization. However, they cannot guarantee the accuracy of the resulting binary, and extensive testing typically has to be done to ensure that the functionality was generated correctly. While most compilers have the option to target SIMD instruction sets, they usually just use them to replace standard floating-point operations where they can, as the scalar SIMD instructions are generally faster than their FPU counterparts.

The JIT compiler could target any SIMD instruction set supported by its host platform, along with any other hardware-specific optimizations it knew about. While automatic vectorization is not likely to appear in a JIT release anytime soon, even the non-vectorized SIMD instructions can help parallelize your processing: multiple independent SIMD operations can typically run in parallel (an add and a multiply could execute simultaneously, for instance). Furthermore, the JIT allows any .NET application to target any system it supports, provided the libraries the application uses are also available on that system. This means that, provided you aren't doing anything highly non-portable such as assuming that a pointer is 32 bits, your application could be JIT compiled to 64-bit machine code and run natively that way.
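Code that needs to stay portable across 32-bit and 64-bit runtimes can consult `IntPtr.Size` instead of baking in an assumed pointer width; a minimal sketch:

```csharp
using System;

class PointerSize
{
    static void Main()
    {
        // IntPtr.Size is 4 under a 32-bit runtime and 8 under a 64-bit
        // runtime; any code serializing pointers or doing raw pointer
        // arithmetic should use it rather than a hard-coded 4.
        Console.WriteLine("Pointer size: {0} bytes", IntPtr.Size);
        Console.WriteLine("Running as 64-bit: {0}", IntPtr.Size == 8);
    }
}
```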

Another area of potential advancement is profile-guided optimization (POGO). Currently POGO is restricted to unmanaged applications, as it requires the ability to generate raw machine code and to reorder instructions. In essence, you instrument an application with a POGO profiler; then you use the application normally, allowing the profiler to collect usage data and find the hotspots. Finally you run the optimizer, which rebuilds the application, using the gathered profiling data to optimize its most heavily used sections. A JIT compiler could instrument a managed program on first launch and watch its usage, while in another thread it optimized the machine code using the profiling data being gathered. The resulting cached binary image would be optimized on the next launch (except for those areas that had not been exercised, and which the JIT therefore had not compiled yet). This would be especially effective on systems with multiple cores.

JIT Compilation for the x86

The JIT compiler for the x86 platform, as of .NET 2.0, does not support SIMD instruction sets. It will generate the occasional MMX or SSE instruction for some integral and floating-point promotions, but otherwise it will not utilize them. Inlining poses its own problems for the JIT compiler. Currently the JIT will only inline methods that are 32 bytes of IL or smaller. Because the JIT compiler runs under an extremely tight time constraint, it is forced to make sacrifices in the optimizations it performs. Inlining is typically an expensive operation because it requires shuffling around the addresses of everything that comes after the inlined code (which means interpreting the IL, determining whether each address lies before or after the inlined code, and then making the appropriate adjustments). Because of this, all but the smallest methods will not be inlined. Here's a sample of a method that will not be inlined, along with the IL that accompanies it:

public float SquareMagnitude() {  
    return X * X + Y * Y + Z * Z;
}

.method public hidebysig instance float32 SquareMagnitude() cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 Performance_Tests.Vector3::X
    L_0006: ldarg.0
    L_0007: ldfld float32 Performance_Tests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 Performance_Tests.Vector3::Y
    L_0013: ldarg.0
    L_0014: ldfld float32 Performance_Tests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 Performance_Tests.Vector3::Z
    L_0021: ldarg.0
    L_0022: ldfld float32 Performance_Tests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: ret
}

This method, as you can tell, is 42 bytes of IL, counting the return instruction — clearly over the 32-byte limit. However, it compiles down to just 25 bytes of x86:

002802C0 D901             fld         dword ptr [ecx]
002802C2 D9C0             fld         st(0)
002802C4 DEC9             fmulp       st(1),st
002802C6 D94104           fld         dword ptr [ecx+4]
002802C9 D9C0             fld         st(0)
002802CB DEC9             fmulp       st(1),st
002802CD DEC1             faddp       st(1),st
002802CF D94108           fld         dword ptr [ecx+8]
002802D2 D9C0             fld         st(0)
002802D4 DEC9             fmulp       st(1),st
002802D6 DEC1             faddp       st(1),st
002802D8 C3               ret

Methods that call this one, such as the Magnitude method, may themselves be candidates for inlining, however; Magnitude typically reduces to a call to SquareMagnitude followed by an fsqrt.
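The Magnitude method itself isn't shown above; a plausible definition, assuming the same Vector3 fields as the earlier listing, would look like this:

```csharp
using System;

struct Vector3
{
    public float X, Y, Z;

    public float SquareMagnitude()
    {
        return X * X + Y * Y + Z * Z;
    }

    // Small in IL terms, so it is a plausible inlining candidate;
    // once inlined it reduces to the SquareMagnitude call followed
    // by a square-root instruction.
    public float Magnitude()
    {
        return (float)Math.Sqrt(SquareMagnitude());
    }

    static void Main()
    {
        Vector3 v;
        v.X = 3; v.Y = 4; v.Z = 0;
        Console.WriteLine(v.Magnitude()); // 5
    }
}
```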

Another area where the JIT has issues is value types and inlining: methods that take value-type parameters are not currently considered for inlining at all. There is a fix in the pipeline for this, as it is considered a bug. An example of this behavior is the following function, which, although far below the 32-byte IL limit, will not be inlined:

static float WillNotInline32(float f) {  
    return f * f;
}

.method private hidebysig static float32 WillNotInline32(float32 f) cil managed
{
  .maxstack 8
  L_0000: ldarg.0
  L_0001: ldarg.0
  L_0002: mul
  L_0003: ret
}

The resulting call to this function, and the assembly code of the function itself, look as follows:

0087008F FF75F4           push        dword ptr [ebp-0Ch]
00870092 FF154C302A00     call        dword ptr ds:[002A304Ch]
----
003F01F8 D9442404         fld         dword ptr [esp+4]
003F01FC DCC8             fmul        st(0),st
003F01FE C20400           ret         4

Clearly the x86 JIT requires a lot more work before it will be able to produce machine code approaching that of a good optimizing compiler. However, the news isn't all grim: interop between .NET and unmanaged code allows you to write the methods that need to be highly optimized in a lower-level language.
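The usual interop path is P/Invoke. The sketch below declares a hypothetical hand-optimized routine; both the DLL name and the export are illustrative assumptions, not a real library. Note that the DLL is only loaded on the first call, so merely declaring the import costs nothing:

```csharp
using System;
using System.Runtime.InteropServices;

static class Interop
{
    // Hypothetical hand-written SSE2 routine exported from an unmanaged
    // DLL; "fastmath.dll" and "DotProduct4" are illustrative names only.
    [DllImport("fastmath.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern float DotProduct4(float[] a, float[] b);

    static void Main()
    {
        // DllImport binding is deferred until the first invocation, so
        // this program runs even without fastmath.dll present as long
        // as DotProduct4 is never actually called.
        Console.WriteLine("declared");
    }
}
```

Each P/Invoke call carries a fixed marshaling overhead, so this pays off for chunky operations (transforming an array of vertices), not for tiny ones called in a tight managed loop.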