Playing With The .NET JIT Part 2

Previously I discussed various issues the x86 JIT has with inlining non-trivial methods and with functions taking or returning value types. In this entry I hope to cover some potential pitfalls facing would-be optimizers, along with some unexpected optimizations that do take place.

Optimizations That Aren't

It is not that uncommon to see people advocating the use of unsafe code as a means of producing "optimized" code in the managed environment. The idea is a simple one: by getting down to the metal with pointers and all that fun stuff, you can somehow produce code that will be "optimized" in ways that typical managed code cannot be.

Unsafe code does not allow you to manipulate pointers to managed objects in whatever manner you please. Certain steps have to be taken to ensure that your operations are safe with regard to the managed heap; just because your code is marked as "unsafe" doesn't mean that it is free to do whatever it wants. For example, you cannot assign a pointer the address of a managed object without first pinning the object. Pointers are not tracked by the GC, so if you obtained a pointer to an object and a collection then occurred, you could end up accessing a region of memory that has since been reclaimed. Even if the object survives the collection, the GC may shuffle it around on the heap; that shuffling would invalidate your pointer, and since pointers are not tracked by the GC, yours would not be updated (whereas references to objects are). Pinning solves both problems, which is why you are only allowed to take the address of an object that has been pinned: in essence, a pinned object can be neither moved nor collected by the GC until it is unpinned. Pinning is typically done through the fixed keyword in C# or the GCHandle structure.
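As a rough sketch of what pinning looks like in practice (the array, method, and class names here are hypothetical, purely for illustration), both the fixed statement and GCHandle achieve the same thing:

```csharp
using System;
using System.Runtime.InteropServices;

static class PinningExample   // hypothetical name, not from the original post
{
    public static unsafe void Demo()
    {
        float[] data = { 1f, 2f, 3f };

        // The fixed statement pins 'data' for the duration of the block;
        // the GC may neither move nor collect it while it is pinned.
        fixed (float* p = data)
        {
            float sum = p[0] + p[1] + p[2];
            Console.WriteLine(sum);
        }   // 'data' is unpinned here and is free to move again

        // The same effect, done explicitly with a GCHandle:
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            IntPtr address = handle.AddrOfPinnedObject();
            // ... hand 'address' to unmanaged code, etc. ...
        }
        finally
        {
            handle.Free();  // release it, or the object stays pinned
        }
    }
}
```

The fixed statement is preferable for short-lived pins since it cannot leak; GCHandle is for pins that must outlive a single lexical scope.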

Much like how a fixed object cannot be moved by the GC, the pointer declared by a fixed statement cannot be reassigned. This makes it awkward to traverse arrays of primitives, as you end up either creating additional temporary pointers or limiting the fixed region to a small segment. Fixed statements and unsafe code also increase the overall size of the produced IL by a fairly significant margin. While larger IL is not directly indicative of larger machine code, it does prevent the runtime from inlining such methods. As an example, the two following snippets show the difference between a safe magnitude computation and an unsafe one; note that the unsafe version uses a fixed-size buffer.

public float Magnitude() {  
    return (float)Math.Sqrt(X * X + Y * Y + Z * Z);
}

.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.0
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.0
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.0
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_002f: conv.r4
    L_0030: ret
}

public unsafe float Magnitude() {
    fixed (float* p = V) {
        return (float)Math.Sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
    }
}

.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 4
    .locals init (
    [0] float32& pinned singleRef1,
    [1] float32 single1)
    L_0000: ldarg.0
    L_0001: ldflda PerformanceTests.Unsafe.Vector3/e__FixedBuffer0 PerformanceTests.Unsafe.Vector3::V
    L_0006: ldflda float32 PerformanceTests.Unsafe.Vector3/e__FixedBuffer0::FixedElementField
    L_000b: stloc.0
    L_000c: ldloc.0
    L_000d: conv.i
    L_000e: ldind.r4
    L_000f: ldloc.0
    L_0010: conv.i
    L_0011: ldind.r4
    L_0012: mul
    L_0013: ldloc.0
    L_0014: conv.i
    L_0015: ldc.i4.4
    L_0016: conv.i
    L_0017: add
    L_0018: ldind.r4
    L_0019: ldloc.0
    L_001a: conv.i
    L_001b: ldc.i4.4
    L_001c: conv.i
    L_001d: add
    L_001e: ldind.r4
    L_001f: mul
    L_0020: add
    L_0021: ldloc.0
    L_0022: conv.i
    L_0023: ldc.i4.8
    L_0024: conv.i
    L_0025: add
    L_0026: ldind.r4
    L_0027: ldloc.0
    L_0028: conv.i
    L_0029: ldc.i4.8
    L_002a: conv.i
    L_002b: add
    L_002c: ldind.r4
    L_002d: mul
    L_002e: add
    L_002f: conv.r8
    L_0030: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_0035: conv.r4
    L_0036: stloc.1
    L_0037: leave.s L_0039
    L_0039: ldloc.1
    L_003a: ret
}

Note that neither of these two appears to be a candidate for inlining, both being well over the 32-byte IL limit. The produced IL, while not directly indicative of the assembly the JIT compiler will emit, does give an overall idea of how much larger we should expect this method to be when reproduced in machine code. Fixed-size buffers have other issues that need addressing: you cannot access a fixed-size buffer outside of a fixed statement; they are an unsafe construct, so you must mark the containing type as unsafe; and they cause the compiler to generate helper types at compile time that can throw off serialization and other reflection-based mechanisms.
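For reference, here is a sketch of how the unsafe Vector3 above would be declared. The unsafe modifier is mandatory, the buffer may only appear inside a struct, and the compiler-generated helper type for the buffer (the e__FixedBuffer0 type visible in the IL dump) is exactly what can confuse serializers and other reflection-based code:

```csharp
// Sketch of the unsafe Vector3 used in the snippets above.
public unsafe struct Vector3
{
    // A fixed-size buffer: three floats (X, Y, Z) stored inline in the
    // struct. The compiler emits a nested helper type to represent it.
    public fixed float V[3];
}
```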

In the end, unsafe code does not increase performance, and the reliance upon the constructs required for safety, such as fixed, introduces more problems than it solves. Furthermore, even the smallest method that might otherwise be inlined tends to bloat to the point where inlining by the JIT is no longer possible.

Surprising Developments and JIT Optimizations

Previously I noted that the JIT compiler will only inline a method that is at most 32 bytes of IL in length. However, I wasn't completely honest with you: in some cases the JIT compiler will inline methods longer than 32 bytes of IL. I have not dug deeply into the reasons for this, nor into exactly when the conditions arise, so treat the following as an informal experimental result. When a function returns the result of an intrinsic operation, a condition may arise whereby the whole function is inlined. Two examples of this behavior are shown below; note that in both cases the function calls an intrinsic math function and that neither takes a value type by value (which would prevent inlining). The first is the Magnitude function, which we saw above. Calling it results in it being inlined, producing the following assembly.

00220164 D945D4         fld        dword ptr [ebp-2Ch]
00220167 D8C8           fmul       st,st(0)
00220169 D945D8         fld        dword ptr [ebp-28h]
0022016C D8C8           fmul       st,st(0)
0022016E DEC1           faddp      st(1),st
00220170 D945DC         fld        dword ptr [ebp-24h]
00220173 D8C8           fmul       st,st(0)
00220175 DEC1           faddp      st(1),st
00220177 DD5D9C         fstp       qword ptr [ebp-64h]
0022017A DD459C         fld        qword ptr [ebp-64h]
0022017D D9FA           fsqrt

Note that this is essentially the optimal form for the magnitude function: a minimal number of memory reads, with the majority of the work taking place on the FPU stack. Compare it against the unsafe version, shown next, and you can clearly see how much worse the unsafe code is.

007A0438 55             push       ebp
007A0439 8BEC           mov        ebp,esp
007A043B 57             push       edi
007A043C 56             push       esi
007A043D 53             push       ebx
007A043E 83EC10         sub        esp,10h
007A0441 33C0           xor        eax,eax
007A0443 8945F0         mov        dword ptr [ebp-10h],eax
007A0446 894DF0         mov        dword ptr [ebp-10h],ecx
007A0449 D901           fld        dword ptr [ecx]
007A044B 8BF1           mov        esi,ecx
007A044D D80E           fmul       dword ptr [esi]
007A044F 8BF9           mov        edi,ecx
007A0451 D94704         fld        dword ptr [edi+4]
007A0454 8BD1           mov        edx,ecx
007A0456 D84A04         fmul       dword ptr [edx+4]
007A0459 DEC1           faddp      st(1),st
007A045B 8BC1           mov        eax,ecx
007A045D D94008         fld        dword ptr [eax+8]
007A0460 8BD8           mov        ebx,eax
007A0462 D84B08         fmul       dword ptr [ebx+8]
007A0465 DEC1           faddp      st(1),st
007A0467 DD5DE4         fstp       qword ptr [ebp-1Ch]
007A046A DD45E4         fld        qword ptr [ebp-1Ch]
007A046D D9FA           fsqrt
007A046F D95DEC         fstp       dword ptr [ebp-14h]
007A0472 D945EC         fld        dword ptr [ebp-14h]
007A0475 8D65F4         lea        esp,[ebp-0Ch]
007A0478 5B             pop        ebx
007A0479 5E             pop        esi
007A047A 5F             pop        edi
007A047B 5D             pop        ebp
007A047C C3             ret

Next up is a fairly ubiquitous utility function that obtains the angle between two unit-length vectors. Note that acos is not directly producible as a single machine instruction; nonetheless it is treated as an intrinsic function. As we see below, this produces a nicely optimized set of instructions, with only a single call to a function (the one that computes acos).

public static float AngleBetween(ref Vector3 lhs, ref Vector3 rhs) {  
    return (float)Math.Acos(lhs.X * rhs.X + lhs.Y * rhs.Y + lhs.Z * rhs.Z);
}

.method public hidebysig static float32 AngleBetween(PerformanceTests.Vector3& lhs, PerformanceTests.Vector3& rhs) cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.1
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.1
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.1
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Acos(float64)
    L_002f: conv.r4
    L_0030: ret
}

Calling this method produces the following inlined assembly:

007A01D9 8D55D4          lea        edx,[ebp-2Ch]  
007A01DC 8D4DC8          lea        ecx,[ebp-38h]  
007A01DF D902            fld        dword ptr [edx]  
007A01E1 D809            fmul       dword ptr [ecx]  
007A01E3 D94204          fld        dword ptr [edx+4]  
007A01E6 D84904          fmul       dword ptr [ecx+4]  
007A01E9 DEC1            faddp      st(1),st  
007A01EB D94208          fld        dword ptr [edx+8]  
007A01EE D84908          fmul       dword ptr [ecx+8]  
007A01F1 DEC1            faddp      st(1),st  
007A01F3 83EC08          sub        esp,8  
007A01F6 DD1C24          fstp       qword ptr [esp]  
007A01F9 E868A5AF79      call       7A29A766 (System.Math.Acos(Double), mdToken: 06000b28)  

Finally there is the issue of SIMD instruction sets. While the x86 JIT will not vectorize your code with SIMD instructions, it will use them for certain scalar operations. One common operation is the conversion of floating point numbers to integers: in .NET 2.0 the JIT will optimize this to use an SSE2 instruction. For instance, the following snippet of code results in the assembly dump that follows.

int n = (int)r.NextDouble();

002A02FB 8BCB            mov        ecx,ebx  
002A02FD 8B01            mov        eax,dword ptr [ecx]  
002A02FF FF5048          call       dword ptr [eax+48h]  
002A0302 DD5DA0          fstp       qword ptr [ebp-60h]  
002A0305 F20F1045A0      movsd      xmm0,mmword ptr [ebp-60h]  
002A030A F20F2CF0        cvttsd2si  esi,xmm0  

While not quite as efficient as it could be if the JIT were using the full SSE2 instruction set (note the round trip through memory at [ebp-60h] before the conversion), this minor optimization can go a long way.
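As a reminder of the semantics involved, a double-to-int cast in C# truncates toward zero, which is exactly what cvttsd2si does in hardware; the older x87 sequence had to save, modify, and restore the FPU rounding mode around a fistp to get the same behavior. A minimal sketch (the class name is hypothetical):

```csharp
using System;

static class CastExample   // hypothetical name, for illustration only
{
    public static void Demo()
    {
        // Each (double) -> (int) cast truncates toward zero; on the
        // .NET 2.0 x86 JIT this is a single cvttsd2si when SSE2 exists.
        Console.WriteLine((int)1.9);    // 1
        Console.WriteLine((int)-1.9);   // -1
        Console.WriteLine((int)2.999);  // 2
    }
}
```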

So what is left to visit? There's obviously the x64 platform, which is growing in popularity. The x64 platform presents new opportunities to explore, including certain guarantees and performance benefits that aren't available on x86, among them a whole new set of optimizations and instruction sets that the JIT can take advantage of. Finally there is the case of calling into unmanaged code for highly performance-intensive operations: hand-optimized SIMD code, and the potential performance benefits or hazards that calling an unmanaged function can incur.