SlimGen and You, Part ADD AL, [RAX] of N

When you use SlimGen to write your SSE replacement methods, one question comes up quickly: what kind of calling convention does the CLR use?

The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like: X('c', 2, 3.0f, "Hello"); becomes:

X('c', 2, 3.0f, "Hello");  
00000025  push        40400000h ; 3.0f  
0000002a  push        dword ptr ds:[03402088h] ;Address of "Hello"  
00000030  mov         edx,2  
00000035  mov         ecx,63h ;'c'  
0000003a  call        FFB8B040  
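
If it helps to see the rule as code, here is a rough model of it in C. This is only a sketch: `ArgKind`, `Slot` and `assign_x86` are names made up for illustration (they are not part of the CLR or SlimGen), and it assumes that only DWORD-or-smaller integer and reference arguments are eligible for ECX and EDX. It ignores structs, 64-bit values and varargs entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the x86 rule described above; not a CLR API. */
typedef enum { ARG_INT, ARG_FLOAT } ArgKind;   /* ARG_INT: DWORD-or-smaller integer or reference */
typedef enum { SLOT_ECX, SLOT_EDX, SLOT_STACK } Slot;

/* Walks the arguments left to right and assigns each one a location.
   Returns the number of stack arguments; push_order[i] holds the index
   of the i-th argument that gets pushed. */
size_t assign_x86(const ArgKind *args, size_t n, Slot *slots, size_t *push_order)
{
    size_t regs_used = 0, pushed = 0;
    assert(args && slots && push_order);
    for (size_t i = 0; i < n; ++i) {
        if (args[i] == ARG_INT && regs_used < 2) {
            /* First eligible argument takes ECX, the second takes EDX. */
            slots[i] = (regs_used++ == 0) ? SLOT_ECX : SLOT_EDX;
        } else {
            /* Everything else goes on the stack, pushed left to right. */
            slots[i] = SLOT_STACK;
            push_order[pushed++] = i;
        }
    }
    return pushed;
}
```

Feeding it the shape of the call above (int, int, float, reference) puts the first two arguments in ECX and EDX and pushes 3.0f before "Hello", which matches the listing.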

The situation is the same for instance methods, except that the this pointer is passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

p.Y(2, 3.0f);  
0000006d  push        40400000h  ; 3.0f  
00000072  mov         ecx,dword ptr [ebp-40h] ;this  
00000075  mov         edx,2  
0000007c  call        FFA1B048  

This all seems clear enough, but the differences matter, especially when you're poking around in the low-level bowels of the CLR or doing what SlimGen does: replacing actual method bodies.

That raises the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers), unless a parameter is a floating point type, in which case it is passed in the XMM register for the same position (XMM0 through XMM3). That is why 3.0f, the third argument in the call below, ends up in XMM2.

Z('c', 2, 3.0f, "Hello", 1.0, pa);  
000000c0  mov         r9,124D3100h  
000000ca  mov         r9,qword ptr [r9] ; "Hello"  
000000cd  mov         rax,qword ptr [rsp+38h] ;pa (IntPtr[])  
000000d2  mov         qword ptr [rsp+28h],rax ;pa - sixth argument, passed on the stack  
000000d7  movsd       xmm0,mmword ptr [00000118h] ;1.0  
000000df  movsd       mmword ptr [rsp+20h],xmm0 ;1.0 - fifth argument, passed on the stack  
000000e5  movss       xmm2,dword ptr [00000110h] ;3.0f  
000000ed  mov         edx,2 ;int (2)  
000000f2  mov         cx,63h ;'c'  
000000f6  call        FFFFFFFFFFEC9300  

Whew, that looks pretty nasty, doesn't it? But notice that the first four parameters are all passed in registers. The two stores to [rsp+20h] and [rsp+28h] carry the fifth and sixth parameters, which go on the stack. They sit above the 32 bytes at [rsp] through [rsp+1Fh], which the calling convention has the caller reserve so that the callee can spill the four register parameters into memory (and read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first.

p.Q(~0L, ~1L, ~2L, ~3);  
0000010a  mov         rcx,qword ptr [rsp+30h] ; this pointer  
0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ;~3L, fifth argument, passed on the stack  
00000118  mov         r9,0FFFFFFFFFFFFFFFDh ;~2L  
0000011f  mov         r8,0FFFFFFFFFFFFFFFEh ;~1L  
00000126  mov         rdx,0FFFFFFFFFFFFFFFFh ;~0L  
0000012d  call        FFFFFFFFFFEC9310  
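
Both x64 listings follow the same positional rule, and it is small enough to sketch in C. As before, this is an illustration and not anything the CLR exposes: `Loc`, `x64_location` and `x64_stack_offset` are hypothetical names, the index counts this as argument 0 for instance methods, and the sketch only covers arguments that fit in a single register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the x64 rule described above; not a CLR API. */
typedef enum { ARG_INT, ARG_FLOAT } ArgKind;
typedef enum {
    LOC_RCX, LOC_RDX, LOC_R8, LOC_R9,
    LOC_XMM0, LOC_XMM1, LOC_XMM2, LOC_XMM3,
    LOC_STACK
} Loc;

/* The register is picked by position: argument 2 uses R8 if it is an
   integer or XMM2 if it is floating point. Arguments 4 and up go on the stack. */
Loc x64_location(ArgKind kind, size_t index)
{
    assert(kind == ARG_INT || kind == ARG_FLOAT);
    if (index >= 4)
        return LOC_STACK;
    return (Loc)((kind == ARG_FLOAT ? LOC_XMM0 : LOC_RCX) + index);
}

/* Offset from RSP at the call site for a stack argument. The first
   32 bytes are reserved for the four register arguments, so argument 4
   lands at rsp+20h and argument 5 at rsp+28h. */
size_t x64_stack_offset(size_t index)
{
    assert(index >= 4);
    return 8 * index;
}
```

For the static call, 3.0f is argument 2 and maps to XMM2, while 1.0 and pa are arguments 4 and 5 and map to rsp+20h and rsp+28h. For the instance call, this takes slot 0, so the fourth long becomes argument 4 and lands at rsp+20h.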

Passing something larger than a register can hold does pose an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing the address of that copy (hence call by value):

var v = new Vector();  
p.R(v);  
00000169  lea         rcx,[rsp+40h]  
0000016e  mov         rax,qword ptr [rcx]  
00000171  mov         qword ptr [rsp+50h],rax  
00000176  mov         rax,qword ptr [rcx+8]  
0000017a  mov         qword ptr [rsp+58h],rax  
0000017f  lea         rdx,[rsp+50h]  
00000184  mov         rcx,r8  
00000187  call        FFFFFFFFFFEC9318  

As you can see, it copies the vector's 16 bytes into a temporary on the stack, puts the address of that copy in RDX and the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred way (for fast code) to move non-trivial structures around.
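
The same trade-off is easy to see in C. The sketch below assumes a 16-byte `Vector` with two double fields, chosen only because it matches the two 8-byte moves in the listing; the real Vector layout and both function names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical 16-byte struct standing in for Vector. */
typedef struct { double x, y; } Vector;

/* By value: the callee gets a private copy, so the caller's
   vector is untouched (and the copy has to be made first). */
double scale_by_value(Vector v, double s)
{
    v.x *= s;
    v.y *= s;
    return v.x + v.y;
}

/* By reference: only an address is passed, nothing is copied,
   and the callee works directly on the caller's storage. */
double scale_by_ref(Vector *v, double s)
{
    assert(v);
    v->x *= s;
    v->y *= s;
    return v->x + v->y;
}
```

The by-value version leaves the caller's vector as it was, while the by-reference version modifies it in place and skips the copy.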

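All of this feeds into our matrix multiplication method (which assumes the output is not one of the inputs). The two ref parameters arrive in ECX and EDX, the out parameter is on the stack at [esp + 4], and the method pops it on return with ret 4: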

BITS        32  
ORG         0x59f0  
;           void Multiply(ref Matrix, ref Matrix, out Matrix)
start:      mov     eax, [esp + 4]  
            movups  xmm4, [edx]
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]

            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [eax], xmm0 ; Calculate row 0 of new matrix

            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [eax + 0x10], xmm0 ; Calculate row 1 of new matrix

            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [eax + 0x20], xmm0 ; Calculate row 2 of new matrix

            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF

            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1

            movups  [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
            ret     4
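
When you replace a method body like this, it is handy to have a plain scalar version to check the results against. Here is a minimal reference sketch in C. It assumes Matrix is 16 consecutive floats in row-major order (which is what the 0x10 row stride in the listing suggests) and that the first argument is the left-hand matrix; `Matrix` and `multiply_ref` are illustrative names, not SlimGen's.

```c
#include <assert.h>

/* Assumed layout: 4 rows of 4 floats, row-major, 64 bytes total. */
typedef struct { float m[4][4]; } Matrix;

/* Scalar reference for out = a * b. Like the SSE version, it
   requires that out is not one of the inputs. */
void multiply_ref(const Matrix *a, const Matrix *b, Matrix *out)
{
    assert(out != a && out != b);
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            /* Each element of a's row scales the matching row of b. */
            float t0 = a->m[row][0] * b->m[0][col];
            float t1 = a->m[row][1] * b->m[1][col];
            float t2 = a->m[row][2] * b->m[2][col];
            float t3 = a->m[row][3] * b->m[3][col];
            /* Same pairing as the addps sequence: (0 + 2) + (1 + 3). */
            out->m[row][col] = (t0 + t2) + (t1 + t3);
        }
    }
}
```

The additions are grouped the same way as in the assembly so that the two versions round the same way, although a compiler that fuses multiplies and adds could still produce slightly different results for non-trivial inputs.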