ONA Bag

So, I spoiled myself recently and got myself The Union Street ONA bag.

So far I’m liking it; we’ll see how long that lasts. Below you can see some shots of the bag, along with it packed with the camera, lenses, laptop, and tripod.

A Simple C++ Quiz

Recently some people have been pestering me to post back up my C++ quizzes. So…without further ado here is the first one. The answers will be posted later.

  1. Given the following three lines of code, answer these questions
    1. int* p = new int[10];
    2. int* j = p + 11;
    3. int* k = p + 10;

    1. Is the second line well defined behavior?
    2. If the second line is well defined, where does the pointer point to?
    3. What are some of the legal operations that can be performed on the third pointer?
  2. What output should the following lines of code produce?
    1. int a = 10;
    2. std::cout<<a<<a++<<--a;
  3. Assuming the function called in the following block of code has no default parameters, and that no operators are overloaded, how many parameters does it take? Which objects are passed to it?
    1. f((a, b, c), d, e, ((g, h), i));
  4. Assuming the function called in the following block of code takes an A* and a B*, what is potentially wrong with the code?
    1. f(new A(), new B());

C++ Quiz #2

This is a test of your knowledge of C++, not of your compiler’s knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.

1. Using the code below as a reference, explain what behavior should be expected of each of the commented lines. Please keep your answers very short.

struct Base {
	virtual void Arr();
};
struct SubBase1 : virtual Base {
};
struct SubBase2 : virtual Base {
	SubBase2(Base*, SubBase1*);
	virtual void Arr();
};
struct Derived : SubBase1, SubBase2 {
	Derived() : SubBase2((SubBase1*)this, this) {
	}
};

SubBase2::SubBase2(Base* a, SubBase1* b) {
	typeid(*a);                 //1
	dynamic_cast(a); //2
	typeid(*b);                 //3
	dynamic_cast(b); //4
	a->Arr();                   //5
	b->Arr();                   //6
}

2. Using the code below as a reference, explain what behavior should be expected of each of the commented lines.

template class X {
	X* p;  //1
	X a;  //2
};

C++ Quiz #3

This is a test of your knowledge of C++, not of your compiler’s knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.

Given the following code:

class Base {
public:
	virtual ~Base() {}
	virtual void DoSomething() {}	
	void Mutate();
};

class Derived : public Base {
public:	
	virtual void DoSomething() {}
};

void Base::Mutate() {	
	new (this) Derived; // 1
}
void f() {	
	void* v = ::operator new(sizeof(Base) + sizeof(Derived));
	Base* p = new (v) Base();	
	p->DoSomething(); // 2	
	p->Mutate();      // 3	
	void* vp = p;     // 4	
	p->DoSomething(); // 5
}

1. Does the first numbered line result in defined behavior? (Yes/No)
2. What should the first numbered line do?
3. Do the second and third numbered lines produce defined behavior? (Yes/No)
4. Does the fourth numbered line produce defined behavior? If so, why? If not, why?
5. Does the fifth numbered line produce defined behavior? If so, why? If not, why?

6. What is the behavior of calling void exit(int);?

Given the following code:

struct T{};
struct B {
	~B();
};

void h() {	
	B b;	
	new (&b) T; // 1	
	return;     // 2
}

7. Does the first numbered line result in defined behavior?
8. Is the behavior of the second line defined? If so, why? If not, why is the behavior not defined?

9. What is the behavior of int& p = *(int*)0;? Why does it have that behavior? Is this a null reference?

10. What is the behavior of p->I::~I(); if I is defined as typedef int I; and p is defined as I* p;?

C++ Quiz #4

This is a test of your knowledge of C++, not of your compiler’s knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.

1. What is the value of i after the first numbered line is evaluated?
2. What do you expect the second numbered line to print out?
3. What is the value of p->c after the third numbered line is evaluated?
4. What does the fourth numbered line print?

struct C;

void f(C* p);

struct C {
	int c;
	C() : c(1) {
		f(this);
	}
};

const C obj;
void f(C* p) {
	int i = obj.c << 2;             //1
	std::cout<< p->c <<std::endl;   //2
	p->c = i;                       //3
	std::cout<< obj.c << std::endl; //4
}

5. What should you expect the compiler to do on the first numbered line? Why?

6. What should you expect the value of j to be after the second numbered line is evaluated? Why?

struct X {
	operator int() {
		return 314159;
	}
};

struct Y {
	operator X() {
		return X();
	}
};

Y y;
int i = y;     //1
int j = X(y);  //2

7. What should you expect the compiler to do on the first and second numbered lines? Why?

struct Z {
	Z() {}
	explicit Z(int) {}
};

Z z1 = 1;                 //1
Z z2 = static_cast<Z>(1); //2

8. What should you expect the behavior of each of the numbered lines, irrespective of the other lines, to be?

struct Base {
	virtual ~Base() {}
};

struct Derived : Base {	
	~Derived() {}
};

typedef Base Base2;
Derived d;
Base* p = &d;
void f() {	
	d.Base::~Base();    //1
	p->~Base();         //2	
	p->~Base2();        //3	
	p->Base2::~Base();  //4	
	p->Base2::~Base2(); //5
}

Building a SlimDX MiniTriangle sample with Direct3D11 and IronPython

I generally don’t post huge code dumps, mainly because I find them more annoying and less helpful than some books/authors might. But you know, I’ve been playing with IronPython/SlimDX recently and decided to do up another SlimDX Sample (demonstrating DX11), except in IronPython this time. This will be in the SlimDX samples sometime soon!

import clr
clr.AddReference('System.Windows.Forms')
clr.AddReference('System.Drawing')
clr.AddReference('SlimDX')

from System import *
from System.Drawing import Size
from System.Windows.Forms import Form, Application, MessageBox, FormBorderStyle
from SlimDX import *
from SlimDX.Direct3D11 import *
from SlimDX.DXGI import SwapChainDescription, SwapChainFlags, ModeDescription, SampleDescription, Usage, SwapEffect, Format, PresentFlags, Factory, WindowAssociationFlags
from SlimDX.D3DCompiler import *
from SlimDX.Windows import MessagePump

class GameObject:
    def Render(self):
        pass
    def Tick(self):
        pass
    
class GraphicsDevice(IDisposable):
    Context = property(lambda self: self.context)
    Device = property(lambda self: self.device)
    SwapChain = property(lambda self: self.swapChain)
    
    def __init__(self, control, fullscreen):
        self.fullscreen = fullscreen
        self.control = control
        
        control.Resize += lambda sender, args: self.Resize()
        
        swapChainDesc = self.CreateSwapChainDescription()
        success,self.device,self.swapChain = Device.CreateWithSwapChain(DriverType.Hardware, DeviceCreationFlags.None, Array[FeatureLevel]([FeatureLevel.Level_11_0, FeatureLevel.Level_10_1, FeatureLevel.Level_10_0]), swapChainDesc)
        self.context = self.Device.ImmediateContext
        
        with self.swapChain.GetParent[Factory]() as factory:
            factory.SetWindowAssociation(self.control.Handle, WindowAssociationFlags.IgnoreAll)
            
        with Resource.FromSwapChain[Texture2D](self.swapChain, 0) as backBuffer:
            self.backBufferRTV = RenderTargetView(self.Device, backBuffer)

        self.Resize()        
        
    def CreateSwapChainDescription(self):
        swapChainDesc = SwapChainDescription()
        swapChainDesc.IsWindowed = not self.fullscreen
        swapChainDesc.BufferCount = 1
        swapChainDesc.ModeDescription = ModeDescription(self.control.ClientSize.Width, self.control.ClientSize.Height, Rational(60, 1), Format.R8G8B8A8_UNorm)
        swapChainDesc.Flags = SwapChainFlags.None
        swapChainDesc.SwapEffect = SwapEffect.Discard
        swapChainDesc.Usage = Usage.RenderTargetOutput
        swapChainDesc.SampleDescription = SampleDescription(1, 0)
        swapChainDesc.OutputHandle = self.control.Handle
        return swapChainDesc

    def Resize(self):
        self.Context.ClearState()
        self.backBufferRTV.Dispose()
        self.swapChain.ResizeBuffers(1, self.control.ClientSize.Width, self.control.ClientSize.Height, Format.R8G8B8A8_UNorm, SwapChainFlags.None)
        with Resource.FromSwapChain[Texture2D](self.swapChain, 0) as backBuffer:
            self.backBufferRTV = RenderTargetView(self.Device, backBuffer)
        self.Context.Rasterizer.SetViewports(Viewport(0, 0, self.control.ClientSize.Width, self.control.ClientSize.Height, 0.0, 1.0))

    def BeginRender(self):
        self.Context.ClearRenderTargetView(self.backBufferRTV, Color4(0, 0, 0, 0))
        self.Context.OutputMerger.SetTargets(self.backBufferRTV)
        
        
    def EndRender(self):
        self.swapChain.Present(0, PresentFlags.None)

    def Dispose(self):
        self.backBufferRTV.Dispose()
        self.swapChain.Dispose()
        self.device.Dispose()


class TriangleObject(GameObject):
    def __init__(self, game):
        self.game = game
        device = game.GraphicsDevice.Device
        context = game.GraphicsDevice.Context
        
        err = clr.Reference[str]()
        with ShaderBytecode.CompileFromFile("SimpleTriangle10.fx", "fx_5_0", ShaderFlags.None, EffectFlags.None, None, None, err) as shaderByteCode:
            self.effect = Effect(device, shaderByteCode)

        shaderTechnique = self.effect.GetTechniqueByIndex(0)
        self.shaderPass = shaderTechnique.GetPassByIndex(0)

        sig = self.shaderPass.Description.Signature
        self.inputLayout = InputLayout(device, sig, Array[InputElement]([InputElement("POSITION", 0, Format.R32G32B32A32_Float, 0, 0), InputElement("COLOR", 0, Format.R32G32B32A32_Float, 16, 0)]))
        
        bufferDesc = BufferDescription(3 * 32, ResourceUsage.Dynamic, BindFlags.VertexBuffer, CpuAccessFlags.Write, ResourceOptionFlags.None, 0)
        self.vertexBuffer = Buffer(device, bufferDesc)
        
        stream = context.MapSubresource(self.vertexBuffer, 0, 3 * 32, MapMode.WriteDiscard, MapFlags.None).Data
        data = Array[Vector4]([
            Vector4(0.0, 0.5, 0.5, 1.0), Vector4(1.0, 0.0, 0.0, 1.0),
            Vector4(0.5, -0.5, 0.5, 1.0), Vector4(0.0, 1.0, 0.0, 1.0),
            Vector4(-0.5, -0.5, 0.5, 1.0), Vector4(0.0, 0.0, 1.0, 1.0)
        ])
        stream.WriteRange(data)
        context.UnmapSubresource(self.vertexBuffer, 0)

    def Render(self):
        context = self.game.GraphicsDevice.Context
        context.InputAssembler.InputLayout = self.inputLayout
        context.InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList
        context.InputAssembler.SetVertexBuffers(0, VertexBufferBinding(self.vertexBuffer, 32, 0))
        self.shaderPass.Apply(context)
        context.Draw(3, 0)
        
    def Dispose(self):
        self.effect.Dispose()
        self.inputLayout.Dispose()
        self.vertexBuffer.Dispose()
        
class Game(IDisposable):
    GraphicsDevice = property(lambda self: self.graphicsDevice)
    
    def __init__(self, width, height, fullscreen = False):
        self.fullscreen = fullscreen
        self.form = GameForm(width, height, fullscreen)
        self.form.Visible = True
        self.graphicsDevice = GraphicsDevice(self.form, self.fullscreen)
        self.gameObjects = [TriangleObject(self)]
        
    def Run(self):
        Application.Idle += self.OnIdle
        Application.Run(self.form)

    def OnIdle(self, sender, ea):
        while MessagePump.IsApplicationIdle:
            self.Update()
            self.Render()
    
    def Update(self):
        for i in self.gameObjects:
            i.Tick()
            
    def Render(self):
        self.GraphicsDevice.BeginRender()
        for i in self.gameObjects:
            i.Render()
        self.GraphicsDevice.EndRender()
            
    def Dispose(self):
        self.GraphicsDevice.Dispose()
        for i in self.gameObjects:
            if 'Dispose' in dir(i):
                i.Dispose()
        self.form.Dispose(True)

class GameForm(Form):
    def __init__(self, width, height, fullscreen):
        self.ClientSize = Size(width, height)
        if fullscreen:
            self.FormBorderStyle = FormBorderStyle.None

if __name__ == "__main__":
    try:
        with Game(640, 480) as game:
            game.Run()
    except Exception as e:
        MessageBox.Show(e.ToString())

Is It Really A Bug For A Beginner To Be Using C-Strings In C++?

Depends, but probably yes.

A beginning programmer should be focusing on learning to program. That is: the process of taking a concept and turning it into an application. Problem solving, in other words. Learning to program is not the same thing as learning a programming language. Learning a programming language is about learning the syntax and standard library that come with that language. It may involve problem solving, but that is not its primary concern.

Given that, one can quickly see that the best way to introduce a beginning programmer to programming is to have them use a language that is quick and easy to get up and running in. There are many such languages, and they all tend to share one trait: a lack of verbosity. Python and Ruby are two prime examples. Both have a very simple syntax that allows a lot of leeway for the programmer, without all the extra clutter that many other languages have (C++ *cough*). Another good choice, in my opinion, is C#, which, when combined with Microsoft Visual C#, provides a very robust but easy to learn language. These languages share several key features that make them easy to learn and use: they are generally garbage collected, they have fairly simple syntax with few (if any) corner cases, and they have huge standard libraries that provide a great deal of quick and easy to use functionality with minimal programmer effort.

C++ has almost none of those things. While there is Visual Studio for it, the IntelliSense is still not perfect, even with the help of tools like WholeTomato’s VAX. The standard library is quite small, dealing mainly with file and console IO and some minimal containers. It leaves the rest of the work up to the developer. This means that for any sufficiently complex project you will end up either implementing a majority of the needed behaviors yourself or digging up third party libraries and APIs for them. Even the recent C++0x work hasn’t really alleviated the problem. Then you have the language complexity, which I’ve commented on previously.

However, the C++ standard library does provide some features that should be in every developer’s pocketbook, such as std::string. std::string behaves a lot more like what a beginning programmer expects of a primitive type. They’ve learned that you can add integers and floats together, so why can’t they add strings together? Well, with std::string they can, but with c-strings they can’t. They’ve learned to compare integers and floats using the standard == operator, so why can’t they do that with strings? With std::string they can, but with c-strings they can’t (well, they “can”, but the behavior is not what they want). They’ve learned how to read in integers and floats from std::cin, so why can’t they do the same with strings? They can with std::string, but with c-strings they have to be careful of the length and make sure they’ve pre-allocated the buffer, which has hazards of its own, such as stack space issues when they try to create a char array of 5000 characters.

C-strings do not behave intuitively. They have no inherent length, instead relying on a null terminator to mark the end of the string. They cannot be trivially concatenated: the user has to ensure that enough space is available, then use various function calls to copy the strings, and then ensure that those functions had room to copy the null terminator (which strncpy and other functions MAY omit if there isn’t enough space in the destination). Comparison requires the use of functions like strcmp, which doesn’t return true/false, but instead returns an integer indicating how the strings differ, with 0 meaning no difference. In a language where the user has been taught that 0/null generally means failure, remembering to test for 0 in that one-off corner case is rather strange.

For a beginner, all that strangeness doesn’t equate to extra power or better performance. Instead it equates to extra confusion and strange crashes. Had they been taught std::string first, they would have been free and clear, able to use the familiar operators they are used to, while being safe and secure in the bosom that is std::string. In fact, it generally gets worse than that, as c-strings are usually taught before pointers! This makes it even more confusing for the poor beginner, because then they’re introduced to arrays and pointers (instead of, say, std::vector), and now have a whole slew of new functionality to basically kill themselves with.

Thus, in conclusion, if you see a c-string in a beginner’s code, it probably means they have a bug somewhere in their code.

SlimDX Direct3D10 X Loader

Here’s a useful class for loading X files (using SlimDX) into a Direct3D10 Mesh object. This is based off of Jack Hoxley’s C++ code from his journal post on GameDev.Net.

A few things to note about it: it doesn’t handle multiple materials (or materials at all). To handle that, you would need to optimize the D3D9 mesh in place, then harvest the EffectInstances and also the materials. That way you could load the appropriate textures and bind them while rendering the appropriate attributes. For simple X meshes this isn’t an issue, but some (like the Airplane model that comes with the DirectX SDK) have multiple textures.

using System;
using System.Windows.Forms;
using DXGI = SlimDX.DXGI;
using D3D9 = SlimDX.Direct3D9;
using SlimDX.Direct3D10;

namespace XMeshLoader {
    class XLoader : IDisposable {
        public XLoader() {
            CreateNullDevice();
        }

        #region IDisposable
        ~XLoader() {
            Dispose(false);
        }

        public void Dispose() {
            Dispose(true);
        }

        private void Dispose(bool disposeManagedObjects) {
            if (disposeManagedObjects) {
                device9.Dispose();
                form.Dispose();
            }
        }
        #endregion

        public Mesh CreateMesh(Device device, D3D9.Mesh mesh9, out InputElement[] outDecls) {
            var inDecls = mesh9.GetDeclaration();
            outDecls = new InputElement[inDecls.Length - 1];
            ConvertDecleration(inDecls, outDecls);

            var flags = MeshFlags.None;
            if ((mesh9.CreationOptions & D3D9.MeshFlags.Use32Bit) != 0)
                flags = MeshFlags.Has32BitIndices;

            var mesh = new Mesh(device, outDecls, D3D9.DeclarationUsage.Position.ToString().ToUpper(), mesh9.VertexCount, mesh9.FaceCount, flags);

            ConvertIndexBuffer(mesh9, mesh);
            ConvertVertexBuffer(mesh9, mesh);
            ConfigureAttributeTable(mesh9, mesh);

            mesh.GenerateAdjacencyAndPointRepresentation(0);
            mesh.Optimize(MeshOptimizeFlags.Compact | MeshOptimizeFlags.AttributeSort | MeshOptimizeFlags.VertexCache);

            mesh.Commit();
            return mesh;
        }

        public Mesh LoadFile(Device device, string filename, out InputElement[] outDecls) {
            using (var mesh9 = D3D9.Mesh.FromFile(device9, filename, D3D9.MeshFlags.SystemMemory)) {
                return CreateMesh(device, mesh9, out outDecls);
            }
        }

        #region Implementation Details
        private static void ConfigureAttributeTable(D3D9.BaseMesh inMesh, Mesh outMesh) {
            var inAttribTable = inMesh.GetAttributeTable();

            if (inAttribTable == null || inAttribTable.Length == 0) {
                outMesh.SetAttributeTable(new[] {new MeshAttributeRange {
                    FaceCount = outMesh.FaceCount,
                    FaceStart = 0,
                    Id = 0,
                    VertexCount = outMesh.VertexCount,
                    VertexStart = 0
                }});
            } else {
                var outAttribTable = new MeshAttributeRange[inAttribTable.Length];
                for (var i = 0; i < inAttribTable.Length; ++i) {
                    outAttribTable[i].Id = inAttribTable[i].AttribId;
                    outAttribTable[i].FaceCount = inAttribTable[i].FaceCount;
                    outAttribTable[i].FaceStart = inAttribTable[i].FaceStart;
                    outAttribTable[i].VertexCount = inAttribTable[i].VertexCount;
                    outAttribTable[i].VertexStart = inAttribTable[i].VertexStart;
                }
                outMesh.SetAttributeTable(outAttribTable);
            }

            outMesh.GenerateAttributeBufferFromTable();
        }

        private static void ConvertIndexBuffer(D3D9.BaseMesh inMesh, Mesh outMesh) {
            using (var inStream = inMesh.LockIndexBuffer(D3D9.LockFlags.None))
            using (var outBuffer = outMesh.GetIndexBuffer()) {
                using (var outStream = outBuffer.Map()) {
                    if ((outMesh.Flags & MeshFlags.Has32BitIndices) != 0)
                        outStream.WriteRange(inStream.ReadRange<int>(inMesh.FaceCount * 3));   // 32-bit indices
                    else
                        outStream.WriteRange(inStream.ReadRange<short>(inMesh.FaceCount * 3)); // 16-bit indices
                }
                outBuffer.Unmap();
            }
            inMesh.UnlockIndexBuffer();
        }

        private static void ConvertVertexBuffer(D3D9.BaseMesh inMesh, Mesh outMesh) {
            using (var inStream = inMesh.LockVertexBuffer(D3D9.LockFlags.None))
            using (var outBuffer = outMesh.GetVertexBuffer(0)) {
                using (var outStream = outBuffer.Map()) {
                    outStream.WriteRange(inStream.ReadRange<byte>(inMesh.VertexCount * inMesh.BytesPerVertex));
                }
                outBuffer.Unmap();
            }
            inMesh.UnlockVertexBuffer();
        }

        private static void ConvertDecleration(D3D9.VertexElement[] inDecls, InputElement[] outDecls) {
            for (var i = 0; i < inDecls.Length - 1; ++i) {
                outDecls[i].SemanticName = ConvertSemanticName(inDecls[i].Usage);
                outDecls[i].SemanticIndex = inDecls[i].UsageIndex;
                outDecls[i].AlignedByteOffset = inDecls[i].Offset;
                outDecls[i].Slot = inDecls[i].Stream;
                outDecls[i].Classification = InputClassification.PerVertexData;
                outDecls[i].InstanceDataStepRate = 0;
                outDecls[i].Format = ConvertFormat(inDecls[i].Type);
            }
        }

        private static string ConvertSemanticName(D3D9.DeclarationUsage usage) {
            switch (usage) {
                case D3D9.DeclarationUsage.TextureCoordinate:
                    return "TEXCOORD";
                case D3D9.DeclarationUsage.PositionTransformed:
                    return "POSITIONT";
                case D3D9.DeclarationUsage.TessellateFactor:
                    return "TESSFACTOR";
                case D3D9.DeclarationUsage.PointSize:
                    return "PSIZE";
                default:
                    return usage.ToString().ToUpper();
            }
        }

        private static DXGI.Format ConvertFormat(D3D9.DeclarationType type) {
            switch (type) {
                case D3D9.DeclarationType.Float1: return DXGI.Format.R32_Float;
                case D3D9.DeclarationType.Float2: return DXGI.Format.R32G32_Float;
                case D3D9.DeclarationType.Float3: return DXGI.Format.R32G32B32_Float;
                case D3D9.DeclarationType.Float4: return DXGI.Format.R32G32B32A32_Float;
                case D3D9.DeclarationType.Color: return DXGI.Format.R8G8B8A8_UNorm;
                case D3D9.DeclarationType.Ubyte4: return DXGI.Format.R8G8B8A8_UInt;
                case D3D9.DeclarationType.Short2: return DXGI.Format.R16G16_SInt;
                case D3D9.DeclarationType.Short4: return DXGI.Format.R16G16B16A16_SInt;
                case D3D9.DeclarationType.UByte4N: return DXGI.Format.R8G8B8A8_UNorm;
                case D3D9.DeclarationType.Short2N: return DXGI.Format.R16G16_SNorm;
                case D3D9.DeclarationType.Short4N: return DXGI.Format.R16G16B16A16_SNorm;
                case D3D9.DeclarationType.UShort2N: return DXGI.Format.R16G16_UNorm;
                case D3D9.DeclarationType.UShort4N: return DXGI.Format.R16G16B16A16_UNorm;
                case D3D9.DeclarationType.UDec3: return DXGI.Format.R10G10B10A2_UInt;
                case D3D9.DeclarationType.Dec3N: return DXGI.Format.R10G10B10A2_UNorm;
                case D3D9.DeclarationType.HalfTwo: return DXGI.Format.R16G16_Float;
                case D3D9.DeclarationType.HalfFour: return DXGI.Format.R16G16B16A16_Float;
                default: return DXGI.Format.Unknown;
            }
        }

        private void CreateNullDevice() {
            form = new Form();
            using (var direct3D = new D3D9.Direct3D())
                device9 = new D3D9.Device(direct3D, 0, D3D9.DeviceType.NullReference, form.Handle, D3D9.CreateFlags.HardwareVertexProcessing, new D3D9.PresentParameters {
                    BackBufferCount = 1,
                    BackBufferFormat = D3D9.Format.A8R8G8B8,
                    BackBufferHeight = 1,
                    BackBufferWidth = 1,
                    SwapEffect = D3D9.SwapEffect.Copy,
                    Windowed = true
                });
        }

        private Form form;
        private D3D9.Device device9;
        #endregion
    }
}

SlimGen and You, Part ADD EAX, [EAX] of N

So far I’ve covered how SlimGen works and the difficulties in doing what it does, including calling convention issues that one must be made aware of when writing replacement methods for use with SlimGen.

So the next question arises: just how much of a difference can using SlimGen make? Well, a lot of that will depend on the developer and their skill level. But we were also pretty curious about this, so we slapped together a test sample that runs through a series of matrix multiplications and times them. It uses three arrays to perform the multiplications: two of the arrays contain 100,000 randomly generated matrices, and the third is used as the destination for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.

The timing results will vary, of course, from machine to machine depending on the processor, how much RAM you have, and what else you’re doing at the time. Running the test on my Phenom 9850 I get:

Total Matrix Count Per Run:  100,000 
Multiply        Total Ticks: 2,001,059 
SlimGenMultiply Total Ticks: 1,269,200 
Improvement:                 36.57 % 

While when I run it against my T8300 Core2 Duo laptop I get:

Total Matrix Count Per Run:  100,000
Multiply        Total Ticks: 2,175,380
SlimGenMultiply Total Ticks: 1,621,830
Improvement:                 25.45 %

Still, 25-35% improvement over the FPU based multiply is quite significant. Since X64 support hasn’t been fully hammered out (in that it “works” but hasn’t been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.

So why the great difference in performance? Well, part of it is the method size. The .Net method is 566 bytes of pure code. That’s over half a kilobyte of code that has to be brought into the instruction cache on the CPU and executed. Meanwhile the SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there are the instructions: SSE2 has been around for a while, so CPU manufacturers have had plenty of time to tune it for optimal performance. There’s also the memory hit issue: except for a few cases, the SSE2 based code hits memory a minimal number of times, which reduces the chance of cache misses after the first read/write.

Finally there’s how it deals with storage of the temporary results. The .Net FPU based version allocates a Matrix type on the stack, calls the constructor (which 0 initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version however doesn’t bother with initializing the stack and only stores three of the results on the stack, opting to write out the final result directly to the destination. The three other rows are then moved back into XMM registers and then back out to the destination.

The SSE2 source code comes first, followed by the .Net source code. Note that both are functionally equivalent:

start:      mov     eax, [esp + 4]
            movups  xmm4, [edx]
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]
            
            movups  xmm0, [ecx]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [esp - 0x20], xmm0 ; store row 0 of new matrix
            
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [esp - 0x30], xmm0 ; store row 1 of new matrix
            
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [esp - 0x40], xmm0 ; store row 2 of new matrix
            
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix
            movups  xmm0, [esp - 0x40]
            movups  [eax + 0x20], xmm0
            movups  xmm0, [esp - 0x30]
            movups  [eax + 0x10], xmm0
            movups  xmm0, [esp - 0x20]
            movups  [eax], xmm0
            ret     4

The .NET matrix multiplication source code:

public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {
    Matrix r;
    r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
    r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
    r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
    r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
    r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
    r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
    r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
    r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
    r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
    r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
    r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
    r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
    r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
    r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
    r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
    r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
    result = r;
}

SlimGen and You, Part ADD AL, [RAX] of N

When you use SlimGen and write your own SSE replacement methods, one question does come up: what calling convention does the CLR use?

The CLR uses a variant of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes:

X('c', 2, 3.0f, "Hello");
00000025  push        40400000h ; 3.0f
0000002a  push        dword ptr ds:[03402088h] ;Address of "Hello"
00000030  mov         edx,2 
00000035  mov         ecx,63h ;'c'
0000003a  call        FFB8B040

Member functions follow the same rules, except that the this pointer is passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

p.Y(2, 3.0f);
0000006d  push        40400000h  ; 3.0f
00000072  mov         ecx,dword ptr [ebp-40h] ;this
00000075  mov         edx,2
0000007c  call        FFA1B048 

So this all seems clear enough, but it’s important to note these differences, especially when you’re poking around in the low-level bowels of the CLR or when you’re doing what SlimGen does: replacing actual method bodies.
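
To make the x86 rules above concrete, here is a minimal C++ sketch that models where each parameter ends up. The names Kind and placeX86 are made up for illustration. The model assumes that floating point values never occupy ECX or EDX, which is what the listings above show for 3.0f. Treat it as a toy model of the convention, not something the CLR exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical classification of a parameter; this is a model, not a CLR API.
enum class Kind { Integer, Reference, Float };

// Returns where each parameter goes, in declaration order. For an instance
// method, list the this pointer first as a Reference.
// Assumption: only integer-like values of DWORD size or smaller (which
// includes references) may use ECX/EDX; everything else goes on the stack.
std::vector<std::string> placeX86(const std::vector<Kind>& params) {
    static const char* const regs[] = {"ECX", "EDX"};
    std::size_t nextReg = 0;
    int pushCount = 0;
    std::vector<std::string> where;
    for (Kind k : params) {
        if (k != Kind::Float && nextReg < 2) {
            where.push_back(regs[nextReg++]);
        } else {
            // Stack parameters are pushed in left-to-right order.
            where.push_back("push #" + std::to_string(++pushCount));
        }
    }
    assert(where.size() == params.size());
    return where;
}
```

Feeding it X('c', 2, 3.0f, "Hello") as Integer, Integer, Float, Reference gives ECX, EDX, push #1, push #2, which lines up with the first listing: 3.0f is pushed before the address of "Hello".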

So this does raise the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers), unless those parameters are floating point types, in which case they are passed in XMM registers.

Z('c', 2, 3.0f, "Hello", 1.0, pa);
000000c0  mov         r9,124D3100h 
000000ca  mov         r9,qword ptr [r9] ; "Hello"
000000cd  mov         rax,qword ptr [rsp+38h] ;pa (IntPtr[])
000000d2  mov         qword ptr [rsp+28h],rax ;pa - 6th parameter, passed on the stack
000000d7  movsd       xmm0,mmword ptr [00000118h] ;1.0
000000df  movsd       mmword ptr [rsp+20h],xmm0 ;1.0 - 5th parameter, passed on the stack
000000e5  movss       xmm2,dword ptr [00000110h] ;3.0f
000000ed  mov         edx,2 ;int (2)
000000f2  mov         cx,63h ;'c' 
000000f6  call        FFFFFFFFFFEC9300

Whew, that looks pretty nasty, doesn’t it? But if you look closely, the first four parameters are each passed in a register: 'c' in CX, 2 in EDX, 3.0f in XMM2 and "Hello" in R9. The register is picked by the parameter’s position, which is why 3.0f, the third parameter, lands in XMM2 rather than XMM0. The remaining two parameters, 1.0 and pa, are written to the stack at [rsp+20h] and [rsp+28h], just above the 32 bytes that the calling convention reserves so the callee can spill its four register parameters into memory (or read them back) when it needs those registers for something else. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first.

p.Q(~0L, ~1L, ~2L, ~3L);
0000010a  mov         rcx,qword ptr [rsp+30h] ; this pointer
0000010f  mov         qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ;~3L, fifth parameter counting this, passed on the stack
00000118  mov         r9,0FFFFFFFFFFFFFFFDh ;~2L
0000011f  mov         r8,0FFFFFFFFFFFFFFFEh ;~1L
00000126  mov         rdx,0FFFFFFFFFFFFFFFFh ;~0L
0000012d  call        FFFFFFFFFFEC9310
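
The x64 placement seen in the two listings above can be modelled in a few lines of C++. This is only a sketch: placeX64 is a made-up helper, and it assumes the usual Windows x64 rule that the first four slots are assigned by position and that stack parameters start 20h bytes above RSP, as the [rsp+20h] and [rsp+28h] stores above suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Returns where each parameter is passed; true marks a float or double.
// Hypothetical helper modelling the rules above, not a CLR API. For an
// instance method, list the this pointer first (as false).
std::vector<std::string> placeX64(const std::vector<bool>& isFloatingPoint) {
    static const char* const intRegs[] = {"RCX", "RDX", "R8", "R9"};
    static const char* const fpRegs[] = {"XMM0", "XMM1", "XMM2", "XMM3"};
    std::vector<std::string> where;
    for (std::size_t i = 0; i < isFloatingPoint.size(); ++i) {
        if (i < 4) {
            // The register is picked by position, so a float in slot 2 uses XMM2.
            where.push_back(isFloatingPoint[i] ? fpRegs[i] : intRegs[i]);
        } else {
            // Later parameters sit on the stack, above the 32 bytes (20h)
            // reserved for the four register slots, 8 bytes apiece.
            char buf[32];
            std::snprintf(buf, sizeof buf, "[RSP+%Xh]",
                          static_cast<unsigned>(0x20 + 8 * (i - 4)));
            where.push_back(buf);
        }
    }
    assert(where.size() == isFloatingPoint.size());
    return where;
}
```

For Z('c', 2, 3.0f, "Hello", 1.0, pa) the model returns RCX, RDX, XMM2, R9, [RSP+20h], [RSP+28h], and for p.Q with the this pointer listed first it returns RCX, RDX, R8, R9, [RSP+20h]. Both match the disassembly above.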

Calling a function and passing something larger than a register can hold does pose an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing the address of that copy (hence call by value):

var v = new Vector();
p.R(v);
00000169  lea         rcx,[rsp+40h] 
0000016e  mov         rax,qword ptr [rcx] 
00000171  mov         qword ptr [rsp+50h],rax 
00000176  mov         rax,qword ptr [rcx+8] 
0000017a  mov         qword ptr [rsp+58h],rax 
0000017f  lea         rdx,[rsp+50h] 
00000184  mov         rcx,r8 
00000187  call        FFFFFFFFFFEC9318

As you can see, it copies the data from the vector onto the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.
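
The same trade-off exists in C++, which makes it an easy place to see the difference in behaviour. The sketch below uses a made-up 16-byte Vector and two hypothetical functions. The machine code a C++ compiler emits for these calls will not be the same as the CLR’s, so read it as an illustration of the semantics only.

```cpp
#include <cassert>

// Hypothetical 16-byte value type standing in for the Vector above.
struct Vector {
    double x;
    double y;
};

// By value: the callee works on its own copy of the 16 bytes.
void scaleByValue(Vector v, double factor) {
    v.x *= factor;  // changes only the copy
    v.y *= factor;
}

// By reference: only an address is passed, and the callee updates the
// caller's object in place.
void scaleByRef(Vector& v, double factor) {
    v.x *= factor;
    v.y *= factor;
}

// Small self-check showing the difference between the two calls.
void demo() {
    Vector v{1.0, 2.0};
    scaleByValue(v, 2.0);
    assert(v.x == 1.0 && v.y == 2.0);  // caller's object is untouched
    scaleByRef(v, 2.0);
    assert(v.x == 2.0 && v.y == 4.0);  // caller's object is updated
}
```

The by-value call has to copy the whole struct before the callee can run, and that copy is the cost that passing by reference avoids.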

All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs):

BITS        32
ORG         0x59f0
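;           void Multiply(ref Matrix, ref Matrix, out Matrix)
;           ecx = left, edx = right, [esp + 4] = result
start:      mov     eax, [esp + 4] ; eax = address of result
            movups  xmm4, [edx] ; xmm4..xmm7 = rows 0..3 of right
            movups  xmm5, [edx + 0x10]
            movups  xmm6, [edx + 0x20]
            movups  xmm7, [edx + 0x30]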
            
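            movups  xmm0, [ecx] ; row 0 of left
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00 ; broadcast element 0 to all four lanes
            shufps  xmm1, xmm1, 0x55 ; broadcast element 1
            shufps  xmm2, xmm2, 0xAA ; broadcast element 2
            shufps  xmm3, xmm3, 0xFF ; broadcast element 3
            
            mulps   xmm0, xmm4 ; element 0 * row 0 of right
            mulps   xmm1, xmm5 ; element 1 * row 1 of right
            mulps   xmm2, xmm6 ; element 2 * row 2 of right
            mulps   xmm3, xmm7 ; element 3 * row 3 of right
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1 ; xmm0 = row 0 of the product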
            
            movups  [eax], xmm0 ; store row 0 of new matrix
            
            movups  xmm0, [ecx + 0x10]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [eax + 0x10], xmm0 ; store row 1 of new matrix
            
            movups  xmm0, [ecx + 0x20]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [eax + 0x20], xmm0 ; store row 2 of new matrix
            
            movups  xmm0, [ecx + 0x30]
            movaps  xmm1, xmm0
            movaps  xmm2, xmm0
            movaps  xmm3, xmm0
            shufps  xmm0, xmm0, 0x00
            shufps  xmm1, xmm1, 0x55
            shufps  xmm2, xmm2, 0xAA
            shufps  xmm3, xmm3, 0xFF
            
            mulps   xmm0, xmm4
            mulps   xmm1, xmm5
            mulps   xmm2, xmm6
            mulps   xmm3, xmm7
            addps   xmm0, xmm2
            addps   xmm1, xmm3
            addps   xmm0, xmm1
            
            movups  [eax + 0x30], xmm0 ; store row 3 of new matrix
            ret     4
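
For anyone who would rather read the algorithm than the assembly, the same broadcast, multiply and add pattern can be sketched with SSE intrinsics in C++. This is not the code SlimGen injects. The Matrix layout (16 contiguous row-major floats), the function names and the SKETCH_USE_SSE macro are all assumptions made for the example, and a scalar version mirroring the .NET source is included so the two can be compared.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // SSE intrinsics
#define SKETCH_USE_SSE 1
#else
#define SKETCH_USE_SSE 0
#endif

// Assumed layout: 16 contiguous floats, row-major (M11..M14, M21..M24, ...).
struct Matrix {
    float m[16];
};

// Scalar reference mirroring the .NET source:
// r[i][j] = sum over k of left[i][k] * right[k][j].
void multiplyScalar(const Matrix& left, const Matrix& right, Matrix& result) {
    Matrix r = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[4 * i + j] += left.m[4 * i + k] * right.m[4 * k + j];
    result = r;
}

void multiply(const Matrix& left, const Matrix& right, Matrix& result) {
    // Same stated precondition as the assembly: the output is not an input.
    assert(&result != &left && &result != &right);
#if SKETCH_USE_SSE
    // Rows of right stay in registers for the whole call (xmm4..xmm7 above).
    const __m128 r0 = _mm_loadu_ps(right.m + 0);
    const __m128 r1 = _mm_loadu_ps(right.m + 4);
    const __m128 r2 = _mm_loadu_ps(right.m + 8);
    const __m128 r3 = _mm_loadu_ps(right.m + 12);
    for (int row = 0; row < 4; ++row) {
        const __m128 l = _mm_loadu_ps(left.m + 4 * row);
        // Shuffle masks 0x00, 0x55, 0xAA, 0xFF copy element 0, 1, 2, 3
        // of the row into all four lanes.
        const __m128 e0 = _mm_mul_ps(_mm_shuffle_ps(l, l, 0x00), r0);
        const __m128 e1 = _mm_mul_ps(_mm_shuffle_ps(l, l, 0x55), r1);
        const __m128 e2 = _mm_mul_ps(_mm_shuffle_ps(l, l, 0xAA), r2);
        const __m128 e3 = _mm_mul_ps(_mm_shuffle_ps(l, l, 0xFF), r3);
        // Same pairing as the addps sequence: (e0 + e2) + (e1 + e3).
        _mm_storeu_ps(result.m + 4 * row,
                      _mm_add_ps(_mm_add_ps(e0, e2), _mm_add_ps(e1, e3)));
    }
#else
    multiplyScalar(left, right, result);  // fallback so the sketch builds off x86
#endif
}
```

The unaligned loads and stores (_mm_loadu_ps, _mm_storeu_ps) correspond to the movups instructions in the listing. Because the SSE path adds the four products in a different order than the scalar version, the two can differ in the last bit for arbitrary inputs, so compare with a tolerance unless the values are exactly representable.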