Explorations on Random Numbers, Part 4

Astute readers will notice that the sample code in the previous post passes in a constant value as the seed to the random number generator. What would happen if we tried passing in the current time? Obviously, that would completely prevent the optimization we saw last time that just calculates the result, since the compiler won’t know ahead of time what time you’re running your program. So, let’s compile this code down:

int main()
{
	RAND tRand;
	RandInit(tRand, time(NULL));

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);

	return 0;
}

This is the code that the compiler generates for RandGetNum() in x86:

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; edx:eax contains v
0137101B  imul        eax,eax,9069h  ; Multiply low part of V by 36969
01371021  add         edx,eax        ; Add multiplied low part and high part

Holy crap! That’s just TWO instructions for the whole thing! That’s even better than the 64-bit version! Speaking of which, what does that look like now?

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; rax contains v
000000013F881024  mov         rdx,rax        ; Copy v into rdx
000000013F881027  shr         rdx,20h        ; Set rdx to the high part of v
000000013F88102B  imul        eax,eax,9069h  ; Multiply the low part of v by 36969
000000013F881031  add         edx,eax        ; Add multiplied low part to high part of v

Hmmm… That hasn’t changed much at all from the looped version we saw earlier. In fact, it’s exactly the same set of operations: move, shift, multiply, add. Something interesting does come out of this, though, and that is that the 32-bit code and the last two instructions in the 64-bit code are identical, which means that the preceding two instructions in the 64-bit code are just “ceremony”. That is, they’re getting the right data into the right registers for the later operations to “just work.” In the 32-bit code, the high and low parts are already split up by definition, since the only way to operate on a 64-bit register using 32-bit instructions is to split it up into two 32-bit registers. In the 64-bit code, there is an easy way to get the lower 32-bits of a register (rax is a 64-bit extension of eax in the same way that eax is a 32-bit extension of ax), but no built-in way to get the upper 32-bits other than copying the value and shifting it right.

Ultimately, we have to trust that the compiler has some understanding of what the code is doing, and how to translate that into efficient and correct machine code. It can be surprising, though, just exactly how much understanding the compiler has about what’s going on and what the best way to organize the code is. The whole point of these last two posts has been to delve into what the compiler is actually doing (don’t be afraid of the disassembly window!).

Your code is going to be running on a real computer, and having an understanding of what the computer is actually doing and how the code is executed is part of what makes a great programmer. It can be interesting, educational, and fun to see what code the compiler is generating.

Leave a Comment

Explorations on Random Numbers, Part 2.5

Before we delve into the nitty-gritty details of how the compiler optimizes our MWC RNG, I’d like to present our full code from the GSDEngine (modified slightly for public consumption). All of this code is released into the public domain.

Here’s a C-style version (but still compiled with C++ compiler):

// This is based off of George Marsaglia's post to Sci.Stat.Math
// http://www.bobwheeler.com/statistics/Password/MarsagliaPost.txt
// This is the Multiply With Carry random number generator, modified
// to use a single 64-bit seed to create a 32-bit RNG, rather than
// gluing together two 16-bit RNGs (and thus two seeds)

// Default V value is the conglomeration of the default Z and W values from the Marsaglia MWC RNG
#define DEFAULT_V_VALUE 0x159A55E51F123BB5

struct RAND
{
	unsigned long long v;
};

inline void RandInit(RAND& tRand, unsigned long long seed)
{
	if(seed == 0)
		seed = DEFAULT_V_VALUE;

	tRand.v = seed;
}

// Gets a number in the range [0, UINT_MAX]
inline unsigned int RandGetNum(RAND& tRand)
{
	tRand.v = 36969 * (tRand.v & 0xFFFFFFFF) + (tRand.v >> 32);
	return tRand.v & 0xFFFFFFFF;
}

// Gets a random number in the range [0, Max) (that is, from 0 to Max-1)
inline unsigned int RandGetNum(RAND& tRand, unsigned int nMax)
{
	return nMax ? RandGetNum(tRand) % nMax : RandGetNum(tRand);
}

// Gets a random number in the range [Min, Max]
// You will get weird results if Min is greater than Max
inline int RandGetNum(RAND& tRand, int nMin, int nMax)
{
	return RandGetNum(tRand, (nMax - nMin + 1)) + nMin;
}

// Returns a floating point number in the range [0,1]
inline float RandGetFloat(RAND& tRand)
{
	const float fInverse = 1.0f / (float)0xFFFFFFFF;
	return RandGetNum(tRand) * fInverse;
}

Or, if you prefer C++:

// This is based off of George Marsaglia's post to Sci.Stat.Math
// http://www.bobwheeler.com/statistics/Password/MarsagliaPost.txt
// This is the Multiply With Carry random number generator, modified
// to use a single 64-bit seed to create a 32-bit RNG, rather than
// gluing together two 16-bit RNGs (and thus two seeds)

// Default V value is the conglomeration of the default Z and W values from the Marsaglia MWC RNG
#define DEFAULT_V_VALUE 0x159A55E51F123BB5

class RandomNumberGenerator
{
private:
	unsigned long long v;

public:
	RandomNumberGenerator(unsigned long long seed = 0)
	{
		if(seed == 0)
			seed = DEFAULT_V_VALUE;

		v = seed;
	}

	// Gets a number in the range [0, UINT_MAX]
	unsigned int GetNum()
	{
		v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
		return v & 0xFFFFFFFF;
	}

	// Gets a random number in the range [0, Max) (that is, from 0 to Max-1)
	unsigned int GetNum(unsigned int nMax)
	{
		return nMax ? GetNum() % nMax : GetNum();
	}

	// Gets a random number in the range [Min, Max]
	// You will get weird results if Min is greater than Max
	int GetNum(int nMin, int nMax)
	{
		return GetNum((nMax - nMin + 1)) + nMin;
	}

	// Returns a floating point number in the range [0,1]
	float GetFloat()
	{
		const float fInverse = 1.0f / (float)0xFFFFFFFF;
		return GetNum() * fInverse;
	}

	// Gets a random number in the range [0, Max) (that is, from 0 to Max-1)
	unsigned int operator ()(unsigned int nMax = 0)
	{
		return nMax ? GetNum(nMax) : GetNum();
	}
};

Let’s write some code to use this RNG:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);

	return 0;
}

Or, with the C++ class:

int main()
{
	RandomNumberGenerator tRand;

	unsigned int nRandom = tRand.GetNum();
	printf("%u\n", nRandom);

	return 0;
}

Next time, we’ll really delve into compiler optimizations, I promise!

Leave a Comment

Explorations on Random Numbers, Part 3

The last few posts, we’ve explored various random number generators, and ended up with a very simple and solid implementation of George Marsaglia’s Multiply-With-Carry RNG. At this point, we’re done talking about random number generators. We’re going to switch gears and talk about how the compiler actually takes the RNG code and turns it into machine code.

Warning: we will be delving into assembly language, so be prepared!

Let’s start with some very simple code that uses our MWC RNG:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);

	return 0;
}

Now, we’re interested in performance, so the question is: what sort of code does the compiler actually generate for this? Obviously, debug mode is not going to be very interesting for performance, so let’s compile in x86 Release mode and see what the compiler comes up with:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);
00D11000  push        15AAA322h
00D11005  push        offset string "%u\n" (0D120F4h)
00D1100A  call        dword ptr [__imp__printf (0D120A0h)]
00D11010  add         esp,8
	return 0;
00D11013  xor         eax,eax
}

Wait a minute. What? Let’s translate that assembly language back into C code:

printf("%u\n", 0x15AAA322);

Wow, the compiler has completely optimized away the call to RandGetNum(). It has calculated what the output of the formula would be, and just replaced the result. What happens if we try it twice:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);
	return 0;
}

The compiler generates:

01261000  push        0DEC01A79h
01261005  push        offset string "%u\n" (12620F4h)
0126100A  call        dword ptr [__imp__printf (12620A0h)]
01261010  add         esp,8
01261013  xor         eax,eax

Practically the same thing, just with a different constant. Let’s try a loop:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom;
	for(int i=0; i<5; i++)
	{
		nRandom = RandGetNum(tRand);
	}
	printf("%u\n", nRandom);
	return 0;
}

The compiler generates:

011B1006  push        9722BEA6h
011B100B  push        offset string "%u\n" (11B20F4h)
011B1010  call        dword ptr [__imp__printf (11B20A0h)]
011B1016  add         esp,8
011B1019  xor         eax,eax

The same thing! Wow. In fact, the Visual Studio compiler will follow the rabbit hole down as far as 10 times. After that, it gives up and starts to generate code. So, if we take i up to 11, then the compiler generates the following for the inside of the loop:

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; ecx:eax contains v
01271017  mov         edx,9069h   ; Move 36969 into edx
0127101C  mul         eax,edx     ; eax contains the low part of V, so do the multiply
                                  ; and store the result in edx:eax
0127101E  xor         edi,edi     ; Clear edi
01271020  add         eax,ecx     ; Add the high part of V to the multiplied low part
                                  ; of V
01271022  adc         edx,edi     ; edi is 0, so this adds the carry flag from the
                                  ; eax+ecx operation to edx
01271024  dec         esi         ; decrement loop counter
01271025  mov         ecx,edx     ; Move edx (our temp register) into ecx (the high
                                  ; part of V)
01271027  jne         main+17h (1271017h)  ; loop

Neat! Excluding the loop instructions (dec/jne), our RNG boils down to six instructions in x86. It’s worth noting that, although 64-bit operations are nontrivial in 32-bit assembly code, this algorithm is particularly well-suited for it, because it only operates on 32-bit pieces of the 64-bit number.

Of course, that’s really just a pleasant side-effect. At its core, this really is a 64-bit operation, so let’s compile the same code for x64 and see what happens:

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; rax contains v
000000013FB91020  mov         edx,eax         ; Move the low part of V into edx (and
                                              ; thus into rdx)
000000013FB91022  shr         rax,20h         ; Shift rax right 32, making rax contain
                                              ; the high part of V
000000013FB91026  imul        rdx,rdx,9069h   ; Multiply the low part of V by 36969
000000013FB9102D  add         rax,rdx         ; Add the multiplied low part to the
                                              ; high part
000000013FB91030  dec         r8              ; decrement loop counter
000000013FB91033  jne         main+20h (13FB91020h)  ; loop

The 64-bit version is even cooler: just four instructions, each of which maps exactly to an operation in the formula.

If we ever had any doubt as to whether the MWC RNG is efficient, this definitely puts those doubts to rest. We’ve also seen how the compiler can track down constant operations even across looping constructs.

Next time: It gets better

Leave a Comment

Explorations on Random Numbers, Part 2

Last time, we saw how to use rand() and why rand() is unacceptable. We also took a look at the Mersenne Twister, which is definitely a huge improvement over rand(), but we were put off by its memory footprint of 2.5K. So, how can we do better than Mersenne Twister?

Enter George Marsaglia, in a posting to Sci.Stat.Math on January 12th, 1999. (Either that, or December 1st, 1999, depending on how Bob Wheeler writes his dates.) In that post, he puts down 17 lines of code that implement 8 different RNGs. That sounds really promising, at least as a starting point. Let’s pick one of those and run with it and see where it takes us.

The RNG that we’re going to be using as our basis is Multiply With Carry (MWC). Let’s isolate Marsaglia’s code for just the MWC RNG:

#define UL unsigned long
#define znew  ((z=36969*(z&65535)+(z>>16))<<16)
#define wnew  ((w=18000*(w&65535)+(w>>16))&65535)
#define MWC   (znew+wnew)
static UL z=362436069, w=521288629;
void settable(UL i1, UL i2)
 { z=i1; w=i2; }

Hrm. The first obvious problem is that this RNG uses a static context. Fortunately, it’s small and simple enough that we can fix that on our own by taking z and w and putting them into a struct or class. But before we go down that rabbit hole, let’s examine what this code is actually doing:

We’ll start with the MWC macro. MWC adds the newly-generated Z value to the newly-generated W value. Fair enough. What do those do? The first thing you’ll notice is that there’s a pattern to the Z and W updates, so we’ll take apart the algorithm for the Z/W update:

  • Split off the 32-bit value into two 16-bit values, which we’ll call L for low and H for high.
  • Multiply L by the constant C (36969 for Z and 18000 for W).
  • Add the result to H. This is your new Z or W value.
  • Return the bottom 16 bits of the value. For Z, shift them over into the high 16 bits of the return value.

From this, we can determine that Z and W are 32-bit values, each of which represents a 16-bit RNG. The MWC RNG generates its 32-bit value by combining the two 16-bit RNGs into a single value in the high and low words of the value. Now that we understand that, what does it mean? It means that we have to pick *good* values for our initial Z and W, because if we don’t, then we really only have 16 bits worth of randomness.

It kind of sucks to have to pick two seeds for an RNG. Can we improve on that? How can we take Marsaglia’s original MWC RNG and improve it by just requiring one seed? Well, let’s look at what we determined about the Z and W values: each one is a 32-bit value that represents a 16-bit portion of the 32-bit RNG. Maybe we can extend that concept by selecting a 64-bit value which represents a 32-bit RNG. We’ll start by just updating Marsaglia’s code, then examine the process of making it practical for every-day use:

#define ULL unsigned long long
#define vnew  (v=36969*(v&0xFFFFFFFF)+(v>>32))
#define MWC   (vnew&0xFFFFFFFF)
static ULL v=0x159A55E51F123BB5; // 64-bit value compiled from original z and w as z<<32 + w
void settable(ULL i1)
 { v=i1; }

Beautiful! It’s tiny, it’s fast, it’s deterministic, it’s got decently long cycles…the only problem is the static context. Fortunately, that one is easy to fix. Let’s start by making a structure to hold the V value:

struct RAND
{
	unsigned long long v;
};

Now a function to initialize it:

#define DEFAULT_V_VALUE 0x159A55E51F123BB5
inline void RandInit(RAND& tRand, unsigned long long seed)
{
	if(seed == 0)
		seed = DEFAULT_V_VALUE;

	tRand.v = seed;
}

And finally, a function to actually generate a random number:

inline unsigned int RandGetNum(RAND& tRand)
{
	tRand.v = 36969 * (tRand.v & 0xFFFFFFFF) + (tRand.v >> 32);
	return tRand.v & 0xFFFFFFFF;
}

So there you have it! A small, fast, deterministic, long-cycled, non-static context RNG.

Next time: Exploring Compiler Optimizations

Leave a Comment

Explorations on Random Numbers, Part 1

In C, since time immemorial, the easiest way of getting a random number is thus:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    srand(time(NULL));
    int nRandom = rand();
    printf("%d\n", nRandom);
    return 0;
}

Of course, this only gets you part of the way there. The problems with rand() are well-known and well-documented, but the brief summary is:

  • rand() only provides you with a random number up to RAND_MAX, which, on many compilers (including Visual Studio), is 0x7FFF.
  • rand() uses a static context. That is, I cannot have two separate Random Number Generators (RNGs) running at the same time with different seeds.
  • Less interestingly for us, but still nonetheless true, rand() is not acceptable for use in cryptography situations

I could go on and on, but this post isn’t about what’s wrong with rand(). rand() is just a starting point. What we’re after is an RNG that is effective and efficient for use in games. Specifically, we need it to have the following properties:

  • It must be small (memory-wise)
  • It must be fast
  • It must be deterministic given the same seed
  • It must have a long enough cycle that any practical day-to-day use will not experience the cycles
  • It must have a non-static context

Please note:

  • If you need an RNG for cryptographic purposes, use a library that is cryptographically strong. I repeat, none of this subject matter applies to any security or cryptographic subject!

The quote-unquote “standard” wisdom prevalent on the Internet is “Just use Mersenne Twister and be done with it.” Hmmm… What is this “Mersenne Twister”? Wikipedia to the rescue! Wikipedia also has a link to the Mersenne Twister home page. So, let’s examine Mersenne Twister against the above requirements: it’s deterministic, long-cycled, and has a non-static context. All good things. The algorithm is simple and seems fast. However, we’re not going to delve into the speed issue, because my eye is immediately drawn to the following line at the very beginning of the pseudocode presented in the Wikipedia page:

// Create a length 624 array to store the state of the generator
int[0..623] MT
int index = 0

In other words, the Mersenne Twister context uses up 2500 bytes! That’s 2.5K for an RNG! Granted, a very good one, but still…2.5K! Surely we can do better than that.

Next time: how we can do better.

Leave a Comment

Tangle Trap’s First Blender Foundation Support

Well folks, Tangle Trap has been out for almost two months now and we’ve finally received the reports for its first month of sales. As promised, 25% of the opening weekend sales and 5% of ongoing sales will be donated to the Blender Foundation to help support the excellent work they do. Here’s the skinny:

Opening Weekend: $33.48 in sales equals $8.37 for the Blender Foundation.

April 11 – May 8: $57.71 in sales equals $2.89 for the Blender Foundation.

That’s a total of $11.26, which comes out to EUR 7.89. Not huge, but enough to get the train rolling!

Leave a Comment

Multithreaded Programming Part 3: Going Lockless

Ultimately, the whole point of a lock is to take a multi-part operation and make it atomic. The AddDagron() function from the previous post in the series, for example, takes the operations of incrementing the Dagron count, reallocating the array of Dagron pointers, and adding the new Dagron to the array, and effectively turns it into one operation. In order to accomplish that, though, we have a large structure (CRITICAL_SECTION takes up around 24 bytes in 32-bit), and we are forced to acknowledge that our quote-unquote “atomic” operation may actually be broken up into multiple pieces.

Wouldn’t it be cool if we could actually execute a complex sequence of operations in one CPU instruction? If we had such an operation, then there would be no way to interrupt it in the middle, and it would be truly atomic.

Fortunately, if we’re willing to narrow the scope of what operations we’d like to perform, then we already live in that world. Modern processors provide a set of instructions that let you perform some very interesting operations in precisely one CPU instruction.

You can see a full list of all Interlocked operations in Windows on this page. They all have esoteric and slightly scary-sounding names like InterlockedCompareExchangePointerAcquire. While I could tell you not to worry and that it’s not that complicated, I also don’t want to give a false sense that it’s not that complicated or hard.

So, let me get this out right now:

Tip #5: Lockless programming is complicated. It is hard. It is not for the faint of heart. It will suck your soul, give you heartburn, a migraine, and an ulcer. And that’s only if you’re doing the easy stuff.

That said, let’s delve right in! Understanding what the Interlocked functions do takes a bit of practice, but the one thing you have to keep in mind is that we are performing multiple operations in one atomic instruction. Let’s take a look at some examples, which I hope will illustrate the concept:

Note: All of these examples are PSUEDOCODE! If you try to use them as real code, then shame on you. We’ll start with the basics:

// PSEUDOCODE!  Do not use!  I repeat, DO NOT USE!
DWORD InterlockedIncrement(volatile DWORD* pValue)
{
    return ++(*pValue);
}

DWORD InterlockedDecrement(volatile DWORD* pValue)
{
    return --(*pValue);
}

Well, that’s anticlimactic, right? What’s so special about these operations? Let’s take a look at the non-interlocked version of the same code in a sort of pseudoassembly:

; ++(*pValue)
mov reg1, *pValue  ; Read *pValue into a register
inc reg1           ; Increment the value
mov *pValue, reg1  ; Write the value back into memory

Hmmm… That’s three operations. As per Tip #2, our code can be interrupted at any time, which means that somebody could come in and change the value of pValue after I’ve already read it, and then the value that I’m writing back to memory is wrong.

The InterlockedIncrement() and InterlockedDecrement() functions perform the same three (actually four, with the return) operations, but in one single atomic instruction, which is guaranteed by the operating system to be atomic. (Note that, while many or most of the Interlocked operations will be implemented at the CPU level as a single instruction, there is nothing requiring that to be the case. That’s why this guarantee is made by the OS, not by the CPU.)

Let’s take a look at a more interesting example:

// PSEUDOCODE!  Do not use!  I repeat, DO NOT USE!
DWORD InterlockedCompareExchange(volatile DWORD* pValue, DWORD dwExchange, DWORD dwComparand)
{
    if(*pValue == dwComparand)
    {
        *pValue = dwExchange;
        return dwComparand;
    }
    return *pValue;
}

Wow! All that in one CPU instruction. It’s probably not entirely obvious why that particular function is useful. The brief version of the answer is that it lets you try to assign a new value, but only if no other thread has come in and messed with your data in the meantime. You often see code like this:

DWORD dwOriginalValue = dwValue;
DWORD dwNewValue = SomeCalculation(dwValue);
if(InterlockedCompareExchange(&dwValue, dwNewValue, dwOriginalValue) != dwOriginalValue)
{
    // Some other thread has come in and messed with the value at the same time!
    // Deal with the situation.  What that means is very specific to your situation, but
    // there are a number of useful patterns.
}

I will not go into a lot of detail here on patterns and practices with lockless programming for the simple reason that somebody far more qualified than I has recently done a full 15-part blog series on lockless programming. Instead of attempting to do justice to Raymond Chen’s expertise, I will simply link to his posts:

Lock-free algorithms: Choosing a unique value (warm-up)
Lock-free algorithms: The singleton constructor
Lock-free algorithms: Choosing a unique value (solutions)
Lock-free algorithms: The one-time initialization
Lock-free algorithms: The singleton constructor (answer to exercises)
Patterns for using the InitOnce functions
Lock-free algorithms: The try/commit/(try again) pattern
Lock-free algorithms: Update if you can I’m feeling down
Lock-free algorithms: The opportunistic cache
Lock-free algorithms: The try/commit/(hand off) model
Don’t forget to include the message queue in your lock hierarchy
Visual Studio 2005 gives you acquire and release semantics for free on volatile memory access
Corrections to Patterns for using the InitOnce functions
The performance improvements of a lock-free algorithm is often not in the locking
Even if you have a lock, you can borrow some lock-free techniques

I heartily encourage you to read these posts (and the rest of Raymond’s blog). They are pure gold. Which brings me up to the next and final tip:

Tip #6: Just because lockless programming is hard, doesn’t mean you shouldn’t do it.

It just means that you should educate yourself, and that you should be diligent and very very careful when you do.

I hope that this series has been helpful and educational. Be sure to check out Tangle Trap if you have an iOS device, and keep an eye on this space for more technical articles, announcements, and updates from me and the rest of the BitFlip Games staff!

Leave a Comment

Multithreaded Programming Part 2.5: MRSW Lock Code

<– Part 2.

As promised, here is our modified version of Glenn Slayden‘s Multi-Reader-Single-Writer lock code. We’d like to thank Glenn for posting the original code.

//
// Multi-reader Single-writer concurrency base class for Win32
//
// (c) 1999-2003 by Glenn Slayden (glenn@glennslayden.com)
//
// http://www.glennslayden.com/code/win32/reader-writer-lock
// This code has been modified from the original, but retains the original spirit.
// The major differences are:
// - Constructor/Destructor turned into Init/Free calls to better match with the
//   rest of the GSD Engine code.  (If you are using this in your own code, feel
//   free to switch this back.)
// - Added UpgradeToWriter() and DowngradeToReader()
// - Reorganized internals into functions
// - Commented the hell out of it.
class MultiReaderSingleWriterLock
{
private:
	CRITICAL_SECTION m_csWrite;         // The writer lock
	CRITICAL_SECTION m_csReaderCount;   // Lock to protect m_cReaders
	long m_cReaders;                    // The current count of readers
	HANDLE m_hevReadersCleared;         // Waitable event which, when set, indicates that there are NO readers

public:
	// Initiailize the lock.  If you are taking this for your own code, feel free to turn this
	// into a constructor.
	void Init()
	{
		m_cReaders = 0;
		InitializeCriticalSection(&m_csWrite);
		InitializeCriticalSection(&m_csReaderCount);
		m_hevReadersCleared = CreateEvent(NULL, TRUE, TRUE, NULL);
	}

	// Free the lock.  If you are taking this for your own code, feel free to turn this into
	// a destructor.
	void Free()
	{
		// Make sure that there are no more readers before we free
		WaitForSingleObject(m_hevReadersCleared, INFINITE);
		CloseHandle(m_hevReadersCleared);
		DeleteCriticalSection(&m_csWrite);
		DeleteCriticalSection(&m_csReaderCount);
	}

	// Enter the critical section as a reader.  You are guaranteed that there will be no
	// writers so long as you hold this lock.
	void EnterReader(void)
	{
		// Enter the writer critical section.  If there are any writers writing, this
		// will prevent new readers from coming in.
		EnterCriticalSection(&m_csWrite);

		// Add as a reader, and ensure that the write to the reader count is locked during the update.
		AddReaderLocked();

		// Leave the writer critical section.  The existence of a reader will prevent
		// the writer from doing the write.
		LeaveCriticalSection(&m_csWrite);
	}

	// Leave the critical section as a reader.  All bets are off as to what the state of
	// the data will be once you've left the lock.
	void LeaveReader(void)
	{
		// Remove as a reader.  The reader count is always locked when removing as a reader.
		RemoveReader();
	}

	// Enter the critical section as a writer.  Once this function returns, you are
	// guaranteed that there are no other writers, and that there are no other readers.  This
	// function blocks waiting for other writers and readers to free up.
	void EnterWriter(void)
	{
		// Enter as a writer.
		EnterCriticalSection(&m_csWrite);

		// Wait for all of the readers to clear out, so that you don't stomp on them.
		WaitForSingleObject(m_hevReadersCleared, INFINITE);
	}

	// Upgrade a reader lock to a writer lock.
	// Be careful when upgrading to writer.  It is not atomic, and somebody can come in
	// and write to your data while you're waiting to acquire the writer lock.  Be sure
	// to re-read any important data after upgrading to a writer.
	void UpgradeToWriter(void)
	{
		// When upgrading to a writer, we cannot do this atomically.  Imagine, for a moment
		// that we reversed the order (or, rather, that we interleaved the processes) of
		// leaving as a reader and entering as a writer.  The new order of operations would
		// be:
		// - Enter writer critical section
		// - Leave as a reader
		// - Wait for all readers to clear
		// This fails if two threads are simultaneously attempting to upgrade to writers.
		// Consider the following:
		// - Thread A enters writer cs
		// - Thread B attempts to enter writer cs, but blocks
		// - Thread A leaves as a reader
		// - Thread A waits for all readers to leave
		// We are now deadlocked, because Thread B is still technically a reader.
		// The only way to make this truly safe is to first leave as a reader, then enter
		// as a writer.  The problem is that this creates a window of opportunity where
		// your thread is effectively not participating in the lock (in between the leave
		// and the enter), so some other thread could come in and write to the data in
		// between these two lines.
		// Thus, the warning above: when you upgrade a reader to a writer, be sure to
		// re-read the data to make sure that you're operating on the latest stuff.
		LeaveReader();
		EnterWriter();
	}

	// Downgrade a writer lock to a reader lock.
	// Downgrading a writer to a reader is atomic.  You do not need to re-read the data
	// after downgrading to a reader.
	void DowngradeToReader(void)
	{
		// When we are a writer, we are *guaranteed* that there are no other writers,
		// AND that there are no other readers.  We are the ONLY thread that is allowed
		// inside of the lock.  So, we are free to add ourselves as a reader *without*
		// locking the reader lock.
		// Let's check that statement:
		// - I own the writer lock, which means that there is nobody in EnterReader(), which
		//   is the only place that AddReaderLocked() is called from.
		// - There are no other readers, so there is nobody who should (legitimately) be calling
		//   RemoveReader()
		// The above two statements mean that there is guaranteed to be no contention for the
		// the reader count lock, so I can definitely add myself as a reader without having to
		// acquire the lock.
		AddReaderNoLock();

		// Release the writer lock.  I have now downgraded myself atomically from a writer to a reader.
		LeaveWriter();
	}

	// Leave the critical section as a writer.  Once this has executed, new readers or writers
	// can come in.  You are not guaranteed to be a reader after LeaveWriter() is called.  If
	// you need to continue reading the data after leaving as a writer, then use DowngradeToReader()
	// instead.
	void LeaveWriter(void)
	{
		// Leave the writer critical section
		LeaveCriticalSection(&m_csWrite);
	}

private:
	// Add a reader without locking the reader count lock
	void AddReaderNoLock()
	{
		// Increment the count of readers.  If we've increased
		// from 0 to 1, then reset the "readers cleared" event.  This means that
		// there IS a reader, and that any code that is waiting for the event
		// will block.
		if (++m_cReaders == 1)
			ResetEvent(m_hevReadersCleared);
	}

	// Add a reader, and protect it with the reader count lock.
	void AddReaderLocked()
	{
		// Protect the increment with the reader count lock
		EnterCriticalSection(&m_csReaderCount);

		// Add the reader
		AddReaderNoLock();

		// Leave the reader count lock
		LeaveCriticalSection(&m_csReaderCount);
	}

	// Remove a reader.  Removes are always protected by the reader count lock.
	void RemoveReader()
	{
		// We always want to protect reader count decrement operations with
		// the reader count lock
		EnterCriticalSection(&m_csReaderCount);

		// Decrement the count of readers.  If we've decreased from 1 to 0, then
		// set the "readers cleared" event.  This means that there are NO readers,
		// and that any code that is waiting for the event will unblock.
		if (--m_cReaders == 0)
			SetEvent(m_hevReadersCleared);

		// Leave the reader count lock
		LeaveCriticalSection(&m_csReaderCount);
	}
};

Go to Part 3.

Comments (1)

Multithreaded Programming Part 2: Multiple Reader/Single Writer Lock

<–Part 1

Last time, we saw how to protect a shared resource with a critical section. Today we’ll look at why a critical section, though technically correct, can cause problems. Let’s write some critical-section-protected code and see what might happen:

SharedResource gSharedResource;
CRITICAL_SECTION g_csResourceLock; // Assume that this has been initialized properly

Glob GlobulateSharedResource()
{
    Glob returnValue;

    EnterCriticalSection(&g_csResourceLock);
    for(int i=0; i<gSharedResource.dagronCount; i++)
    {
        Globulate(returnValue, gSharedResource.dagrons[i]);
    }
    LeaveCriticalSection(&g_csResourceLock);

    return returnValue;
}

Dagron* AddDagron()
{
    Dagron* pNewDagron = new Dagron;

    EnterCriticalSection(&g_csResourceLock);
    gSharedResource.dagronCount++;
    gSharedResource.dagrons = (Dagron**)realloc(gSharedResource.dagrons, gSharedResource.dagronCount * sizeof(Dagron*));
    gSharedResource.dagrons[gSharedResource.dagronCount-1] = pNewDagron;
    LeaveCriticalSection(&g_csResourceLock);

    return pNewDagron;
}

Now, this is fine code. It has no threading issues…or does it? What if our thread function looked like this:

unsigned int __stdcall HappyThread(void* pData)
{
    Glob glob = GlobulateSharedResource();
    Flazmify(glob);
    if(ShouldAddDagron())
      AddDagron();
}

Now we spin up two of those threads. What’s going to happen? Let’s see:

  • Thread A enters GlobulateSharedResource() and acquires the lock
  • Thread B enters GlobulateSharedResource() and attempts to acquire the lock, but the lock is taken, so it blocks
  • Thread A globulates all of the Dagrons, then leaves the critical section and returns its glob
  • Thread B now globulates all of the dagrons, but is interrupted before it can finish the operation
  • Thread A flazmifies the glob, and sees that it should add a Dagron. It enters AddDagron(), but is blocked when trying to acquire the lock
  • Thread B finishes globulating the Dagrons
  • Thread A adds its Dagron, then goes on to Globulate again

Sure, it’s all quote-unquote “correct,” but do you see how in the first couple of steps, only one of the two threads can be globulating the shared resource at the same time, even though there’s no reason to prevent that? You can imagine how much worse it would be if we spun up three, or four, or a dozen of these threads. But we can’t eliminate the lock, because then one thread could be adding a Dagron at the same time as another is reading the list, which would cause all sorts of badness to occur.

This brings us to the next tip:

Tip #4: A contentious lock is the same as single-threaded code.

In the above code, the g_csResourceLock critical section is so contentious that, even though we’ve got multiple threads going at once, only one of them can really be doing work at a time. Ironically, the problem actually gets worse the more threads there are.

So, what do we do? Enter the multiple-reader-single-writer lock. In general, you can have any number of threads reading a particular piece of data, so long as only one thread is actually writing to it. The multi-reader-single-writer lock encapsulates that concept by allowing a thread to enter the lock in different ways: as either a reader or a writer (sometimes called shared mode and exclusive mode respectively). A reader will block out a writer, a writer will block out readers and other writers, but a reader will not block out another reader. Unfortunately, the waters get a bit murkier here when trying to find an implementation of such a construct to use:

  • In Windows, there is the Slim Reader/Writer Lock. Unfortunately, these locks are only available if you’re targeting Windows Vista or higher. If you want to target Windows XP, then you’ve got to search elsewhere.
  • Boost has an implementation. However, if you’re not already using boost, it can be a big dependency to take just for threading primitives, especially since the threading library is one that requires a compile (which can be complicated and time-consuming for Boost).
  • Roll your own using other primitives. This is the route that we’ve gone in the GSD Engine, using Glenn Slayden‘s implementation as a baseline. It’s clean, decently efficient, correct, and works on Windows XP and up. You could very easily convert this to work on another platform as well. We have actually made some minor modifications to that code, which we will post next time (this post is already long enough as is).

Let’s take a look at how we’ll change our code to take advantage of a multi-reader-single-writer lock:

SharedResource gSharedResource;
MultiReaderSingleWriterLock g_csResourceLock; // Assume that this has been initialized properly

Glob GlobulateSharedResource()
{
    Glob returnValue;

    g_csResourceLock.EnterReader();
    for(int i=0; i<gSharedResource.dagronCount; i++)
    {
        Globulate(returnValue, gSharedResource.dagrons[i]);
    }
    g_csResourceLock.LeaveReader();

    return returnValue;
}

Dagron* AddDagron()
{
    Dagron* pNewDagron = new Dagron;

    g_csResourceLock.EnterWriter();
    gSharedResource.dagronCount++;
    gSharedResource.dagrons = (Dagron**)realloc(gSharedResource.dagrons, gSharedResource.dagronCount * sizeof(Dagron*));
    gSharedResource.dagrons[gSharedResource.dagronCount-1] = pNewDagron;
    g_csResourceLock.LeaveWriter();

    return pNewDagron;
}

The thread function remains the same. Let’s take a look at what happens now:

  • Thread A enters GlobulateSharedResource() and enters as a reader
  • Thread B enters GlobulateSharedResource() and enters as a reader
  • Thread A begins globulating the Dagrons, but is interrupted before it can finish
  • Thread B globulates all of its Dagrons
  • Thread B flazmifies the resulting glob
  • Thread B starts to add a new Dagron. However, EnterWriter() blocks because there is a reader
  • Thread A finishes globulating the Dagrons and leaves as a reader
  • Thread A begins flazmifying the glob
  • Thread B is now able to enter as a writer and begin adding a Dagron, but is interrupted before it can finish
  • Thread A finishes flazmifying its glob
  • Thread A skips adding a Dagron, and goes back into globulating the Dagrons
  • Thread A blocks on EnterReader(), because there exists a writer
  • Thread B finishes adding the Dagron, and leaves as a writer
  • Thread B begins globulating the Dagrons
  • etc.

Now threads A and B can globulate the dagrons at the same time, and it’s only when a thread wants to add a new Dagron that there is ever any contention for the lock. Assuming that adding a Dagron is more rare than globulating the list of Dagrons, then we’ve just made a radical improvement in the performance of our code.

Next time: the dreaded lockless constructs.

Go to Part 2.5.

Comments (2)

Multithreaded Programming Part 1: The Critical Section Lock

Most of the time, when we’re programming, we can think about our programs in a very sequential fashion: First do this, then do that, then finish with the other. This is even true with modern, multicore architectures, including the PC, XBox 360, and PlayStation 3. Usually, it’s for one of two reasons: Either there is no concurrency in our system, or the concurrency has been abstracted out so that we don’t have to think about the hard problems associated with it.

Obviously, it’s desirable to be in the second camp, and there are lots of ways to get there. The problem is that, in order to get there properly, somebody has to write the fancy-dancy-multi-threaded-code. Go ahead, guess what I’m working on right now?

Call me “Somebody.”

Writing thread-safe code is hard. Writing efficient thread-safe code is extra hard. Writing efficient, lockless, thread-safe code is mind-bogglingly hard: the sort of thing that will send people to the loony bin if they’re not careful. Fortunately, with a lot of patience, a lot of diligence, and a good source of information, you can avoid the loony bin.

In my current project, I’ve had to use:

  • A simple critical section lock
  • A multi-reader-single-writer lock
  • InterlockedExchangePointer
  • InterlockedCompareExchangePointer
  • InterlockedCompareExchange

All of those are for one data structure. (Well, actually a set of two data structures designed to work together, but you get the picture.) In this series of posts, I’m going to be talking about multithreaded programming, and giving tips that I’ve learned for writing efficient thread-safe code using the locks and lock-free structures I just listed.

I’ll be using the Win32 API functions for all of the code in this series, but there exist equivalents in pretty much any other system that you’d care to use. Now, on to the good stuff: the Critical Section Lock.

A critical section is a lightweight type of mutual exclusion lock. There are many types of mutual exclusion locks, including one which is generally called a mutex. The mutex lock is typically a heavier-weight lock, which can control access to a shared resource across processes. Critical sections are designed to control access to a shared resource within a single process.

This “shared resource” I’m referring to can be anything; it’s defined by your program. So, let’s say that I have the following code:

SharedResource gRes;

void FrobSharedResource()
{
    Frob(gRes);
}

This is fine code, but it falls apart as soon as you have more than one thread trying to Frob() the resource.

Which brings us to my first tip for multithreaded programming. This tip (which I learned from Raymond Chen) is applicable to far more than just multithreaded programming, but it’s particularly helpful here, so here it is:

Tip #1: What if two threads did this?

In our example, Frob() is a void function, so it’s probably transforming the shared resource in some way. If you had two threads both Frobbing the resource at the same time, your code would end up in a strange, inconsistent state. Don’t believe me? What if our Frob function looked like this:

void Frob(SharedResource& res)
{
    if(res.spoo->fleem > res.fleem_limit)
    {
        delete res.spoo;
        res.spoo = NULL;
    }
}

Now, imagine the following scenario:

  • Thread A enters Frob and sees that the spoo has too much fleem
  • Thread A gets interrupted by the OS
  • Thread B gains control
  • Thread B enters Frob and sees that the spoo has too much fleem
  • Thread B gets interrupted by the OS
  • Thread A gains control
  • Thread A deletes the spoo
  • Thread A gets interrupted by the OS
  • Thread B gains control
  • Thread B deletes the spoo
  • CRASH! The spoo has been double-deleted

And that’s in a function with three lines of code. Imagine the debugging when you have some real code with hundreds of lines. Which brings me to the second tip:

Tip #2: Your threads can be interrupted at any time. Yes, even then. Yes, then too. No really, any time means any time.

With very few exceptions, your code can be interrupted at any time for any length of time. There are any number of things that could interrupt your thread: the scheduling quantum could expire, a device interrupt could go off, a timer ticks… Basically, your computer is full of entropy. It’s entirely possible that your thread will wake up, execute exactly one processor instruction, and then get interrupted.

Yeah,” you may say, “but what are the odds of that happening in my code? It’s gotta be, like, one in a million!

This brings up the third tip:

Tip #3: One in a million is next Tuesday.

You need to deal with these situations. If your code is executed a million times in a week, then one in a million is next Tuesday. Somebody is going to run into that exact sequence, and your program is going to crash, and you’ll have nobody to blame but yourself.

Fix your threading bugs before they happen. If you see it, fix it. If you don’t see any problems, go looking for them, because chances are they exist. Don’t depend on chance to save your butt, because it won’t. I approach every threaded data structure that I have to debug as though every line of code contains at least one threading bug.

So, we’ve seen all of the problems that this little chunk of code can cause. Let’s see what we can do to fix it.

Let’s take a closer look at the critical section. In Win32, we would do the following:

SharedResource gRes;
CRITICAL_SECTION g_csResourceLock;

void InitSharedResourceLock()
{
    InitializeCriticalSection(&g_csResourceLock);
}

void ShutdownSharedResourceLock()
{
    DeleteCriticalSection(&g_csResourceLock);
}

void FrobSharedResource()
{
    EnterCriticalSection(&g_csResourceLock);
    Frob(gRes);
    LeaveCriticalSection(&g_csResourceLock);
}

We add some new functions to initialize and clean up the critical section object. The magic happens with the EnterCriticalSection() and LeaveCriticalSection() functions. The Enter function will acquire an exclusive lock, so that no other thread can access the lock at the same time. If another thread tries to access the lock, then it will be blocked until the first thread leaves the critical section. It’s a simple synchronization primitive that is lightweight and very effective.

Let’s try to break this code by running through a “what if two threads did this” problem:

  • Thread A enters FrobSharedResource and enters the critical section
  • Thread A gets interrupted, Thread B gains control
  • Thread B enters FrobSharedresource() and attempts to enter the critical section, but it’s already owned by Thread A, so it blocks, Thread A gains control
  • Thread A frobs the resource
  • Thread A gets interrupted, but Thread B is blocked waiting for the critical section, so Thread A gains control again
  • Thread A leaves the critical section
  • Thread A gets interrupted, Thread B gains control
  • Thread B now has the critical section
  • Thread B frobs the resource
  • Thread B leaves the critical section

That worked well. Let’s try again:

  • Thread A enters FrobSharedResource, but gets interrupted before it can enter the critical section
  • Thread B gains control
  • Thread B enters FrobSharedResource, and enters the critical section
  • Thread B gets interrupted, Thread A gains control
  • Thread A attempts to enter the critical section, but it’s already owned by Thread B, so it blocks, Thread B gains control
  • From here, execution is basically the same as before, except that Threads A and B are reversed.

It works! With two lines of code, we’ve eliminated our threading issues.

Next time, we’ll look at some situations where a critical section isn’t quite good enough for our needs.

Go to Part 2.

Leave a Comment