The last few posts, we’ve explored various random number generators, and ended up with a very simple and solid implementation of George Marsaglia’s Multiply-With-Carry RNG. At this point, we’re done talking about random number generators. We’re going to switch gears and talk about how the compiler actually takes the RNG code and turns it into machine code.

Warning: we will be delving into assembly language, so be prepared!

Let’s start with some very simple code that uses our MWC RNG:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);

	return 0;
}

Now, we’re interested in performance, so the question is: what sort of code does the compiler actually generate for this? Obviously, debug mode is not going to be very interesting for performance, so let’s compile in x86 Release mode and see what the compiler comes up with:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);
00D11000  push        15AAA322h
00D11005  push        offset string "%u\n" (0D120F4h)
00D1100A  call        dword ptr [__imp__printf (0D120A0h)]
00D11010  add         esp,8
	return 0;
00D11013  xor         eax,eax
}

Wait a minute. What? Let’s translate that assembly language back into C code:

printf("%u\n", 0x15AAA322);

Wow, the compiler has completely optimized away the call to RandGetNum(). It has calculated what the output of the formula would be, and just replaced the result. What happens if we try it twice:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom = RandGetNum(tRand);
	nRandom = RandGetNum(tRand);
	printf("%u\n", nRandom);
	return 0;
}

The compiler generates:

01261000  push        0DEC01A79h
01261005  push        offset string "%u\n" (12620F4h)
0126100A  call        dword ptr [__imp__printf (12620A0h)]
01261010  add         esp,8
01261013  xor         eax,eax

Practically the same thing, just with a different constant. Let’s try a loop:

int main()
{
	RAND tRand;
	RandInit(tRand, 0);

	unsigned int nRandom;
	for(int i=0; i<5; i++)
	{
		nRandom = RandGetNum(tRand);
	}
	printf("%u\n", nRandom);
	return 0;
}

The compiler generates:

011B1006  push        9722BEA6h
011B100B  push        offset string "%u\n" (11B20F4h)
011B1010  call        dword ptr [__imp__printf (11B20A0h)]
011B1016  add         esp,8
011B1019  xor         eax,eax

The same thing! Wow. In fact, the Visual Studio compiler will follow the rabbit hole down as far as 10 times. After that, it gives up and starts to generate code. So, if we take i up to 11, then the compiler generates the following for the inside of the loop:

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; ecx:eax contains v
01271017  mov         edx,9069h   ; Move 36969 into edx
0127101C  mul         eax,edx     ; eax contains the low part of V, so do the multiply
                                  ; and store the result in edx:eax
0127101E  xor         edi,edi     ; Clear edi
01271020  add         eax,ecx     ; Add the high part of V to the multiplied low part
                                  ; of V
01271022  adc         edx,edi     ; edi is 0, so this adds the carry flag from the
                                  ; eax+ecx operation to edx
01271024  dec         esi         ; decrement loop counter
01271025  mov         ecx,edx     ; Move edx (our temp register) into ecx (the high
                                  ; part of V)
01271027  jne         main+17h (1271017h)  ; loop

Neat! Excluding the loop instructions (dec/jne), our RNG boils down to six instructions in x86. It’s worth noting that, although 64-bit operations are nontrivial in 32-bit assembly code, this algorithm is particularly well-suited for it, because it only operates on 32-bit pieces of the 64-bit number.

Of course, that’s really just a pleasant side-effect. At its core, this really is a 64-bit operation, so let’s compile the same code for x64 and see what happens:

; Comments added by me
; For reference: v = 36969 * (v & 0xFFFFFFFF) + (v >> 32);
; rax contains v
000000013FB91020  mov         edx,eax         ; Move the low part of V into edx (and
                                              ; thus into rdx)
000000013FB91022  shr         rax,20h         ; Shift rax right 32, making rax contain
                                              ; the high part of V
000000013FB91026  imul        rdx,rdx,9069h   ; Multiply the low part of V by 36969
000000013FB9102D  add         rax,rdx         ; Add the multiplied low part to the
                                              ; high part
000000013FB91030  dec         r8              ; decrement loop counter
000000013FB91033  jne         main+20h (13FB91020h)  ; loop

The 64-bit version is even cooler: just four instructions, each of which maps exactly to an operation in the formula.

If we ever had any doubt as to whether the MWC RNG is efficient, this definitely puts those doubts to rest. We’ve also seen how the compiler can track down constant operations even across looping constructs.

Next time: It gets better