Knowing Your Hardware ALU Shifter When Generating 64-bit Bit Masks
Yesterday I was very confused when one of the unit tests in a paper's project failed. Both the unit test and the code under test are so simple that no one would expect a failure to occur.
The code under test is a one-line macro for generating 64-bit masks (of type uint64_t) using a bit shift and a decrement, as shown below. The unit test simply enumerates all possible cases and checks the output. bit64_test() is another macro that checks whether the bit at a given offset is set. If the bit is set, it returns one; otherwise it returns zero.
Mask Generation:
#define MASK64_LOW_1(num) ((1UL << num) - 1)
Unit Test:
void test_mask() {
    for(int bits = 0; bits <= 64; bits++) {
        uint64_t value1 = MASK64_LOW_1(bits);
        for(int i = 0; i < 64; i++) {
            if(i < bits) assert(bit64_test(value1, i) == 1);
            else assert(bit64_test(value1, i) == 0);
        }
    }
    return;
}
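bit64_test() is not reproduced in this post. To make the test self-contained, a minimal version matching the description above (one if the bit at the given offset is set, zero otherwise) could look like this:

#define bit64_test(value, index) ((((value) >> (index)) & 0x1UL) ? 1 : 0)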
With the assistance of gdb, it took me another 20 seconds to figure out that the failing case occurs when the test variable bits is 64, meaning that all bits in the output uint64_t should be set to one.
This finding, however, puzzled me even more: How could this be a failure?
When the macro argument num equals 64, 1UL << num should output zero, since this piece of code is compiled on a 64-bit x86-64 architecture, where the native register size is 64 bits, and the compiler will map this macro to an shl (or sal, which is essentially the same, since left shifts do not deal with the sign bit) assembly instruction. Left shifting 0x1UL by 64 bits should just shift the only set bit out, which outputs zero.
To confirm my reasoning, I wrote another one-line test as follows:
Unit Test:
void test_mask2() {
    printf("0x%lX 0x%lX\n", 0x1UL << 64, (0x1UL << 64) - 1);
    return;
}
The output of this one-line test is 0x0 0xFFFFFFFFFFFFFFFF, as expected.
So the question is: why did the original macro fail, while the manual expansion passed the test?
After some investigation, here is the explanation: the Intel x86-64 architecture specifies that, for a 64-bit shift instruction, the explicit or implicit second operand, which is the number of bits to shift, is masked before being sent to the ALU. In other words, because the native word size is 64 bits, the ALU only handles shift amounts of up to 63, even though any 8-bit value (stored in CL implicitly, or given explicitly as an immediate) can be supplied. Bits 6 and 7 are always masked off before the ALU performs the shift.
In our case, this means that 0x1UL << num returns 0x1UL unchanged when num is 64, since the ALU only sees a shift amount of 0 after bit 6 of the value 64 is masked off.
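This is equivalent to saying that the shift count is reduced modulo 64 before the shift happens. Here is a small sketch of that model (my own illustration, not part of the project; the volatile is only there to keep gcc from folding the shift at compile time, and the hw line is, strictly speaking, undefined behavior in C):

#include <stdio.h>
#include <stdint.h>

int main() {
    volatile int num = 64;                  /* runtime value, so gcc emits an actual shl  */
    uint64_t hw    = 0x1UL << num;          /* undefined in C; x86-64 masks the count     */
    uint64_t model = 0x1UL << (num & 63);   /* explicit model of the 6-bit count masking  */
    printf("0x%lX 0x%lX\n", hw, model);     /* both print 0x1 on my machine               */
    return 0;
}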
Interestingly enough, a side note in the architecture specification states that in the original 16-bit 8086 design, the shift amount was not masked, meaning that the hardware shifter behaved more consistently when the shift amount exceeded the native word length. Starting with the 80286, the masking was added to the ALU to reduce instruction latency, and it is kept even when the hardware executes in virtual-8086 mode, which makes it a major source of incompatibility: any 8086-era software that generates bit masks assuming the older ALU shifter behavior will fail to execute correctly in some cases.
As for why the second test case gives the correct answer: when constant numbers are used in an expression, the compiler evaluates the constant expression at compile time. Unfortunately, gcc's constant evaluation is not programmed to match the hardware behavior here, although it does print a warning saying that the constant shift amount exceeds the bit width of the source operand:
test.c:104:33: warning: left shift count >= width of type [-Wshift-count-overflow]
printf("0x%lX 0x%lX\n", 0x1UL << 64, (0x1UL << 64) - 1);
^
test.c:104:47: warning: left shift count >= width of type [-Wshift-count-overflow]
printf("0x%lX 0x%lX\n", 0x1UL << 64, (0x1UL << 64) - 1);
To confirm that gcc's constant evaluation does not match the hardware behavior, I wrote a test program as follows:
#include <stdio.h>

int main() {
    unsigned long x = 0x1UL << 64;
    printf("%lX\n", x);
    return 0;
}
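To reproduce the disassembly shown below (the exact invocation is my guess; the listing matches objdump output in Intel syntax):

gcc test.c
objdump -d -M intel a.out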
The resulting disassembly code after compiling this with default arguments to gcc is as follows:
0000000000400526 <main>:
400526: 55 push rbp
400527: 48 89 e5 mov rbp,rsp
40052a: 48 83 ec 10 sub rsp,0x10
40052e: 48 c7 45 f8 00 00 00 mov QWORD PTR [rbp-0x8],0x0
400535: 00
400536: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
40053a: 48 89 c6 mov rsi,rax
40053d: bf e4 05 40 00 mov edi,0x4005e4
400542: b8 00 00 00 00 mov eax,0x0
400547: e8 b4 fe ff ff call 400400 <printf@plt>
40054c: b8 01 00 00 00 mov eax,0x1
400551: c9 leave
400552: c3 ret
As you can see, the value of 0x1UL << 64 has been evaluated at compile time, and the result is zero. [rbp-0x8] is just the stack location of the local variable x. A different result occurs if you change the above test code to the following:
#include <stdio.h>

int main() {
    int y = 64;
    unsigned long x = 0x1UL << y;
    printf("%lX\n", x);
    return 0;
}
The output of this code snippet is 1 instead of 0, with the disassembly being:
0000000000400526 <main>:
400526: 55 push rbp
400527: 48 89 e5 mov rbp,rsp
40052a: 48 83 ec 10 sub rsp,0x10
40052e: c7 45 f4 40 00 00 00 mov DWORD PTR [rbp-0xc],0x40
400535: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
400538: ba 01 00 00 00 mov edx,0x1
40053d: 89 c1 mov ecx,eax
40053f: 48 d3 e2 shl rdx,cl
400542: 48 89 d0 mov rax,rdx
400545: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
400549: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
40054d: 48 89 c6 mov rsi,rax
400550: bf f4 05 40 00 mov edi,0x4005f4
400555: b8 00 00 00 00 mov eax,0x0
40055a: e8 a1 fe ff ff call 400400 <printf@plt>
40055f: b8 01 00 00 00 mov eax,0x1
400564: c9 leave
400565: c3 ret
In the disassembly, DWORD PTR [rbp-0xc] is the stack location of the local variable y, which is initialized to 64 (0x40), while QWORD PTR [rbp-0x8] is the variable x. Before the shl instruction, gcc first moves the shift amount into the CL register and the shift target into RDX (although EDX is actually written, this is an optimization based on the specification that the upper 32 bits of an x86-64 register are cleared when the lower 32 bits are loaded with a new value). In this case, it is the hardware ALU, rather than gcc, that evaluates the expression. As expected, bits 6 and 7 of the shift amount, which is 64, are masked off, so the actual amount seen by the ALU is zero.
The eccentric behavior of gcc is not necessarily a bug. In fact, the C language specification clearly states that the behavior is undefined when the second operand of a shift operation is negative, or when it is greater than or equal to the width of the left operand's type. The compiler is therefore given full privilege to decide the outcome of such an operation.
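With that in mind, the portable fix is simply to never let the shift count reach 64. A minimal rewrite of the macro (my own suggestion, not taken from the paper's project):

#define MASK64_LOW_1(num) ((num) >= 64 ? ~0UL : ((1UL << (num)) - 1))

The conditional keeps the shift count in the well-defined 0-63 range, so the result no longer depends on what the compiler or the ALU shifter happens to do with an out-of-range count, and test_mask() passes for all 65 values of bits.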