Knowing Your Hardware ALU Shifter When Generating 64-bit Bit Masks
Yesterday I was very confused when one of the unit tests in a paper’s project failed. Both the unit test and the code to
be tested are extremely simple such that no one would expect a failure to occur.
The code to be tested is a one-line macro for generating 64-bit masks (of type uint64_t
) using bit shift and decrement,
as shown below. The unit test simply enumerates all possible cases, and checks the output. bit64_test()
is another
macro that checks whether a bit is set or not on a given offset. If the bit is set, it returns one. Otherwise it returns
zero.
Mask Genetation:
Unit Test:
With the assistance of gdb, it took me another 20 seconds to figure out that the failed case occurred when the value of
test variable bits
is 64, meaning that all bits in the output uint64_t
should be set to one.
This finding, however, puzzled me even more: How could this be a failure?
When the macro argument num
equals 64, 1UL << num
should output zero, since this piece of code is compiled on a
64-bit x86-64 architectire, where the native register size is 64 bits, and the compiler will map this macro to
an shl
(or sal
, which is essentially the same, since left shifts do not deal with sign bits)
assembly instruction. Left shifting 0x1UL
by 64 bits will just result in the only bit being shifted out, which
outputs zero.
To confirm my reasoning, I wrote another one line test as follows:
Unit Test:
The output of this one line test is 0x0 0xFFFFFFFFFFFFFFFF
as expected.
So the question is, why the original macro failed, but a manual expansion passed the test?
After some investigation, here is the explanation: Intel x86-64 architecture specifies that, for a 64 bit shift instruction,
the explicit or implicit second operand, which is the number of bits to be shifted, will be truncated before sending to
the ALU. In other words, due to the fact that the native word size is 64 bits, the ALU can handle a shift amount of as
many as 63, despite that any 8-bit value (stored in CL implicitly, or given as immediate number explicitly) can be supplied.
Bit 6 and 7 will always be masked off before the ALU performs the shift.
In our case, this translates to 0x1UL << num
outputting 0x1UL
unchanged when the value of num
is 64, since the
ALU will only see 0 after the 6th bit of value 64 is masked off.
Interesting enough, a side note in the architecture specification states that in the original 16-bit 8086 design, the shift amount is not masked, meaning that the hardware shifter bahaves more consistently when the shift amount exceeds the native word bit length. Starting from 80286, the masking feature is added to ALU to reduce instruction latency, and is kept even when the hardware executes in virtual-8086 mode, resulting in a major source of incompatibility. Any 8086 era software that generates bit masks assuming the older ALU shifter behavior will fail to execute correctly in some cases.
As for why the second test case gives the correct answer: When constant numbers are used in an expression, the compiler will always evaluate the constant expression at compilation time. Unfortunately, gcc’s constant evaluation is not properly programmed to match hardware behavior, although it does prints out a warning saying that the constant shift amount exceeds the bit length of the source operand:
To confirm that gcc constant evaluation does not match hardware behavior, I wrote a test program as follows:
The resulting disassembly code after compiling this with default arguments to gcc is as follows:
As you can see, the value of 0x1UL << 64
has been evaluated at compilation time, which is zero. [rbp-0x8]
is just the
stack location of local variable x
. A different result would occur, if you change the above test code to the following:
The output of this code snippet is 1
instead of 0
, with the disassembly being:
In the disassembly, DWORD PTR [rbp-0xc]
is the stack location of local variable y
, which is initialized to 64 (0x40),
while QWORD PTR [rbp-0x8]
is variable x
. Before the shl
instruction, gcc first moves the shift amount into CL
register
and the shift target into RDX
(Although EDX
is actually used, this is an optimization based on the specification that
higher 32 bits of a x86-64 register will be cleared when the lower 32 bits are loaded with a new value).
In this case, it is the hardware ALU, instead of gcc, that evaluates the expression. As expected, bit 6 and 7 of the
shift amount, which is 64, are masked off, resulting in the actual amount seen by the ALU being zero.
The eccentric behavior of gcc is not necessarily a bug. In fact, in C language specification, it is cleared stated that undefined behavior would occur, when the second operand of a shift operation is negative or when it exceeds the length of the type. The compiler is therefore given full priviledge of deciding the outcome of such an operation.