STRAIGHT: Hazardless Processor Architecture Without Register Renaming

- Hide Paper Summary

Paper Title: STRAIGHT: Hazardless Processor Architecture Without Register Renaming
Link: https://ieeexplore.ieee.org/document/8574536
Year: MICRO 2018
Keyword: Register renaming; Straight; Microarchitecture

Back

Highlights:

The introduction section actually covers two different implementations of register renaming, which is good because most lectures (including some books) do not make such distinction.
The design is similar to data-flow machine, in a sense that instructions encode dependencies directly w.r.t. other instructions rather than using the abstract “architectural register name” as data-flow medium. The difference is that there is still a strict program order defined and most part of the backend and fetching unit does not change. Also programming is easier than data-flow machine.
Similar to TCAM-based renaming, this scheme also combines register allocation with assignment.

Questions:

The SP register can only be incremented/decremented by a constant amount which is quite over-restrictive, because in practice we may need to allocate a variably sized array on the stack
I don’t like the concept that the return address needs to be saved to the stack before entering a loop simply because the dynamic distance from the return inst. to the return address register is unknown in advance. Maybe this architecture can be extended as a hybrid one: A small set of registers is provided to hold values that are allowed to be updated in-place. No renaming is needed for these scratch registers since their usages are relative scarce compared with data-flow style value passing.
This paper did not state very clearly what assumptions are made to the RISC OOO superscalar pipeline. When I was reading I just assumed it was Tomasulo’s Algorithm and was confused trying to understand why the MAX_RP is computed that way.
Instructions that do not provide a value is also allocated a physical register for simplicity of inferring RP value for ROB entries. This may allocate more physical registers than actually needed. I can see this not being a major problem because most instructions do produce a value (except stores).

This paper proposes Straight, a novel superscalar microarchitecture implementing out-of-order (OOO) execution without register renaming. In ordinary architecture, in order to schedule instructions that have anti-dependencies out of the program order, registers must be renamed to avoid accidental overwriting of register contents. For example, if instruction I1 reads from register X, while a later instruction in the program order I2 writes register X, if I2 is scheduled before I1, then when I1 executes, it will read a value produced by I2, which violates the invariant that single threaded execution must be equivalent to serial execution.

Regular register renaming schemes usually use a mapping table implemented as RAM or TCAM. There are two ways of designing the mapping table. In the first scheme, the mapping table is a RAM, in which every entry corresponds to an architectural register, and the content of the entry is the name of the physical register it maps to. Every time a new instruction passing the decode stage overwrites an architectural register, a new physical register is allocated, and the entry is updated to the new physical register. When an architectural register is used as source operands, the mapping table is accessed and the current name of the physical register is returned. To manage physical registers, an extra free list is also added to allocate unused physical registers. Physical registers are only released when no future instruction could be reading it, using either reference counting, or epoch-based mechanism (i.e. the register could be freed after the instruction that caused its being overwritten in the mapping table commits). The second scheme of register renaming uses a TCAM whose size equals the number of physical registers. Each physical register corresponds to an entry in the TCAM. TCAM entries store the current architectural register that uses this physical register. No free list is maintained, because free physical registers can be obtained by acquiring a physical register not allocated to any architectural register. On instruction decoding, the name of the architectural register is used to query the TCAM, which obtains the name of the physical register it is allocated with. On a register write instruction, a free entry is obtained from the TCAM, and then the name of the architectural register is stored into the free entry. Note that physical registers cannot be freed immediately after they are unmapped. The register recycling rule is identical to the rule of RAM-based design.

Both renaming schemes need to consider the case when a speculation path fails and have to be rolled back. In this case, simply nullifying their instruction buffers in the pipeline is insufficient, because architectural state changes such as allocations in the mapping table should also be rolled back. With the first RAM-based design, the paper suggests that a reorder buffer (ROB) walker should be invoked and undo all changes made by the speculation path (the ROB entry should record the before-image of the entry it affects for such undo operations). In the second TCAM-based design, since the entire TCAM can be snapshot in a single cycle quite efficiently, and that there is no extra data structure to maintain, the processor may take a snapshot on every point of speculation, and just restore the snapshot back to the mapping table. In this scheme, it takes constant time to restore the snapshot, but as a trade-off, the storage required to store these snapshots grows linearly with the maxmimum allowed depth of speculation.

Based on the above two possibilities of implementing register renaming, this paper identifies three potential problems of the current scheme. First, in order to keep the OOO backend busy, it is expected that the frontend issue several instructions into the instruction window every cycle. This, however, means that the mapping table needs to be accessed and updated by multiple instructions in the same cycle, implying a multi-ported design whose complexity grows exponentially, limiting the issue width of the pipeline. In addition, if these instructions have dependencies, the updates they made to the mapping table should be as if they were performed sequentially. This requires extra priority handling circult. For example, when if two instructions need renaming, they cannot be allocated the same physical register. The second problem is recovery cost. On a RAM-based design, recoverying from a mis-speculation takes time proportional to the size of the mis-speculated path, because instructions are undone one-by-one with a ROB walker. On a TCAM-based design, one snapshot is taken on each speculation, which requires significant on-chip storage to buffer these snapshots, given that the number of physical registers are generally large on modern platform. The last problem is that the mapping table is accessed in every cycle where a register is accessed. This contributes not only to total power consumption, but also makes frequency scaling more difficult.

To solve the problem of register renaming, instead of using fixed name for registers (which are abstracted away by hardware anyway), Straight proposes that instructions explicitly specify source operands using an offset from the current instruction back to the producer instruction the dynamic instruction flow. Two major changes are introduced into the ISA. First, instructions no longer specify a destination register, because the “destination” of the instruction is encoded by the position of the instruction in the ROB. Second, instructions refer to the result produced by other instructions using a relative offset in the dynamic flow, which can only be determined at runtime. The second property complicates code generation, since in some cases there can be producer instructions in different execution paths.

Straight assumes a RISC-like OOO superscalar pipeling model. Instructions are fetched and decoded in-order, which are then pushed into the reorder buffer (ROB) in program order. Operands are only read from the physical register file at the cycle the instruction is issued, and results are written back into the register file at the last stage (retirement stage). In particular, it seems that Straight does not assume a Tomasulo execution model in which completed instructions directly forward its result to waiting instructions in the window (so an instruction can only have its operands ready when all dependent instructions retire). Instructions can retire when they are at the head of the ROB.

Straight works as follows. At decoding stage, instructions are assigned physical registers beased on source operands. The decoder maintains a register pointer (RP), which points to the next physical register to be allocated. Since most instructions will produce a value, every instruction is allocated a register by incrementing the RP when it is decoded. The physical register ID is the value of the RP when the instruction is decoded (although the paper did not specify what the physical register value is if the instruction does not produce a value, such as stores). This way, operand reading is an easy task, because the physical register of previous instruction in the dynamic flow can be inferred from the offset. To elaborate: If the current instruction’s physical register ID is k, and the operand offset is x (which means that the operand is produced by the x-th instruction counting backwards from the current instruction in program order), then the source operand is from physical register (k - x), allowing wrap-around. One notable advantage of this design is that the backend of the pipeline is unchanged at all. All renamings happen at decoder stage, and after that the source operand of instructions are represented by the physical ID number.

All general-purpose registers except the stack pointer are renamed in the method described above. The stack pointer (SP), on the other hand, can be updated in-place by SPADD instruction. The SPADD instruction is executed when it is decoded (but still we put it into the ROB for recovery from mis-speculation), and every instruction’s ROB entry contains the SP value when the instruction is decoded. The SP value, together with the PC and RP stored in ROB entries, are used to restore the context on a mis-speculation.

When a mis-speculation is detected, the first instruction in the mis-speculated path is located. The ROB entry is used to restore the state. In particular, we copy the PC, SP and RP register value back to the corresponding registers, and then flush the pipeline. This process only takes constant time, and is extremely simple to implement. In practice, to reduce ROB storage overhead, two optimizations can be applied. First, the RP does not need to be stored for every ROB entry, since this pointer will be incremented for every instruction. The ROB only needs to maintain the RP of the first entry, and the rest can be inferred from the entry index. Second, the paper observes that SP is only modified at the beginning and the end of a function as stack frame allocation/deallocation (which is not always true but the paper does not talk about how to realize common constructs such as variable sized stack array). Most SP values in the ROB, therefore, are identical. To leverage such value locality, the paper proposes using a TCAM-based small mapping table which only stores SP changes, i.e. maps ROB entry index to an SP value only if the ROB entry updates the SP. For ROB entries not in this table, their SP values can be inferred from the table by an inexact search (which returns the most recent ROB entry that modifies SP).

RP wraps around when all physical registers have been allocated. The value of RP begins from zero and will be incremented for every new instruction, which introduces aliasing. Since the size of ROB is fixed, alisaing can be harmless if the re-use of the same register is larger than the threshold. Imagine the case in which aliasing can be harmful: instruction i produces a value which is consumed by instruction j, and a later instruction k re-uses the same register as instruction i, and has no dependency to both i and j. In this case, instruction j could read the wrong value if instruction i is scheduled first, and then instruction k is scheduled. Since i and k use the same destination register, when j executes, it reads the value produced by k, rather than i. In the worst case, i is at the head of ROB, and j is at some distance away (x instructions), while k is at the end of the ROB. After i has been executed, to avoid instruction k from being scheduled, the next (x + ROB size) must not contain instruction k, because otherwise it is always possible that k is scheduled before j (recall that instructions only access operands the same cycle they are scheduled). The size of ROB, therefore, can only be at most the number of physical registers minus the maximum possible distance between a consumer and producer.

To solve the problem of consumer and producer being two far away from each other, Straight introduces RMOV instruction which does nothing but only moves operands between registers. The RMOV instruction reads from its only source operand and then writes into the destination, which serves as a “relay” of value. The RMOV instruction is also useful when two basic blocks merge. If a variable is written in both blocks, and a third instruction reads from the variable, it is unclear what is the distance from the third instruction to the producer instruction, because the dynamic instruction flow from the two paths may result in different distances. If such a case is identified, during code genetation, the compiler will insert RMOV instruction at the end of the basic block, such that after a merge, the live variables (i.e. registers that are still referenced) will be at a fixed distance from the consumer instruction in the after-merge block.