ShortCut: Architectural Support for Fast Object Access in Scripting Languages



Paper Title: ShortCut: Architectural Support for Fast Object Access in Scripting Languages
Link: https://dl.acm.org/doi/10.1145/3140659.3080237
Year: ISCA 2017
Keyword: ShortCut; Virtual Function; BTB




Highlights:

  1. Virtual function resolution can be accelerated using the same branch target buffer (BTB) mechanism as regular branch target prediction, as long as the type also exhibits locality at an access site. Just provide the BTB with the address of the actual function to be called, rather than the address contained in the operand, and let the core pipeline’s speculation mechanism handle the rest.

  2. In order to map virtual function calls to the resolved function address, a hardware cache (the ICTable) is needed for fast speculation validation. The ICTable maps the PC of an access site and an object type ID to the actual function’s address.

  3. ShortCut allows software handlers to add recent function resolution results to the backend ICTable, which in turn provides feedback to the frontend BTB prediction mechanism.

Comments:

  1. I do not fully understand what “BTB miss” means. Could the BTB even miss? It is usually just a table indexed by the hashed PC plus some extra statistics. If the BTB misses, do you simply assume the branch falls through? Do you stall the pipeline until the branch is resolved? Or does the fetch unit use whatever output the BTB gives it? The paper should clarify this.

  2. In fact, I don’t think a BTB miss should be a special case. The so-called BTB miss should just be a wrong prediction, since the current PC aliases with another branch’s PC and hence shares the same BTB entry; it would simply be handled as an incorrect prediction.

  3. How many operands do IC_Load and IC_Store have (implicit + explicit)? I think you need the dispatcher address, the object type, the object base address, and the src/dst register. This is likely fine for most architectures, but it may also be implemented as a sequence of uops rather than a single operation (one for address generation of the dispatcher, one for address generation of the object pointer, one for accessing the ICTable, and one for the memory operation). What are the costs of these uops?

  4. IC_Load and IC_Store do not need to use the BTB, and do not need to provide any feedback to it. The paper over-complicates the issue by saying that both instructions use the BTB and feed it back the address of the next instruction.

This paper proposes ShortCut, a hardware mechanism for accelerating virtual function invocations in scripting languages. The paper observes that scripting languages, which are often Just-In-Time (JIT) compiled, pay a huge overhead for virtual function invocations: these languages resolve function invocations dynamically, meaning the actual function called on an object pointer depends on the runtime type of the object and cannot be determined statically at compilation time. As a solution, the JIT compiler generates dispatch functions that call the corresponding implementation in the runtime based on the type of the object. According to the paper, dispatch functions are a major source of slowdown, accounting for around 22% of the instructions executed in the applications studied.

The paper introduces two kinds of dispatch functions: the global dispatch function, which handles function invocations at all access sites, and in-line dispatchers, each of which handles a single site. The global dispatch function maps an object type and access site to the corresponding function implementation (functions at the same access site are polymorphic and share the same symbol name). The mapping is stored globally in a software mapping table, which is queried every time a function is called at an access site. To reduce this overhead, the JIT compiler may also generate in-line dispatch functions for individual access sites that cache the results of previous queries for that particular call site. The paper shows three possible implementations of in-line dispatch functions. The first type, the inline dispatcher, is essentially a series of “if” statements where each clause handles one type (a sketch is given below). At the end of the “if” chain, a special function is called to inform the JIT compiler that the type cannot be matched by the current dispatcher; the JIT compiler then invokes the global dispatcher to find the mapping and re-generates the entire function together with the in-line dispatcher.
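To make the structure concrete, here is a minimal C++ sketch of what a JIT-emitted inline dispatcher might look like; the struct, helper names, and type IDs are hypothetical and only illustrate the if-chain-plus-fallback shape described above.

```cpp
#include <cstdint>

// Illustrative sketch only: the struct, helper names, and type IDs below are
// hypothetical, not taken from the paper or from any real JIT.
struct Object { uint32_t type_id; /* ... fields ... */ };
using Method = void (*)(Object*);

Method global_dispatch(uint32_t site_id, uint32_t type_id);      // query the global mapping table
void   regenerate_function(uint32_t site_id, uint32_t type_id);  // JIT recompiles the whole function

// An inline dispatcher baked into the compiled function for one access site:
// a chain of type tests, each calling the implementation cached so far.
void call_site_42(Object* obj) {
    extern void impl_for_type_7(Object*);
    extern void impl_for_type_13(Object*);
    if (obj->type_id == 7)  { impl_for_type_7(obj);  return; }
    if (obj->type_id == 13) { impl_for_type_13(obj); return; }
    // No clause matched: tell the JIT, which queries the global dispatcher and
    // regenerates the entire function with an extended "if" chain.
    regenerate_function(/*site_id=*/42, obj->type_id);
    global_dispatch(42, obj->type_id)(obj);
}
```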

The second type is the custom dispatcher, an individual function that is called at the access site. The internals of the dispatcher function are identical to those of the in-line dispatcher; the difference is that when the custom dispatcher cannot match the type, only the dispatcher needs to be re-generated by the JIT compiler, which has substantially lower overhead than re-generating the entire function.

The last type is the shared dispatcher, which uses one shared dispatching function for all access sites but still maintains per-site tables that cache previous results of global dispatcher queries. On a function invocation, the shared dispatcher first locates the table for the access site and then queries this per-site table for a match. If no match is found, the global dispatcher is called, and the result of the query is inserted into the per-site table for later use.
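A minimal sketch of the shared-dispatcher idea follows, assuming a software-managed per-site cache; the data structures and names are hypothetical and only illustrate the two-level lookup (site table, then type) described above.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of a shared dispatcher: one piece of dispatch code for
// all access sites, plus a small per-site cache of (type_id -> method).
struct Object { uint32_t type_id; };
using Method = void (*)(Object*);

Method global_dispatch(uint32_t site_id, uint32_t type_id);   // slow global lookup

// One cache table per access site, keyed by type ID.
std::unordered_map<uint32_t, std::unordered_map<uint32_t, Method>> per_site_cache;

void shared_dispatch(uint32_t site_id, Object* obj) {
    auto& table = per_site_cache[site_id];          // locate this site's table
    auto it = table.find(obj->type_id);
    if (it == table.end()) {
        // Miss in the per-site table: fall back to the global dispatcher and
        // cache the result for later invocations at this site.
        Method m = global_dispatch(site_id, obj->type_id);
        it = table.emplace(obj->type_id, m).first;
    }
    it->second(obj);
}
```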

ShortCut accelerates inline dispatchers by leveraging the existing Branch Target Buffer (BTB) and speculative execution mechanism. In deeply pipelined architectures, the branch target address cannot be obtained in the early stages of the pipeline, which is problematic for pipelined execution, since instruction fetch would otherwise have to assume “not taken”. To alleviate the problem, a BTB is added to the frontend to map the PC of a branch instruction to its target address; the BTB is trained with non-speculative results from past executions. Execution after a BTB prediction remains speculative until a later stage of the pipeline where the target address is resolved. The speculation is then either validated, if the actual address matches the predicted address, or rolled back, if the addresses mismatch.
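As a reference point, here is a toy C++ model of the predict/update behavior of a direct-mapped BTB; the size, indexing, and tag scheme are simplifying assumptions for illustration, not details from the paper.

```cpp
#include <cstdint>

// Toy model of a direct-mapped BTB; real designs differ in size, indexing,
// and tag handling.
struct BTBEntry { uint64_t tag = 0; uint64_t target = 0; bool valid = false; };

struct BTB {
    static constexpr uint64_t kEntries = 4096;
    BTBEntry table[kEntries];

    // Frontend: predict a target for the instruction at `pc`, if one is cached.
    bool predict(uint64_t pc, uint64_t& target) const {
        const BTBEntry& e = table[pc % kEntries];
        if (e.valid && e.tag == pc) { target = e.target; return true; }
        return false;   // no prediction available (the "BTB miss" case)
    }

    // Backend: once a branch resolves, train the BTB with the real target.
    void update(uint64_t pc, uint64_t resolved_target) {
        table[pc % kEntries] = {pc, resolved_target, true};
    }
};
```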

ShortCut extends the BTB prediction mechanism with a special virtual function invocation instruction, IC_Call, which is semantically equivalent to a dispatched function call but takes the dispatcher function’s address and the object type ID as operands. The instruction functionally serves as a branch whose target is also predicted from the BTB, but the target it expects is the address of the actual function to be called after dispatching, rather than the dispatcher’s address (which is in the operand). ShortCut also adds a small fully-associative cache, the ICTable, in the later stages of the pipeline, which stores the most recent dispatching results. Each ICTable entry consists of three fields: the PC of the IC_Call instruction, the type of the object, and the address of the resolved virtual function. Dispatcher functions update the ICTable with the result of dispatching using the ShortCut ISA.
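The following sketch models the ICTable as described here: a small fully-associative cache keyed by (IC_Call PC, type ID) that is filled by the software dispatcher. The entry count, lookup loop, and lack of a replacement policy are assumptions made only for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of an ICTable entry: (PC of the IC_Call, object type) -> resolved
// virtual function address.
struct ICTableEntry {
    uint64_t call_pc;     // PC of the IC_Call instruction
    uint32_t type_id;     // dynamic type of the receiver object
    uint64_t func_addr;   // address of the resolved virtual function
};

struct ICTable {
    std::vector<ICTableEntry> entries;   // small, fully-associative structure

    std::optional<uint64_t> lookup(uint64_t call_pc, uint32_t type_id) const {
        for (const auto& e : entries)
            if (e.call_pc == call_pc && e.type_id == type_id)
                return e.func_addr;
        return std::nullopt;             // miss: the dispatcher must be invoked
    }

    // Filled by the software dispatcher (via the ShortCut ISA) after resolving.
    void insert(uint64_t call_pc, uint32_t type_id, uint64_t func_addr) {
        entries.push_back({call_pc, type_id, func_addr});   // no eviction modeled
    }
};
```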

We next describe the operational details of ShortCut. When an IC_Call instruction is fetched, its target is predicted like that of a regular branch instruction using the BTB, and execution turns speculative after the prediction. At a later stage of the pipeline, once the instruction is fully decoded and its two operands (the dispatcher function’s address and the type ID) are available, the speculation is validated by querying the ICTable with the PC of the IC_Call instruction and the type ID. If the query hits, the correct address is retrieved from the ICTable entry and compared with the BTB output. If the two addresses match, speculation is successfully validated, and the instructions in the speculation window can be committed. Otherwise, speculation fails, and execution is rolled back to the IC_Call instruction and redirected to the correct address from the ICTable. In both cases, the BTB is also updated with the correct function address to complete the feedback loop.
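The validation step can be summarized with the following pseudo-logic, reusing the hypothetical ICTable and BTB sketches above; the outcome names and handler interface are illustrative, not the paper’s.

```cpp
#include <cstdint>

// Outcome of validating one IC_Call, assuming the ICTable and BTB types above.
enum class Outcome { Commit, Squash, CallDispatcher };

Outcome validate_ic_call(uint64_t call_pc, uint32_t type_id,
                         uint64_t predicted_target,
                         ICTable& ictable, BTB& btb,
                         uint64_t& redirect_target) {
    auto correct = ictable.lookup(call_pc, type_id);
    if (!correct) {
        // ICTable miss: divert to the dispatcher (address held in an operand),
        // which resolves the call and fills the ICTable; validation reruns after.
        return Outcome::CallDispatcher;
    }
    btb.update(call_pc, *correct);        // close the feedback loop in both cases
    if (*correct == predicted_target)
        return Outcome::Commit;           // speculation was correct
    redirect_target = *correct;           // squash and refetch from the right target
    return Outcome::Squash;
}
```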

If, however, validation fails because no matching entry is found in the ICTable, the control flow is diverted to the dispatcher function (whose address is one of the two operands), which resolves the virtual function invocation and adds the entry to the ICTable. After the dispatcher function returns, the validation of the speculation proceeds as described in the previous paragraph.

The paper further proposes a more aggressive version of ShortCut that can remove the virtual function invocation entirely. This is motivated by the fact that some virtual functions only perform simple get or set operations that read or write a single memory location. Performing address resolution and branching for these functions incurs a large overhead, while the work they perform is so small that it can be encoded in a single instruction. The paper therefore proposes that certain IC_Calls to simple get or set functions be replaced by IC_Load and IC_Store instructions, which bypass the function invocation and perform the read or write in place. To accommodate these two new instructions, the ICTable is extended with an extra “simple” field indicating whether the entry represents a function invocation or a simple operation. The function address field of the ICTable entry is also reused as the offset of the read or write relative to the object’s base address. The software dispatcher is responsible for initializing an entry as simple, if the resolved function contains just a simple load or store, by setting the “simple” bit and writing the offset of the memory operation into the ICTable.
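A possible layout of the extended ICTable entry is sketched below, with the address field reused as a field offset when the “simple” bit is set; the exact encoding is an assumption, not taken from the paper.

```cpp
#include <cstdint>

// Extended ICTable entry for the more aggressive ShortCut variant: the address
// field doubles as a load/store offset when the "simple" bit is set.
struct ICTableEntryEx {
    uint64_t call_pc;        // PC of the IC_Call / IC_Load / IC_Store
    uint32_t type_id;        // dynamic object type
    bool     simple;         // true: entry encodes a plain field access
    union {
        uint64_t func_addr;  // simple == false: resolved function address
        uint64_t offset;     // simple == true: field offset from object base
    };
};
```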

On execution of IC_Load and IC_Store, the BTB is not used for address prediction, and execution always continues at the next instruction. The core pipeline treats both instructions as memory instructions, and the effective address of the memory operation is generated in later stages of the pipeline from the object’s base address, which is one of the operands, and the offset field of the ICTable entry. No speculation is needed, because both instructions effectively act as non-branching memory operations whose address can be determined non-speculatively. If the ICTable misses, the dispatcher function, whose address is one of the operands, is called; it performs the actual resolution and inserts the result into the ICTable before execution resumes.
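Assuming the extended entry sketched above, the hit-path semantics of IC_Load and IC_Store reduce to a plain base-plus-offset memory access, as in this illustrative sketch (the miss path, which calls the dispatcher operand, is omitted).

```cpp
#include <cstdint>
#include <cstring>

// Illustrative hit-path semantics only; ICTableEntryEx is the hypothetical
// extended entry sketched above, and a 64-bit access width is assumed.
uint64_t ic_load(const uint8_t* obj_base, const ICTableEntryEx& entry) {
    uint64_t value;
    std::memcpy(&value, obj_base + entry.offset, sizeof(value));  // plain load at base + offset
    return value;
}

void ic_store(uint8_t* obj_base, const ICTableEntryEx& entry, uint64_t value) {
    std::memcpy(obj_base + entry.offset, &value, sizeof(value));  // plain store at base + offset
}
```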