Cerebros: Evading the RPC Tax in Datacenters



Paper Title: Cerebros: Evading the RPC Tax in Datacenters
Link: https://dl.acm.org/doi/10.1145/3466752.3480055
Year: MICRO 2021
Keyword: RPC Tax; Accelerator; Cerebros




Highlights:

  1. RPCs are prevalent in microservice architectures, but their overhead is large relative to the typically simple functions they invoke.

  2. RPC overheads can be decomposed into three parts: Header parsing, function dispatching, and payload manipulation. The last part is usually the most expensive operation in RPC handling.

  3. An accelerator can be attached to the NIC to read incoming RPC messages (particularly headers), and overlap its operation with the receipt of the message.

Comments:

  1. How does the accelerator know how much buffer space to allocate? TCP has no field specifying the length of an application-level message, so the length must be embedded within the message itself. In that case, the accelerator must be able to read the first few packets (ideally just the first) before the rest arrive, in order to obtain the message size and allocate memory. I am not saying this is impossible; I am just saying that it can be more complicated than the paper suggests, and would require tight coupling between the accelerator and the NIC.

  2. Is the 4x compression ratio really a theoretical upper bound? For flat integer arrays it plausibly is. But for pointer-based data structures, can they be reliably deserialized without creating fragmentation (especially in the NIC’s on-board memory)?

  3. A malicious sender could exploit the buffer allocation scheme by continuously sending huge messages to monopolize the receiver’s buffer, causing other messages to be constantly NACK’ed. This could, of course, happen in any setting, but since this design reserves 4x the wire size per message, such a DoS is easier to mount.

  4. Are the function pointer table and the schema table part of the per-process context? In other words, are they virtualized? If yes, then different threads could register different RPC families, but context switches would incur extra overhead. If no, then the design is of little use in a cloud environment, which is the common case for microservices, because tenants would not be isolated from each other.

This paper proposes Cerebros, a hardware accelerator that offloads RPC handling from the CPU to a NIC-attached module. The paper is motivated by the high overhead of RPC handling and its poor fit for general-purpose cores, and it proposes to offload part of the RPC handling, i.e., header parsing, function dispatch, and argument deserialization, to the network layer behind an interface that relies on hardware-software interplay.

As the microservice architecture becomes a major software engineering trend, the efficiency of Remote Procedure Calls (RPCs) has reemerged as an important factor in the overall performance of microservices, since the architecture divides software into simple, standalone modules, each running in its own process address space as an independent unit, thereby achieving functional and failure isolation. Function calls between modules, which were once a single call instruction, are now replaced with far more heavyweight RPC invocations. The paper points out that the overhead of performing RPCs in datacenters, dubbed the “RPC tax”, can account for 40%–90% of the total execution cycles. The situation can only become worse, because: (1) Microservices can perform recursive RPC calls, so the RPC tax multiplies and propagates through all levels of the call chain; (2) Microservices usually implement simple functions to limit the complexity of a single module, whereas the RPC overhead is largely fixed and does not scale with the complexity of the function, so as functions shrink, the RPC tax will only continue to rise; (3) Existing hardware already optimizes the transport layer protocols, which further exposes RPC handling as the bottleneck in the total execution cycles; and (4) Datacenter applications already suffer from instruction supply problems due to binary bloat (a result of static library linking), and putting RPC handling on the control path only aggravates this issue, since RPC introduces several more software layers.

The paper further conducts experiments to obtain a detailed decomposition of the overhead. It identifies three major tasks in the RPC handling process (other than executing the function itself, which the paper does not attempt to optimize): Header parsing, function dispatching, and payload manipulation. Header parsing refers to identifying the function type, message type, etc. Judging from the figure presented in the paper, this part incurs only minor overhead. Function dispatching refers to resolving the function ID or the string name of the function to the actual function pointer on the host machine. Although this is likely no more than a single table lookup, the paper notes that the final function call is an indirect jump that is hard to both prefetch and predict, which can also cause significant overhead at the microarchitectural level. In the last step, the function (or the function handler) reads the payload of the RPC message and deserializes the function arguments, building the in-memory objects that are passed as arguments. According to the figure, this stage incurs the most overhead, yet it consists mostly of simple data movement, integer sign extension, etc., which is ill-suited for the CPU and can be conveniently offloaded to specialized hardware.
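To make this decomposition concrete, here is a minimal software sketch of the three stages, assuming a toy wire format (an 8-byte header carrying the function ID and payload length, followed by struct-packed integer arguments); the table, format, and names are illustrative, not the paper's.

```python
import struct

# Hypothetical registration table: function ID -> (handler, argument schema).
# The toy wire format below is illustrative and not the paper's format.
RPC_TABLE = {
    1: (lambda a, b: a + b, "ii"),  # function 1 takes two 32-bit ints
}

def handle_rpc(message: bytes):
    # (1) Header parsing: an 8-byte header of (function ID, payload length).
    func_id, payload_len = struct.unpack_from("<II", message, 0)
    payload = message[8:8 + payload_len]

    # (2) Function dispatch: resolve the ID to a handler and its schema.
    #     On a CPU this ends in a hard-to-predict indirect call.
    handler, schema = RPC_TABLE[func_id]

    # (3) Payload manipulation: deserialize the wire format into in-memory
    #     argument objects (data movement, sign extension, ...), then invoke.
    args = struct.unpack("<" + schema, payload)
    return handler(*args)

# Example: invoke function 1 with arguments (2, 3).
msg = struct.pack("<II", 1, 8) + struct.pack("<ii", 2, 3)
print(handle_rpc(msg))  # prints 5
```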

Although not pointed out explicitly, there are a few challenges in implementing such an accelerator. First, the accelerator must be able to complete the whole process, from header parsing to payload manipulation, without involving the CPU. One obvious but sub-optimal design is to let hardware only parse the header, extract the function ID, and pass the ID to the CPU; the CPU then locates the function and invokes the accelerator a second time for payload manipulation. The paper points out that this design, despite the simplicity of being stateless, is not particularly useful, because the CPU is still involved in the handling process, and the communication cost of two rounds of message exchange can easily cripple performance. The second challenge is deciding whether the accelerator should sit on the NIC side or the CPU side. Although the latter may seem more scalable, it incurs a large silicon overhead, since each core must be equipped with an instance of the design. Besides, a centralized accelerator on the NIC side also enables CPU affinity policies, i.e., RPC calls may be dispatched only to certain cores to maintain better locality, which is impossible with a distributed design.

We now describe the baseline hardware on which the accelerator is implemented. The accelerator is co-located with a NIC that already implements a hardware version of the transport layer protocol: TCP packets are ACK’ed, sequenced, and reassembled by the NIC alone, without invoking the OS-level protocol stack. The NIC is assumed to have a large on-board memory, also addressable by the CPU, for buffering packets that have not yet been assembled. To send or receive messages, CPU threads register Queue Pairs (QPs) with the NIC, and the NIC places the addresses of completed messages in the queue for the CPU to fetch. The CPU polls the QP for newly arrived messages and enqueues outgoing messages into the QP for sending.
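The queue-pair interface can be pictured roughly as follows; the class and method names are my own shorthand for illustration, not the NIC's actual API, and the real queues live in NIC/host memory rather than Python objects.

```python
from collections import deque

# A minimal model of the queue-pair (QP) interface described above.
class QueuePair:
    def __init__(self):
        self.recv_queue = deque()  # completed inbound messages (NIC -> CPU)
        self.send_queue = deque()  # outbound messages (CPU -> NIC)

    # Called by the NIC once a message has been fully reassembled in its
    # on-board memory; only the address (and length) is handed to the CPU.
    def post_completion(self, buffer_addr, length):
        self.recv_queue.append((buffer_addr, length))

    # Called by a CPU thread that polls for newly arrived messages.
    def poll(self):
        return self.recv_queue.popleft() if self.recv_queue else None

    # Called by a CPU thread to enqueue an outgoing message for the NIC.
    def post_send(self, buffer_addr, length):
        self.send_queue.append((buffer_addr, length))

qp = QueuePair()
qp.post_completion(0x1000, 256)  # NIC side: message fully reassembled
print(qp.poll())                 # CPU side: (4096, 256)
```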

With the RPC accelerator, the NIC must reserve space not only for transport layer packets, but also for deserialized data. The paper suggests that the accelerator read the message header as soon as it is received and reserve memory that is 4x the size of the serialized payload, leveraging the observation that the wire format is at most 4x smaller than the original, uncompressed data. If the buffer allocation fails, the message is NACK’ed, indicating that resources have run out on the receiving end, and it is up to higher-level policy to handle the NACK. Then, as the rest of the message is being transferred, the accelerator loads the function pointer to be called and the argument schema for deserializing the payload. Both pieces of information are stored in a lookup table, which is initialized by the RPC library. The schema describes how to reconstruct, from the wire format, the in-memory objects that will be used as function arguments; the paper does not specify any particular schema or wire format (so presumably a general one can be used). This lookup can be overlapped with the receipt of the rest of the message, and hence adds no latency. After the message is fully received, the accelerator deserializes the payload into the buffer just allocated, driving its internal state machine with the schema it loaded.
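The receive path might look roughly like the sketch below. The buffer pool, registration table, and schema representation are assumptions made for illustration; the paper does not prescribe a concrete allocator or wire format.

```python
from dataclasses import dataclass

EXPANSION_FACTOR = 4  # wire format is at most 4x smaller than in-memory data

@dataclass
class PendingRpc:
    buf: bytearray   # pre-allocated space for the deserialized arguments
    handler: object  # function pointer looked up from the table
    schema: str      # argument schema registered by the RPC library

class BufferPool:
    def __init__(self, capacity: int):
        self.free = capacity

    def try_alloc(self, size: int):
        if size > self.free:
            return None  # allocation failure -> message will be NACK'ed
        self.free -= size
        return bytearray(size)

def on_header(func_id, payload_len, pool, rpc_table):
    # Reserve memory for the deserialized arguments: 4x the serialized size.
    buf = pool.try_alloc(EXPANSION_FACTOR * payload_len)
    if buf is None:
        return None  # NACK: the receiving end has run out of buffer space

    # Overlapped with the receipt of the remaining packets: look up the
    # function pointer and the argument schema for this function ID.
    handler, schema = rpc_table[func_id]
    return PendingRpc(buf, handler, schema)

# Example: a 64 KiB pool, one registered function, a 1 KiB serialized payload.
pool = BufferPool(64 * 1024)
pending = on_header(1, 1024, pool, {1: (print, "ii")})
print(pending is not None)  # True: 4 KiB reserved, handler and schema loaded
```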

After the payload is deserialized, the accelerator dispatches the RPC request, together with the function pointer and a pointer to the argument storage, to a core for execution. The paper suggests that requests can have affinity to a certain core: if a request for the same function has been scheduled on that core recently, dispatching the new request there maximizes locality benefits. To track per-function core affinity, the accelerator maintains a CAM in which each core has an entry storing the ID of the function most recently dispatched to it. If an incoming function ID matches any entry, the request is dispatched to the corresponding core; otherwise, cores are selected in a round-robin manner to spread the load evenly.
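A rough model of this dispatch policy is sketched below; the CAM is modeled as a per-core array of most-recently-dispatched function IDs, and all names here are my own illustration rather than the paper's hardware.

```python
class Dispatcher:
    def __init__(self, num_cores: int):
        self.last_func = [None] * num_cores  # CAM: one entry per core
        self.rr_next = 0                     # round-robin fallback pointer

    def select_core(self, func_id):
        # Affinity hit: the same function ran on this core recently, so its
        # instructions and data are likely still warm.
        for core, last in enumerate(self.last_func):
            if last == func_id:
                return core
        # Miss: pick the next core round-robin to spread the load evenly,
        # and record the function ID as this core's most recent one.
        core = self.rr_next
        self.rr_next = (self.rr_next + 1) % len(self.last_func)
        self.last_func[core] = func_id
        return core

d = Dispatcher(num_cores=4)
print(d.select_core(7))  # 0 (miss, round-robin)
print(d.select_core(7))  # 0 (affinity hit)
print(d.select_core(9))  # 1 (miss, round-robin advanced)
```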