SAND: Towards High-Performance Serverless Computing



Paper Title: SAND: Towards High-Performance Serverless Computing
Link: https://www.usenix.org/conference/atc18/presentation/akkus
Venue: USENIX ATC 2018
Keyword: Container; Serverless; SAND




Highlights:

  1. Function chaining is an important use case of serverless, which can be optimized in two ways. First, functions within the same application need not be strongly isolated, since they have to collaborate anyway; they can therefore run in the same container with only process-level isolation.

  2. Second, chained functions can be called locally as long as the function is deployed on the same worker node.

  3. Locally called function chains may not preserve atomicity, because the call information lives solely within a single node and is lost on a crash. This is solved by sending each invocation message to both the local and the global queue: even if the local queue's contents are lost in a crash, the global queue still holds the full information needed to complete the call chain.

Comments

  1. How does SAND precisely provide “exactly once” semantics for function chains? I guess either the global queue is used as a serialization point, or there is some versioning mechanism to ensure that each instance of a function chain is executed exactly once.

  2. The paper should explicitly point out that, for the crash recovery mechanism to work, each instance must be provided with enough information about the topology of the remaining function chain (this may be derivable implicitly from the arguments themselves, but that is not always true).

This paper proposes SAND, a serverless framework that reduces cold start latency and is specifically optimized for function chaining. The paper is motivated by the fact that existing serverless platforms suffer from cold start latency, and the common mitigation of keeping warm instances significantly increases resource consumption. Besides, function calls originating inside an application, namely chained function calls, are handled no differently from external requests, which incurs unnecessary performance overhead. SAND addresses these two issues with application-level sandboxing and a local message bus, respectively.

The paper observes that existing service providers use either containers or virtual machines to isolate function instances. In the simplest case, each function instance maps to one virtualized instance, which runs the function to completion and is then destroyed. This naive approach incurs substantial cold start latency, mainly due to the initialization cost of the virtualization platform and the cost of setting up the language environment, such as installing and importing libraries. The paper suggests that this overhead can reach a few seconds, or even tens of seconds.

To reduce cold start latency, many service providers maintain, on their worker nodes, a pool of already-initialized instances. These instances are scheduled to serve incoming requests and, when execution completes, are returned to the pool instead of being destroyed. An instance is thus kept in a warm state, with all the system components and libraries already loaded, which eliminates the overhead of initializing them. The paper points out, however, that these idle instances still consume system resources such as memory, so the improved cold start latency comes at a cost.
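The trade-off can be made concrete with a small sketch. The pool below is purely illustrative (the class and its methods are hypothetical, not any provider's actual API): each pooled instance has paid its initialization cost up front, but keeps occupying memory while idle.

```python
import queue

class WarmPool:
    """Hypothetical warm-instance pool: initialization is paid once per
    instance, but idle instances keep consuming memory while pooled."""

    def __init__(self, make_instance, size):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(make_instance())  # init cost paid up front

    def handle(self, request):
        instance = self.pool.get()          # grab a warm instance
        try:
            return instance.run(request)    # no per-request init needed
        finally:
            self.pool.put(instance)         # return to pool, stays warm
```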

The other issue faced by today's serverless frameworks is supporting efficient function chaining. In the model assumed by this paper, an application consists of a few types of functions, each called a “grain”. A grain may serve a request entirely on its own, or, more generally, several grains may collaborate to finish a complex task, which is called “function chaining”. In a chain, when a grain finishes execution, it invokes one or more successor grains and passes its output to them. The invocation relation between grains for a particular request may be static or generated dynamically. In either case, it is assumed that the current grain always has full knowledge of the remaining execution path, so that the chain can resume and complete even if a host crashes and the invocation is rescheduled elsewhere for re-execution. In other words, the arguments to a function must encode complete information about the rest of the chain. This property, although not stated explicitly in the paper, is critical for the crash recovery mechanism the paper focuses on.
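To make this property concrete, the remaining chain could travel inside the invocation payload itself, as sketched below; the field names are hypothetical, not taken from the paper.

```python
# Hypothetical invocation payload: because the remaining chain is encoded
# in the message, any node that receives it can finish the chain even if
# the node that issued it has crashed.
invocation = {
    "grain": "resize_image",               # grain to execute now
    "args": {"image_id": "img-42"},        # input to the current grain
    "remaining_chain": ["store_thumbnail", # successors still to invoke,
                        "notify_user"],    # in order
    "origin": "request-7f3a",              # external request being served
}
```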

Existing solutions simply treat chained invocations as regular requests, forwarding them to the global message queue at the frontend API gateway as if they came from the external world. This approach is clearly sub-optimal: just to get a function invoked, it pays a full round trip from the worker node to the frontend node, plus the processing time of the API gateway. In theory, the function could be invoked locally if an instance is deployed on the same machine, avoiding all of that latency except the minimal overhead of a few local IPC messages. Besides, none of the existing frameworks handles the atomicity of chained functions, which matters because a chain must eventually produce an output rather than leave the client hanging. This requires some mechanism to ensure that a function chain, once started, always runs to completion, even if one or more worker nodes crash.

SAND addresses the first issue, the seemingly inevitable trade-off between cold start latency and the resource consumption of idle instances, by lowering the degree of isolation within an application. SAND recognizes that strong isolation is only a hard requirement between applications, which are developed independently and do not expect to be co-located with one another. Grains within the same application, on the other hand, often need to interact to complete a task collaboratively, and are hence aware of each other's existence. In the latter case, the paper argues, these grains can be hosted in a single container without losing much of the benefit of serverless.

SAND therefore co-locates grains from the same application within the same container, as separate processes. Each grain function has a management process (a “zygote”) in the container, which has already performed initialization and loaded all the libraries, and serves as a template process. When a request for the grain is dispatched to the container, the management process spawns a new grain instance via fork(), inheriting all the initialized state, which eliminates the container initialization and language runtime cost for all but the first invocation. After a function chain completes, the container need not be preserved, and SAND maintains no instance pool, because the cold start overhead of the container's first invocation is amortized over all subsequent invocations.
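A minimal sketch of the zygote idea follows, assuming a simple grain handler interface (initialize() and run() are illustrative names, not SAND's actual API):

```python
import os

def zygote_loop(grain_handler, request_source):
    # The zygote (parent) pays the initialization cost exactly once:
    # libraries are imported and state is set up before any request arrives.
    state = grain_handler.initialize()
    for request in request_source:      # e.g. requests drained from a queue
        pid = os.fork()                 # child inherits the warm state
        if pid == 0:
            grain_handler.run(request, state)
            os._exit(0)                 # child exits; zygote stays warm
        # A real host would reap children asynchronously and run them
        # concurrently; we reap inline to keep the sketch short.
        os.waitpid(pid, 0)
```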

To address the second issue, enabling fast function chaining, SAND makes two further improvements while still presenting an atomic interface to external callers. First, function chaining is performed locally whenever possible: the next function on the invocation chain is scheduled on the same worker node as the previous one, as long as conditions permit (e.g., the function is actually deployed on the current worker node, and doing so does not significantly disrupt load balancing). To achieve this, SAND adds a local message queue alongside the existing global message queue at the frontend gateway, forming a two-level hierarchy of message queues. The local message queue resides on each worker node and only handles invocation requests originating from that node, dispatching them to another grain instance on the same node. A chained call is thus performed locally by sending the request to the local message queue; if the requested function is registered on the local node, the request is dispatched directly to the container hosting the grain without going through the global queue. This reduces both the latency of chained invocations and the network traffic between worker and frontend nodes. Outputs and arguments of grains can also be passed via cheaper IPC mechanisms, rather than being serialized and sent in TCP packets.
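The dispatch decision itself is simple; a hedged sketch is below, with local_grains standing in for whatever registry the worker keeps of locally deployed functions (all names here are assumptions, not SAND's interfaces):

```python
def dispatch(invocation, local_queue, global_queue, local_grains):
    """Two-level dispatch: prefer the local queue when the target grain
    is deployed on this worker node, else fall back to the global queue."""
    if invocation["grain"] in local_grains:
        local_queue.put(invocation)    # same-node hop, cheap IPC
    else:
        global_queue.put(invocation)   # routed via the frontend gateway
```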

The hierarchical message queue architecture alone still does not address the atomicity of chained functions. Consider a host that crashes, losing all of its state: the function chain executing on that node cannot resume, because no other node has a record of the chain being scheduled. The client that originally made the request would wait until a timeout, which runs against the design philosophy of serverless and hurts user experience. The paper hence proposes a crash recovery protocol over the two-level hierarchy that replicates each function invocation message to both the local and the global queue, drawing an analogy to redo logging, the standard technique for ensuring transaction atomicity in database systems. With crash recovery enabled, a local invocation request is published to both the local and the global queue, with the global copy tagged as “not completed”. The message to the global queue also records the origin of the request and all information needed to complete the remaining chain. When the grain finishes execution, the local message queue updates the request's status in the global queue to “completed”, causing the global queue to delete it. If, however, the worker node crashes during execution, which the infrastructure can detect within a fairly small timeout (e.g., via heartbeat messages), the global message queue dispatches all messages from the crashed node still marked “not completed” to another worker node, allowing those function chains to run to completion.
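The protocol can be sketched as follows. The queue methods mark_completed() and pending_from() are hypothetical stand-ins for whatever bookkeeping the global queue actually performs; the structure simply mirrors the redo-logging analogy above.

```python
def invoke_with_backup(invocation, local_queue, global_queue):
    # Publish to both queues; the global copy acts as the redo-log record.
    global_queue.put({**invocation, "status": "not completed"})
    local_queue.put(invocation)

def on_grain_finished(invocation_id, global_queue):
    # Normal path: mark the global copy completed so it is discarded.
    global_queue.mark_completed(invocation_id)           # hypothetical API

def on_worker_crash(crashed_node, global_queue, scheduler):
    # Failure path: every message from the crashed node still marked
    # "not completed" is re-dispatched to a healthy worker node.
    for msg in global_queue.pending_from(crashed_node):  # hypothetical API
        scheduler.dispatch(msg)
```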