Benchmarking, Analysis, and Optimization of Serverless Function Snapshots



Paper Title: Benchmarking, Analysis, and Optimization of Serverless Function Snapshots
Link: https://dl.acm.org/doi/10.1145/3445814.3446714
Venue: ASPLOS 2021
Keyword: Serverless; Keep-Alive; Caching Policy; REAP; Serverless Snapshot




Highlight:

  1. VM snapshots can reduce cold start latency, but restoring from a snapshot still incurs non-negligible overhead due to page faults and random disk accesses.

  2. Warm instance caching is effective for frequently invoked functions, at the cost of extra memory consumption, but it wastes memory on infrequently invoked functions, which are unlikely to be reused before eviction. For the latter, cold start latency can instead be reduced by eagerly loading the pages that are likely to be accessed during execution, which eliminates page fault handling and makes disk accesses sequential.

  3. Serverless functions have distinctive memory properties. First, they access a consistent working set across invocations, so a single profiling run suffices to identify these pages, which in turn enables eager page loading. Second, snapshot-restored instances have smaller working sets than warm instances, because some pages are accessed only during initialization and never thereafter. Lastly, serverless functions’ page fault patterns lack spatial locality, which makes disk accesses more random and hence lowers throughput.

Comments:

  1. Do function access patterns change gradually over time? If so, the profiling should be redone periodically. I think it is unlikely, because serverless functions are small and typically simple, so they are largely input-agnostic even when inputs shift.

  2. Typo on page 8: “REAP creates two files, namely the working set (WS) file)” contains a stray closing parenthesis.

This paper presents Record-and-Prefetch (REAP), a lightweight software mechanism for accelerating cold starts of serverless functions from a VM snapshot. The paper is motivated by the slowdown caused by frequent page faults and irregular disk accesses when starting a new serverless instance from an on-disk snapshot. Observing that most function instances access a very stable subset of the data in the snapshot, the paper proposes to first record those pages and then prefetch them at VM start time, reducing both page fault handling and disk access overhead.

Cold start latency is a well-known problem for serverless VMs, and it stems from three sources: the VMM itself, the kernel and runtime environment, and the function’s own initialization. Cold start latency complicates the management of serverless clouds for several reasons. First, serverless functions are billed only for the actual time spent executing the function body, so time spent on a cold start for every invocation is hardware wasted on computation that generates no profit. To address this, current cloud providers simply keep VM instances alive in main memory for tens of minutes before shutting them down; future requests to the same function can then reuse the instance, eliminating the unnecessary cold start. Second, cloud providers tend to deploy thousands of instances on a single server to achieve a high tenancy level. The caching policy, while effective for a single function, can become a heavy memory burden: each VM instance can occupy a few hundred megabytes, so caching thousands of them would require several hundred GBs of main memory, which is unrealistic on today’s platforms. Lastly, according to a previous study of Microsoft Azure, most functions (more than 90%) are invoked only sparsely, so a fixed-time caching policy does not work for them: they are likely to be shut down before being reused, which defeats the purpose of caching.
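A quick back-of-envelope check makes the memory-pressure argument concrete. The instance count and per-VM footprint below are illustrative assumptions in the spirit of the paper’s “thousands of instances, a few hundred MB each”, not figures taken from it:

```python
# Back-of-envelope estimate of the keep-alive caching cost.
# Both numbers are illustrative assumptions, not measurements from the paper.
instances_per_server = 4000   # high-tenancy deployment target
footprint_mb = 200            # "a few hundred MB" per cached VM instance

total_gb = instances_per_server * footprint_mb / 1024
print(f"Memory needed to cache all instances: {total_gb:.0f} GB")  # ~781 GB
```

Even with conservative inputs, the total lands in the hundreds of GBs, which is why keep-alive caching alone cannot scale to high tenancy.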

More recent research suggests that snapshotting can address the cold start latency problem without incurring excessive memory overhead. In snapshotting, the memory image of the virtual machine is captured after initialization and saved to secondary storage as a file object, together with the guest virtual and physical address layout (obtained from the guest OS’s page table). The execution context is also dumped so that execution can resume at the precise point where the snapshot was taken. The next time the same function is requested, the VM is started with all of its guest physical addresses marked invalid in the host OS’s page table. The snapshot image is brought into main memory incrementally, following a lazy loading strategy: a physical page of the VM instance is loaded only when the corresponding virtual page is accessed for the first time, which triggers a page fault, allowing the VMM to install the page by reading it from the snapshot file into a page frame and updating the page table. While seemingly attractive, the paper later points out that snapshotting incurs non-negligible costs in page fault handling and random disk access, which we discuss in the following.
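The lazy-restore control flow can be sketched as below. This is a toy model, not the VMM’s actual code path: the class and method names are hypothetical, and in a real system the fault would be delivered through a kernel mechanism such as userfaultfd rather than a Python call.

```python
import os

PAGE_SIZE = 4096

class LazySnapshot:
    """Toy model of lazy snapshot restore: all guest pages start
    unmapped and are loaded from the snapshot file one fault at a time."""

    def __init__(self, snapshot_path):
        self.fd = os.open(snapshot_path, os.O_RDONLY)
        self.resident = {}  # guest frame number -> page contents

    def on_page_fault(self, gfn):
        """Invoked when the guest first touches an unmapped page: one small
        random 4 KB disk read per fault, sitting on the critical path."""
        if gfn not in self.resident:
            page = os.pread(self.fd, PAGE_SIZE, gfn * PAGE_SIZE)
            self.resident[gfn] = page  # the VMM would map the frame and update the page table
        return self.resident[gfn]
```

The key cost drivers visible even in this sketch are the per-page fault trap and the small, randomly ordered reads against the snapshot file.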

The paper assumes a serverless platform based on vHive, although the idea is generally applicable to other snapshot-based platforms. The platform provides a Lambda-like interface in which functions are addressed by URL, both externally and internally, allowing functions to be composed into more complex program logic. The platform frontend (the control plane) consists of a load balancer, a Kubernetes cluster scheduler, and a Knative autoscaler, which are responsible for request dispatching, worker node management, and function scaling, respectively. The backend worker nodes (the data plane) execute functions in Firecracker MicroVMs; each MicroVM, paired with a Knative queue proxy, forms what is called a Pod. The queue proxy buffers requests dispatched to the MicroVM so that they are handled serially, and it reports the queue length back to the Knative autoscaler as the feedback signal for autoscaling. Each worker node runs a background Container Runtime Interface (CRI) process that serves as the receiving end of the Kubernetes cluster scheduler and carries out the scheduling decisions made by the frontend. The paper also mentions two daemon processes, Containerd and Firecracker-Containerd, which manage the lifetimes of regular containers and Firecracker MicroVMs, respectively, although few details are given.
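To make the Pod composition concrete, here is a purely illustrative toy model of the data-plane pieces described above; the real components are services in vHive/Knative, and every name here is an assumption of the sketch:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class QueueProxy:
    """Buffers requests for serial handling by the MicroVM and exposes
    its depth as the autoscaling feedback signal, as Knative's proxy does."""
    queue: deque = field(default_factory=deque)

    def enqueue(self, request):
        self.queue.append(request)

    def depth(self) -> int:
        # Reported back to the Knative autoscaler.
        return len(self.queue)

@dataclass
class Pod:
    """A worker-node Pod: one Firecracker MicroVM paired with a queue proxy."""
    microvm_id: str
    proxy: QueueProxy = field(default_factory=QueueProxy)
```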

To pinpoint the cause of inefficiency in snapshotting, the paper compares the latency of functions started from a snapshot against functions started from a warm container in main memory. To better capture the overall performance impact of page faults, which mostly occur during execution, latency is measured as the time between the moment the worker node receives the request and the moment the result is sent back to the scheduler; frontend latency is considered insignificant and excluded. The test platform is equipped with SSDs as local high-speed storage for snapshots, and to simulate infrequent function invocations, the OS page cache holding recently accessed disk pages is flushed before each cold start.
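The methodology can be approximated as in the sketch below. Writing "3" to /proc/sys/vm/drop_caches is the standard Linux way to drop the page cache (it requires root); the request-handling hook is a hypothetical placeholder for the worker-node path being timed:

```python
import subprocess
import time

def flush_page_cache():
    """Flush the OS page cache so the snapshot must be re-read from SSD,
    simulating an infrequently invoked function. Requires root."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")

def measure_invocation(handle_request, request):
    """Latency = request arrival at the worker node until the result is
    sent back; frontend latency is deliberately excluded."""
    start = time.monotonic()
    result = handle_request(request)  # hypothetical worker-node handler
    return result, time.monotonic() - start

flush_page_cache()  # run before each cold-start trial
```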

The paper reports the following results. First, even with snapshotting, cold starts are still one to two orders of magnitude slower than warm starts. Performance decomposition shows that a snapshot-based cold start consists of three stages. The first stage initializes the virtual machine and its virtual devices. The second stage reestablishes the persistent gRPC connection, which serves as the control channel between the VM instance and the external world, and this also takes noticeable time. The paper does not intend to optimize these two overheads and hence gives them little analysis. In the last stage, the function begins execution. The paper notes that function execution in a cold start instance takes significantly longer than in a warm instance. Further study shows that the slowdown is caused by frequent page faults (thousands per invocation), which are necessary to bring the guest physical address space into host memory. Furthermore, the paper shows that the spatial locality of the accessed pages is low, so disk throughput cannot be fully utilized: external storage performs worse under more random access patterns, which holds even for SSDs.
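The sequential-versus-random throughput gap is easy to observe with a small microbenchmark like this sketch. The file name, page count, and read pattern are arbitrary assumptions; run it against a large file with a cold page cache to see the effect:

```python
import os
import random
import time

PAGE = 4096

def read_throughput(path, offsets):
    """Read one 4 KB page at each offset; return MB/s."""
    fd = os.open(path, os.O_RDONLY)
    start = time.monotonic()
    for off in offsets:
        os.pread(fd, PAGE, off)
    elapsed = time.monotonic() - start
    os.close(fd)
    return len(offsets) * PAGE / elapsed / 2**20

n_pages = 25_000  # ~100 MB of 4 KB pages
sequential = [i * PAGE for i in range(n_pages)]
shuffled = sequential[:]  # same pages, random order
random.shuffle(shuffled)

# Even on SSDs, the shuffled pattern typically yields markedly lower MB/s.
print(read_throughput("snapshot.img", sequential))  # hypothetical snapshot file
print(read_throughput("snapshot.img", shuffled))
```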

The paper also investigates the memory footprints of cold and warm start instances. Surprisingly, the results show that cold start instances consume less memory, while warm instances use more. The paper attributes this to pages that are used only during OS and runtime initialization and are never accessed again during execution. The on-demand lazy paging strategy of cold start instances excludes these pages from the in-memory working set, since they are never touched during execution. The paper argues that while this observation partially justifies using snapshots for fast cold starts, since it lowers runtime memory consumption, the technique alone is not sufficient to enable full caching of warm instances in main memory: the memory consumed by co-locating all of these instances would still be unrealistic, given the opportunity cost of wasting resources on idle function instances.

The last performance study looks into the consistency of working sets, in terms of accessed pages, across function invocations. The paper observes that most function instances access a rather stable set of pages across different invocations. Two factors contribute to this important observation. First, instances of the same function use the same libraries, share the same code, and, being stateless, most likely run similar logic on different inputs. Second, since function instances boot from the same snapshot, they begin with identical initial states, which leads to identical allocation decisions even for dynamically allocated memory.
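One natural way to quantify this stability is the Jaccard similarity between the sets of page numbers touched by two invocations. The sketch below is illustrative; the traces would come from recorded page faults, and the sample page numbers are made up:

```python
def jaccard(ws_a: set, ws_b: set) -> float:
    """Overlap between two working sets of guest page numbers."""
    return len(ws_a & ws_b) / len(ws_a | ws_b)

# Hypothetical traces: guest page numbers faulted in during two invocations.
invocation_1 = {0x1000, 0x1001, 0x1002, 0x2000}
invocation_2 = {0x1000, 0x1001, 0x1002, 0x2001}

print(f"working-set similarity: {jaccard(invocation_1, invocation_2):.0%}")  # 60%
```

A similarity close to 100% across real invocations is what justifies profiling the working set once and prefetching it thereafter.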

The paper hence proposes REAP, a software component that prefetches the pages in the observed working set to eliminate the page fault handling and random disk access overheads of snapshot-based cold starts. Serverless functions benefit from REAP in two ways. First, REAP eagerly installs all pages in the working set during the initialization stage, avoiding page fault handling on these pages during execution. The eager scheme incurs no extra latency, since under the lazy scheme those page faults sit on the critical path anyway. Second, REAP stores the working set data contiguously in a separate file and bulk-loads the entire file into main memory. This requires only a single large I/O request and mostly sequential disk access, which yields higher disk throughput.
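The contrast between the two I/O patterns can be sketched as follows. The file layout and function names are assumptions of this sketch, not REAP’s actual interfaces:

```python
import os

PAGE = 4096

def lazy_restore(snapshot_fd, faulted_gfns):
    """Baseline: one small, randomly ordered read per page fault."""
    return {gfn: os.pread(snapshot_fd, PAGE, gfn * PAGE)
            for gfn in faulted_gfns}

def reap_restore(ws_fd, ws_layout):
    """REAP-style: the recorded working set lives contiguously in a
    separate WS file, so one sequential bulk read fetches every page."""
    blob = os.pread(ws_fd, len(ws_layout) * PAGE, 0)   # single sequential I/O
    return {gfn: blob[i * PAGE:(i + 1) * PAGE]         # scatter to guest frames
            for i, gfn in enumerate(ws_layout)}
```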

REAP is implemented as a standalone, user-space module that runs as a thread within the VM instance’s address space. REAP redirects process-level page fault callbacks to its own handler running in user space, so that it can both track and record page faults and satisfy them with pages it allocates itself. REAP optimizes cold starts in two phases. In the first, record phase, REAP tracks a cold start instance’s memory accesses and records the guest virtual and physical address of each access. It also dumps the accessed guest physical pages into a working set (WS) file for later use. After execution completes, the set of accessed virtual addresses, along with the guest address mapping, is saved as well. In the second, prefetch phase, when an instance of the same function is started, REAP loads the entire working set file into main memory on the first page fault (which Firecracker triggers artificially), and uses the recorded guest address mapping to update the host page table. This recovers the virtual and physical address layout for all pages in the working set file. Additional page faults may still occur during execution, for pages outside the recorded working set, and are handled lazily as before.
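Putting the two phases together, a minimal model of the record/prefetch cycle might look like the sketch below. The userfaultfd plumbing and Firecracker integration are elided, and every name is hypothetical; this only illustrates the record-then-replay logic described above:

```python
import os

PAGE = 4096

class Reap:
    """Toy two-phase model: record the working set on the first (profiling)
    cold start, then prefetch it on every subsequent cold start."""

    def __init__(self, ws_path):
        self.ws_path = ws_path
        self.layout = []  # guest frame numbers, in WS-file order

    def record(self, snapshot_fd, fault_stream):
        """Phase 1: copy each faulted page into the contiguous WS file and
        remember its guest address so it can be reinstalled later."""
        with open(self.ws_path, "wb") as ws:
            for gfn in fault_stream:
                ws.write(os.pread(snapshot_fd, PAGE, gfn * PAGE))
                self.layout.append(gfn)

    def prefetch(self):
        """Phase 2: on the artificial first page fault, bulk-load the WS file
        and install every recorded page; faults on pages outside the working
        set fall back to the lazy path."""
        with open(self.ws_path, "rb") as ws:
            blob = ws.read()  # one sequential read of the whole working set
        return {gfn: blob[i * PAGE:(i + 1) * PAGE]
                for i, gfn in enumerate(self.layout)}
```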