Memory Deduplication for Serverless Computing with Medes

- Hide Paper Summary

Paper Title: Memory Deduplication for Serverless Computing with Medes
Link: https://dl.acm.org/doi/10.1145/3492321.3524272
Year: EuroSys 2022
Keyword: Serverless; Deduplication; Cold-Start Latency

Back

Highlights:

In addition to warm and cold states, function instances can be maintained in a dedup state where its address space is deduplicated and it requires an extra restoration step to be brought up. The dedup state consumes less memory than warm state and has shorter startup latency.
The fingerprint of a page can be generated by running Rabin hash over the page, and selects the 64-byte chunk whose hash value matches a certain pattern. If an identical chunk exists on another page, then both pages will have the same fingerprint being generated, which can then be matched with a hash table.
Deduplication should take ASLR and unaligned data into consideration. These two properties make it less effective to assume that data on the same offset is likely identical. On the contrary, Rabin hash is oblivious to unaligned data as the algorithm generates a hash value for every possible window with small overheads.
Function checkpoints can be generated using CRIU. In Medes, a small subset of function instances (1 in every 40 instances) are checkpointed and used as the reference. Pages in these checkpoints are regarded as reference pages, whose fingerprint values are entered into a central repository mapping the hash values to the storage locations of the corresponding pages.

Comments:

My biggest concern is that, as the paper points out, two major sources of redundancy are shared libraries and opened files. In the former case, there is nothing to deduplicate as the OS effectively deduplicates them. In the latter case, the content of the file may have already been loaded into the OS page cache and is used by another thread. These two cases seem to involve great complications as they have to be dealt with in the kernel. I understand that the Medes design uses CRIU for checkpointing and restoration, which may already have solutions for the above scenarios, but I am still curious to learn how they are done from a high level.
It also seems to me that Medes is most effective for data segment and heap memory since they contain private memory that no other process may share. In the case of shared memory (e.g., page cache, shared library), the OS already does a good job deduplicating them.
The paper mentions that a page’s fingerprint consists of five chunks selected from the page. What if the page only contains fewer than five chunks? Besides, chunks are selected if the last two bytes match a particular pattern. Assuming an even value distribution, this mechanism will only select a chunk every 2^16 = 64KB, and five chunks will require, on average, 320KB of data.
In 4.1.2, are the chunks selected based on the last two bytes of chunk data (which is what the text seems to suggest – also there is a typo), or the last two bytes of the hash value (which is more reasonable as with Rabin-Karp algorithm)?
In Section 2.1, if you sample K bytes at “regular fixed offsets of 2K bytes”, then why bother using Rabin hash? Aren’t the K-byte chunks selected at fixed offsets as well?

This paper presents Medes, an in-memory deduplication module that enables fast startups of serverless functions. Medes is built based on the observation that serverless functions running on a cluster often share identical or similar pages, which can then be leveraged for deduplication. The deduplicated function instances consume less memory and can hence be preserved in the memory in a warm state for a longer period without causing memory pressure. Medes achieves this design goal by using process checkpoints and fingerprint hashing to identify similarities between their memory images.

The Medes design is motivated by the classic problem of reducing cold-start latency in serverless computing. Cold-start latency has become a major problem as serverless functions are typically short in execution time. As a result, the relatively heavyweight process of starting the function instance and setting up the execution environment can occupy a significant part of function execution. To reduce such latency, prior works proposed using Keep-Alive policies that keep function instances “warm” in the system after the function completes. A warm function instance can be invoked much faster than a newly started “cold” instance, at the cost of higher memory pressure since the warm instances are maintained as idle functions containers in the main memory. The paper points out that Keep-Alive policies hardly work well in practice, due to the unpredictable nature of function arrivals.

Instead of simply keeping existing function instances in the main memory as warm instances, Medes proposes to deduplicate these warm instances using a small number of base instances as references. Consequently, these instances, after deduplication, consume less memory and hence cause less memory pressure, at the cost of a slightly longer latency to bring them up by performing the restoration. To achieve this goal, Medes introduces a third state of function instance, called the “dedup” state. Under the dedup state, functions are still kept in the main memory, but their address spaces are deduplicated for reduced memory consumption. When a function in the dedup state is to be brought up by an incoming request, Medes will first restore the function’s address space image by restoring its content from both the reference pages and the patches (diff that should be applied to reference pages). Functions can transit from the warm state to the dedup state under the command of the Medes controller when deemed necessary. The controller may also purge (i.e., terminate) dedup function instances completely, freeing all its memory resources, when the time that the function spends in that state exceeds a certain threshold (denoted as the “keep-dedup period”).

When a function transits from warm to dedup state, the address space of the function is deduplicated against the existing function instances. Similarly, when a function is brought up, it must be restored to its original state before execution can start. In Medes, deduplication and restoration are local operations being performed by a daemon process called the dedup agent. When a function instance is to be deduplicated, a checkpoint of the process image is first taken using the CRIU framework. After the checkpoint is taken, the local dedup agent scans checkpoint data, and for each page, computes the fingerprint that represents the content of the page. Page fingerprint is computed by running the Rabin rolling hash on a 64-byte window and selecting the 64-byte chunk whose hash value matches a certain pattern. The selected 64-byte chunks are called Reusable Sandbox Chunks (RSCs) of the page, and a total of five RSCs for each page constitute the fingerprint of the page.

The reasons that Rabin hashes are used for generating page fingerprints are (1) Rabin hash is easy to compute as it only requires a linear scan of the page to generate hash values for all 64-byte windows, and (2) Running Rabin hash at a smaller granularity than the page size adapt to ASLR and, in general, unaligned in data on the address space very well, since Rabin hash can capture data similarity even at different offsets (and is hence more widely known as a string matching algorithm).

After the page fingerprints are generated, the local dedup agent sends the fingerprints to the central Medes controller located at a gateway node for deduplication. The Medes controller maintains a
central repository of reference pages and their fingerprints, implemented as a hash table mapping hash values to reference page locations. On receiving the fingerprints, it looks up the hash table using the hash values included in the request and retrieves pages that have the same hash value. If more than one page is returned, the page that has the most matches between hash values is selected as the reference page.

After selecting the reference page, the page location is sent back to the dedup agent, and the dedup agent reads the page using RDMA from its storage location. After the reference page is read, the dedup agent performs local deduplication using Xdelta3 library which extracts the longest common subsequence between the page under deduplication and the reference page. A “patch” is generated based on the result of the comparison, which is then stored with the location of the reference page together as the deduplicated representation of the page. The patch contains the offsets and the sequences that must be applied on top of the reference page in order to
restore the content of the current page. Since the two pages are expected to contain similar contents, thanks to the fingerprinting process, the patch should be relatively smaller than a full page, and therefore it saves memory.

On restoration, the local dedup agent just performs the opposite of deduplication. It first reads the reference page from the storage location using RDMA just like in deduplication. Then the dedup agent applies patches to each reference page and hence restores the content of the deduplicated page. The restored memory image is then loaded back to the address space, after which the function instance is brought up and ready to run.

As mentioned earlier, the Medes controller keeps track of a small subset of pages to be used as the reference page for deduplication. The small subset is constituted by pages in function instance checkpoints generated by CRIU. The Medes controller generated a few checkpoints for every function type, keeping the ratio between the number of instances and the number of checkpoints at approximately 40 to 1, i.e., for every 40 instances of functions in the system, one checkpoint is generated and registered with the Medes controller. Pages in checkpoints are fingerprinted in the same manner as in deduplication, and the hash values are sent to the Medes controller together with the storage location of the reference pages (node name + local memory address) such that the controller will insert them into the central repository. To make deduplication and restoration operations fast, the reference pages are kept in the main memory and can be fetched via RDMA.

The paper also discusses operations policies adopted by Medes controller in order to achieve a balance between memory consumption and average startup latency. From a high-level perspective, the more instances are maintained in the dedup state, the less the memory pressure will be, at the cost of higher average startup latency. Therefore, the main goal of the policy is to dynamically balance function instances between the warm and the dedup states, such that the given constraints are met.