Lukewarm serverless functions: characterization and optimization



Paper Title: Lukewarm serverless functions: characterization and optimization
Link: https://dl.acm.org/doi/10.1145/3470496.3527390
Year: ISCA 2022
Keyword: Serverless; Prefetching; Jukebox; Function Keep-Alive; Instruction Cache




Highlights:

  1. Serverless functions suffer from “lukewarm execution”, most notably cache thrashing on instruction data, as a result of short execution times, long invocation intervals, and interleaved execution of different functions on the same core.

  2. Cache thrashing on instruction data happens mostly in the LLC, because most functions have an instruction footprint of several hundred KBs, which is too large to be fully stored in the higher levels of the hierarchy anyway.

  3. This issue can be addressed with a simple logging mechanism that records the L2 instruction miss pattern at the granularity of 1KB regions. Misses to the same region are coalesced into the same entry of the hardware log buffer, and entries evicted from the buffer are written to main memory for later use.

  4. The prefetching can be as simple as replaying the instruction access pattern from the previously recorded log. There is no need for rate limiting or throttling because the L2 cache is assumed to be big enough.

Comments:

  1. The experiments in Section 2 are conducted on a real machine with a 256KB L2 cache, but the final evaluation is conducted on a simulated 1MB L2 cache. This creates some minor confusion: Section 2 concludes that the L2 is too small to hold the full instruction working set for both back-to-back and interleaved execution, yet later sections claim that prefetching into the L2 cache is sufficient because the L2 is big enough.

  2. I am pretty sure this question has been asked by one of the reviewers, but how does Jukebox interact with address space randomization? This is especially critical in cloud settings, because you cannot simply turn it off.

  3. How does Jukebox work with multithreaded functions? Does each thread have its own log (unlikely)? If not, how would the hardware coordinate? I would imagine the OS only collects the trace on one of the threads. And most serverless functions are not multithreaded anyway.

  4. Although a little unrelated to the topic itself, what is the policy for refreshing the collected trace? I imagine there would be a TTL. Or, even simpler, just let it be torn down with the container process.

This paper proposes Jukebox, an instruction cache prefetching mechanism designed to reduce instruction cache misses for short-lived serverless functions. The paper is motivated by three distinctive properties of serverless functions that are not observed in other types of cloud services. First, serverless functions generally have short execution times, on the order of milliseconds, and small memory footprints, usually within a few hundred megabytes, because most functions are deployed for web services and only perform simple tasks such as HTML rendering. Second, different serverless functions often co-reside on the same physical machine and share its hardware resources. This is a natural outcome of today’s cloud computing paradigm, as service providers adopt virtualization technologies that make it possible to subscribe the same physical machine to many unrelated service instances. Lastly, the interval between invocations of a function can be much longer than the function’s lifetime, especially when that lifetime is short.

These three observations, combined, bring about an issue known as the cold start problem. When a function is short and frequently invoked, the overhead of initializing and tearing down the function container cannot be amortized over the function’s execution period, unlike the case of traditional long-running services. To address the issue, function service providers have already adopted a technique known as function keep-alive, which maintains “warm” instances of function containers in a process pool. Function invocations are serviced by an already initialized process from the pool, and when a function completes execution, the container process is returned to the pool rather than being torn down, thus avoiding the extra cost of container initialization and teardown. A container process is destroyed only when its function has not been invoked for a while, in order to free up the resources it consumes.
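
To make the keep-alive policy concrete, here is a minimal sketch of a warm-container pool with idle-timeout reaping. It is an illustrative assumption of how such a pool might look, not a description of any particular provider’s implementation; `Container`, `spawn_container()`, and the TTL mechanism are invented for the example.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// A warm (already initialized) container process for one function type.
struct Container {
    std::string function_id;
    Clock::time_point last_used;
};

class KeepAlivePool {
public:
    explicit KeepAlivePool(std::chrono::seconds ttl) : ttl_(ttl) {}

    // Service an invocation: reuse a warm container if one is available,
    // otherwise pay the cold-start cost of spawning a new one.
    Container acquire(const std::string &function_id) {
        auto &idle = idle_[function_id];
        if (!idle.empty()) {
            Container c = std::move(idle.back());
            idle.pop_back();
            return c;                          // warm start
        }
        return spawn_container(function_id);   // cold start
    }

    // On completion, return the container to the pool instead of tearing it down.
    void release(Container c) {
        c.last_used = Clock::now();
        idle_[c.function_id].push_back(std::move(c));
    }

    // Periodically destroy containers that have sat idle past the TTL,
    // freeing the resources they consume.
    void reap() {
        auto now = Clock::now();
        for (auto &kv : idle_) {
            std::erase_if(kv.second, [&](const Container &c) {
                return now - c.last_used > ttl_;
            });
        }
    }

private:
    static Container spawn_container(const std::string &id) {
        return {id, Clock::now()};             // stands in for real container init
    }

    std::unordered_map<std::string, std::vector<Container>> idle_;
    std::chrono::seconds ttl_;
};
```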

This paper observes, however, that even with function keep-alive, instances executing on a shared platform with a high degree of interleaving still suffer performance degradation compared with executions conducted back-to-back without any interleaving. The paper calls this phenomenon “lukewarm execution”, and attributes the performance issue to cold processor state resulting from over-subscribing the processor to many different instances between two invocations of the same function.

The paper further conducts experiments comparing the CPI of interleaved and back-to-back executions, and draws the following conclusions. First, the instruction footprint of a function is typically several hundred KBs, with low variance across functions. Second, most of the instruction footprint is common across repeated invocations of a function, except for a few outliers. Third, performance counters suggest that the main bottleneck during execution is frontend instruction fetch latency. Further investigation into cache misses reveals that when executions are interleaved, the hit rates for instruction fetches are uniformly low at all levels of the hierarchy. Meanwhile, when executions are back-to-back, the L1 and L2 still show high miss rates because the instruction working set exceeds their capacity, but the LLC sees far fewer misses. Taken together, these results indicate that last-level cache thrashing on instruction data is the main bottleneck behind the slowdown known as lukewarm execution. Moreover, since the processor is over-subscribed to many functions whose combined instruction footprint is much larger than the LLC, simply attempting to keep all instruction data resident in the LLC will not work, necessitating a different mechanism.

The paper therefore proposes Jukebox, a record-and-replay instruction prefetching mechanism that alleviates the negative effect of LLC thrashing on instruction data. Jukebox works as a two-stage process. In the record stage, the history of instruction misses is logged by a small piece of dedicated hardware into main memory, capturing the access pattern of costly instruction fetches. In the replay stage, the instruction access log is loaded by the prefetching hardware, and the corresponding instruction blocks are fetched into the L2 cache. The paper argues that it is better to prefetch into the L2 than into the L1, as the typical instruction footprint is too large to be stored entirely in the L1. We next describe the two stages in more detail.

The record stage begins when a function is invoked and no instruction log exists for that function type. The OS kick-starts the record stage by preparing a small memory area for storing the access log and passing its pointer to the record hardware co-located with the L1 cache. Then, during execution, whenever an instruction access misses the L2 cache, the record hardware adds the access to a hardware access log called the CRRB. The CRRB is a FIFO queue in which each entry stores the access history of a 1KB region. An entry consists of a tag address, aligned to a 1KB boundary, and a 16-bit vector in which each bit represents one cache line of the 1KB region being accessed. The CRRB is implemented as a small associative CAM (e.g., 16 entries). When a miss occurs, the miss virtual address is used to look up the CRRB. If an entry is found, the corresponding bit is set. Otherwise, the oldest entry is evicted and appended to the main memory log, and a new entry is allocated at the tail of the CRRB with the miss address as its tag. Note that the CRRB logs the order of first accesses to regions, but chooses not to preserve the order of accesses within a region. Entries evicted from the CRRB are never loaded back, so multiple entries with the same tag address may exist in the main memory log.
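
The following is a behavioral sketch of the CRRB logic described above, written as plain C++ rather than hardware. The structure sizes (16 entries, 1KB regions, 64B lines, hence the 16-bit presence vector) follow the description; the class and member names are my own.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// One CRRB entry: a 1KB-aligned region tag plus a presence bit per 64B line.
struct CrrbEntry {
    uint64_t region_tag;   // virtual address aligned to a 1KB boundary
    uint16_t line_bitmap;  // bit i set => line i of the region missed in the L2
};

class Crrb {
public:
    // Called on every instruction fetch that misses in the L2.
    void record_miss(uint64_t miss_vaddr) {
        uint64_t tag  = miss_vaddr & ~0x3FFULL;        // 1KB-aligned region tag
        unsigned line = (miss_vaddr & 0x3FFULL) >> 6;  // line index within region (0..15)

        // Associative (CAM) lookup: coalesce misses that fall in the same region.
        for (auto &e : entries_) {
            if (e.region_tag == tag) {
                e.line_bitmap |= static_cast<uint16_t>(1u << line);
                return;
            }
        }
        // No matching region: evict the oldest entry to the in-memory log
        // (FIFO order) and allocate a new entry at the tail.
        if (entries_.size() == kCapacity) {
            memory_log_.push_back(entries_.front());
            entries_.pop_front();
        }
        entries_.push_back({tag, static_cast<uint16_t>(1u << line)});
    }

    // At the end of the record stage, drain the remaining entries so the
    // complete trace resides in the per-function log in main memory.
    void flush() {
        for (auto &e : entries_) memory_log_.push_back(e);
        entries_.clear();
    }

    const std::vector<CrrbEntry> &log() const { return memory_log_; }

private:
    static constexpr size_t kCapacity = 16;  // example size; a small CAM
    std::deque<CrrbEntry> entries_;          // the hardware buffer, FIFO replacement
    std::vector<CrrbEntry> memory_log_;      // stands in for the DRAM-resident log
};
```

Note how FIFO eviction naturally writes regions into the memory log in first-touch order, while the per-region bitmap discards any ordering within a region, matching the behavior described above.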

If, however, the function already has an access log recorded earlier, the OS can skip the record stage and directly start replaying when the function begins. The OS initializes the prefetcher hardware with a pointer to the main memory access log for that function type (maintained either in a per-type table or in the task struct of the container process). As the prefetcher kicks off, it keeps reading log entries from the main memory log and, for every cache block whose bit was set during the record stage, issues a fetch request to the LLC using the cache block’s address. This process is repeated for every log entry as quickly as the cache hierarchy can sustain; there appears to be no throttling for rate-limiting purposes.
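
Below is a matching sketch of the replay loop, reusing the `CrrbEntry` log layout from the recording sketch above. `issue_prefetch_to_l2()` is a hypothetical stand-in for the request the prefetcher sends toward the LLC to fill the L2; it is filled in by the translation sketch after the next paragraph.

```cpp
// Hypothetical hook: translate one logged address and send a prefetch toward the
// LLC to fill the L2 (see the translation sketch below).
void issue_prefetch_to_l2(uint64_t line_vaddr);

// Walk the recorded log and prefetch every line whose bit was set during recording.
void replay_log(const std::vector<CrrbEntry> &memory_log) {
    for (const auto &entry : memory_log) {             // regions in first-touch order
        for (unsigned line = 0; line < 16; ++line) {
            if (entry.line_bitmap & (1u << line)) {
                uint64_t line_vaddr = entry.region_tag + line * 64ULL;
                // Issued back-to-back; the paper describes no throttling or rate limiting.
                issue_prefetch_to_l2(line_vaddr);
            }
        }
    }
}
```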

The paper notes that both the record and replay stages use virtual addresses to avoid aliasing problems, which could arise if the OS moves instruction pages around in physical memory. As a result, virtual addresses must be passed to the record hardware via the MSHRs. In addition, the prefetcher must use the instruction TLB to translate the virtual addresses in the collected trace into physical addresses before issuing prefetch requests. This process also prefetches translation entries that will be used later into the I-TLB, which may actually benefit performance.
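
A small sketch of that translation step, completing the `issue_prefetch_to_l2()` hook used in the replay sketch above. `itlb_translate()` and `issue_physical_prefetch()` are hypothetical names for illustration; the point is simply that each logged virtual address goes through the I-TLB before the physical prefetch is issued, which warms the I-TLB as a side effect.

```cpp
uint64_t itlb_translate(uint64_t vaddr);       // hypothetical I-TLB lookup (may trigger a page walk)
void issue_physical_prefetch(uint64_t paddr);  // hypothetical request that fills the L2

void issue_prefetch_to_l2(uint64_t line_vaddr) {
    // Translate the logged virtual address first; this also installs the
    // translation in the I-TLB, which the upcoming execution will reuse.
    uint64_t line_paddr = itlb_translate(line_vaddr);
    issue_physical_prefetch(line_paddr);
}
```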