FaaSM: Lightweight Isolation for Efficient Stateful Serverless Computing



Paper Title: FaaSM: Lightweight Isolation for Efficient Stateful Serverless Computing
Link: https://www.usenix.org/conference/atc20/presentation/shillaker
Year: USENIX ATC 2020
Keyword: Serverless; WebAssembly




Highlight:

  1. Language-level isolation can be achieved by enforcing memory safety at the language level and performing static checks on the IR. Users only need to submit the IR to the platform, which optimizes and links it. The resulting program can thus guarantee memory isolation.

  2. A function’s virtual address space can be just a linear array, as in WASM, which makes bounds checking of memory accesses easier. This is critical for memory-safety checks. The simple memory layout also allows multiple functions to be linked into the same binary (with proper relocation or position-independent addressing), with each executing as a separate thread.

Comments:

  1. I am more interested to learn how multiple Faaslets are compiled into the same binary and executed as different threads from WASM, but unfortunately the paper does not discuss this matter. From my perspective (I could be very wrong, because I do not know anything about WASM except the most basic concepts), WASM programs are compiled into IR and then code-generated into machine code. The memory layout of WASM programs also seems unable to handle multi-threading. For example, a WASM binary has data at the very bottom, followed by the stack, followed by the heap. How would the stack be multiplexed between different threads? I am not 100% sure what I said here is correct, so this may just be straightforward to experts and practitioners. OK, Figure 2 might be conclusive: I think they just follow the single-array memory layout, and map each array to a different part of the virtual address space of the process. Thread scheduling is then done by switching everything, including data, stack, and heap. But in this case, wouldn’t all absolute addresses require relocation? I guess WASM is compiled as position-independent code?

  2. Again, a beginner’s question: does WASM’s binary format follow its abstract memory model? Does the binary image in main memory occupy only a contiguous range of virtual addresses, with data at the bottom, then the stack, then an extendable heap? The paper seems to suggest so (“Since WebAssembly memory is represented by a contiguous byte array, containing the stack, heap and data, FAASM restores a snapshot into a new Faaslet using a copy-on-write memory mapping.”).

  3. The paper spends a major part discussing the OS- and storage-level interfaces, while leaving many important and interesting topics unmentioned. For example, why do you need shared pages between different Faaslets? Is it solely to support the local object cache? Why do Faaslets consume only KBs of memory for non-function data? Is it due to page sharing (likely not, because containers use MBs)? How is network traffic shaping done using tc (and what is tc)? These questions are either unanswered, or only mentioned briefly without further elaboration.

  4. How are Faaslets linked into a single binary? Is there a notion of an application (as in Microsoft Azure), with all Faaslets of an application linked into a single binary? Surely unrelated functions would not be linked into a single binary?

  5. What is the read-global, write-local file system? Is it the same thing as the two-tier persistent state?

  6. The message bus is only mentioned once in Section 3.1, and never again. What is the difference between the message bus and the conventional message queue in other systems? How does it work? Is it even important?

This paper proposes FaaSM, a language-level software framework for lightweight isolation between serverless functions. The paper is motivated by the relatively high overhead of OS- or hypervisor-level resource isolation and its inability to support certain common tasks, such as sharing memory pages and/or a file system. The paper proposes using a language-based isolation mechanism, powered by a strong memory model and static program verification, to enforce isolation statically at compilation and code-generation time, while relaxing the hard isolation boundaries at runtime for better performance. The paper also proposes incorporating object caching into the serverless framework, enabling low-latency access to persistent objects. Stateful serverless functions can hence be created using the object store as a communication channel.

The paper begins by identifying two limitations of today’s serverless platforms, namely data access overhead and container memory footprint. The former is an issue for functions that access persistent storage, since data needs to be copied from the storage to the function over the network. Co-locating function instances with data would be difficult and against the service model of serverless, as functions are supposed to be able to execute anywhere on the cloud and scale up quickly as load surges; these properties would be hard to preserve if functions that access data had to be shipped to the data store. To make the situation worse, if multiple functions access the same data object, which is not uncommon when functions perform logically related tasks, the same data will be copied over the network repeatedly, wasting both network bandwidth and processor cycles. This issue is also hard to address within the functions themselves, since functions are supposed to be stateless and hence preserve no execution state locally, including data objects.

The second limitation plays an important role in the degree of multi-tenancy of the cluster, defined by the number of instances that can be co-located on the same worker node, which is usually memory-bound. As cloud providers tend to keep warm container instances cached after execution has completed to reduce the cold-start latency of future invocations, this problem only gets worse, since the memory consumed by a container instance is freed only after the keep-alive period.

The paper then lists a few design goals of FaaSM. First, FaaSM should support lightweight isolation and add as little runtime overhead as possible. FaaSM achieves this using a mixture of conventional container techniques, such as cgroups and namespaces, and language-level isolation enforced by static checks and program verification. Second, FaaSM should support efficient access to persistent state, thus allowing functions to communicate. This is achieved with a local cache and a remote distributed key-value store. The local cache runs within the function’s address space, and can hence be accessed with even lower latency than a cache running as a separate service on the same machine. Third, FaaSM functions should be able to start with low latency, addressing the well-known cold-start problem. The paper proposes a snapshot mechanism called Proto-Faaslets, which captures the initialized state of a function and saves it as a local persistent object to shortcut future startups. Lastly, FaaSM must restrict the set of system features and system calls that a function can use, as there are no dynamically enforced access rules. FaaSM implements this by linking the function against a wrapper layer, called the host interface, which exposes only a limited set of system calls, and by enforcing statically at compilation time that only calls within the host interface are used. For certain system calls, the arguments are also checked at runtime to make sure that only allowed system features are requested.

FaaSM implements each function as a Faaslet. Faaslets are developed in any suitable high-level language, compiled into the intermediate representation WebAssembly (WASM), and then translated into native machine code by an optimizing compiler after passing the static checks. The most critical feature of WASM is its memory safety. To elaborate: WASM’s memory model is a single flat array, and instructions can only access data as uninterpreted bytes within the array. There is no mechanism to address arbitrary memory outside of the array, and all memory references can be easily verified at runtime by simple bounds checking. Besides, WASM code can be statically verified for memory safety. Programmers can thus compile their functions locally into IR for testing, and then submit the IR to the platform for execution. The IR is first verified for memory safety, and if it passes, translated into native machine code for the target platform. The resulting native code for different Faaslets is guaranteed to be fully isolated in terms of memory accesses.
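To make the bounds-checking idea concrete, here is a minimal sketch (not FaaSM’s actual code generator) of how a load from WASM linear memory can be validated at runtime: every access is an offset into one byte array, so the check reduces to a single comparison against the array’s size.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// A Faaslet's entire memory (data, stack, heap) modeled as one byte array.
struct LinearMemory {
    std::vector<uint8_t> bytes;

    // Read a 32-bit value at `offset`; trap (throw) if the access falls
    // outside the array, which is what enforces memory safety at runtime.
    uint32_t load32(uint32_t offset) const {
        if (bytes.size() < 4 || offset > bytes.size() - 4)
            throw std::out_of_range("wasm trap: out-of-bounds memory access");
        uint32_t v;
        std::memcpy(&v, bytes.data() + offset, sizeof(v));
        return v;
    }
};

int main() {
    LinearMemory mem{std::vector<uint8_t>(64 * 1024, 0)};  // one 64 KiB page
    return mem.load32(0);  // in-bounds read succeeds; offset 70000 would trap
}
```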

FaaSM links several Faaslets into a single binary (the paper does not mention why, or which Faaslets are linked together); they share the same process address space, and each Faaslet executes as a separate thread within it. As mentioned earlier, the Faaslets are fully isolated from each other in terms of memory accesses, due to the memory safety of WASM. Besides, FaaSM uses namespaces to isolate the threads from each other with respect to resource naming, and assigns threads to different cgroups, giving each thread an equal share of CPU time. Network traffic is regulated using a traffic shaper, but the details are missing from the paper.
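A hedged sketch of this arrangement, under my own assumptions about the layout (the names Faaslet and run_wasm_entrypoint are illustrative, not FaaSM’s code): several Faaslets live in one process, each getting its own linear-memory region and OS thread, with isolation coming from the bounds checks rather than from separate page tables.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr size_t kLinearMemBytes = 64 * 1024 * 1024;  // 64 MiB per Faaslet

struct Faaslet {
    uint8_t* linear_mem = nullptr;  // base of this Faaslet's flat byte array
};

// Placeholder for the code-generated WASM entry point; all of its memory
// accesses are expressed as bounds-checked offsets from `mem`.
void run_wasm_entrypoint(uint8_t* mem) { (void)mem; }

int main() {
    std::vector<Faaslet> faaslets(4);
    std::vector<std::thread> threads;

    for (auto& f : faaslets) {
        // Reserve a region of the shared process address space for this
        // Faaslet's data/stack/heap.
        f.linear_mem = static_cast<uint8_t*>(
            mmap(nullptr, kLinearMemBytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
        // Each Faaslet runs as its own thread inside the same process.
        threads.emplace_back(run_wasm_entrypoint, f.linear_mem);
    }
    for (auto& t : threads) t.join();
}
```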

FaaSM restricts access to system calls and system features by exposing only a small subset of system calls to the Faaslets. The paper lists these calls and categorizes them into dynamic linking, memory management, networking, file I/O, and time and random numbers. These exposed calls, collectively the host interface, are implemented as a wrapper module, which is dynamically linked into the binary at runtime for best performance. The wrapper module forwards an incoming call and its arguments to the actual system call, with argument checking where needed (for example, to avoid invoking mmap() in a way that could break isolation).
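A sketch of what such a wrapper might look like; the name faasm_mmap and the specific checks are assumptions for illustration, since the paper only states that arguments are validated before forwarding.

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>

// A Faaslet can only reach the kernel through wrappers like this one.
void* faasm_mmap(void* addr, size_t length, int prot, int flags) {
    (void)addr;
    // Reject requests that would map at a fixed address or create
    // executable pages, since either could escape the linear memory.
    if ((flags & MAP_FIXED) || (prot & PROT_EXEC)) {
        errno = EPERM;
        return MAP_FAILED;
    }
    // Only anonymous, private mappings are forwarded to the real syscall.
    return mmap(nullptr, length, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```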

The host interface also contains service calls that allow functions to be composed and synchronized. Functions in FaaSM can invoke each other and block on an active invocation until it returns. Each chained function call made through the host interface returns a call ID, which can be used to read the result and to block on the invocation. The paper notes that this function-chaining model is similar to how threads can spawn other threads and block on an active thread until it exits.
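Below is a minimal sketch of that chaining model. The function names (chain_call, await_call) and the stubbed behavior are my assumptions; the paper only states that the host interface exposes calls with this shape, so the stubs merely make the usage self-contained.

```cpp
#include <cstdint>
#include <cstdio>

// Stubbed host-interface calls; in FaaSM these would be provided by the
// runtime and would actually dispatch work to other Faaslets.
static uint32_t next_id = 0;
uint32_t chain_call(const char* func_name, const uint8_t*, uint32_t) {
    std::printf("chained %s\n", func_name);
    return ++next_id;                 // call ID handed back to the caller
}
int await_call(uint32_t call_id) {    // blocks until the chained call returns
    std::printf("awaited call %u\n", static_cast<unsigned>(call_id));
    return 0;
}

int main() {
    const uint8_t payload[] = {1, 2, 3};
    // Spawn two downstream functions, much like spawning threads...
    uint32_t a = chain_call("resize_image", payload, sizeof(payload));
    uint32_t b = chain_call("extract_tags", payload, sizeof(payload));
    // ...then join on both, analogous to joining threads before exit.
    return (await_call(a) == 0 && await_call(b) == 0) ? 0 : 1;
}
```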

FaaSM also extends the WASM memory model to support, although still restricted, shared pages. The shared pages are mapped into the virtual address space of the process, which is shared by and accessible to all Faaslets in the same process. In the WASM memory model, the shared pages are simply an extended part of the flat-array memory, so accesses to the shared area are statically checked just like any other memory access. When multiple WASM IR objects are linked together into an executable binary, the shared area is relocated to a separate part of the virtual address space (i.e., the OS loader is told to map it to a region that does not belong to the linear array of any Faaslet), and accesses to the shared pages in all Faaslets are relocated to the new location. Sharing is thus achieved without sacrificing any isolation guarantees or breaking the WASM memory model.
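A rough sketch of the shared-region idea, under my own assumptions about names and layout: one page range is mapped per process and made reachable from every Faaslet, while accesses to it still go through a bounds check exactly like accesses to a Faaslet’s private linear memory.

```cpp
#include <sys/mman.h>
#include <cstdint>
#include <cstring>
#include <stdexcept>

constexpr size_t kSharedBytes = 4096;
static uint8_t* g_shared = nullptr;   // one mapping shared by all Faaslets

// Map the shared area once per process; every Faaslet thread can reach it.
void init_shared_region() {
    g_shared = static_cast<uint8_t*>(
        mmap(nullptr, kSharedBytes, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0));
}

// A Faaslet access to the shared area: an offset checked against the
// region size, mirroring the checks on its own linear memory.
uint32_t shared_load32(uint32_t offset) {
    if (offset > kSharedBytes - 4)
        throw std::out_of_range("out-of-bounds shared access");
    uint32_t v;
    std::memcpy(&v, g_shared + offset, sizeof(v));
    return v;
}
```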

FaaSM supports persistent state using a two-tier architecture. The bottom tier is a distributed object store that can be read with a key, or written/updated with a key and value. The top tier is a per-process cache residing in the shared part of the Faaslet pages, which keeps recently accessed objects locally. The host interface provides service calls to access objects either transparently (letting the cache serve reads for objects it holds) or directly from the bottom tier. Locking primitives for local and global objects are also provided for proper synchronization. FaaSM does not provide any consistency guarantee beyond what is already provided by the object store; moreover, caching may introduce consistency issues that would not otherwise arise. Function developers should be aware of this, and either live with it or implement stronger semantics themselves.
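A sketch of the two-tier read path described above: consult the per-process cache first, and fall back to the distributed key-value store on a miss. The class and method names are assumptions made for illustration, not FaaSM’s actual API, and the remote store is stubbed with an in-memory map so the example is self-contained.

```cpp
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Stand-in for the bottom tier (the distributed object store).
struct RemoteKVStore {
    std::unordered_map<std::string, Bytes> data;
    std::optional<Bytes> get(const std::string& key) {
        auto it = data.find(key);
        if (it == data.end()) return std::nullopt;
        return it->second;
    }
    void set(const std::string& key, const Bytes& value) { data[key] = value; }
};

// Top tier: a per-process cache consulted before going to the remote store.
class StateTier {
public:
    explicit StateTier(RemoteKVStore& remote) : remote_(remote) {}

    Bytes get(const std::string& key) {
        std::lock_guard<std::mutex> guard(local_lock_);   // local lock only
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;        // cache hit
        auto value = remote_.get(key).value_or(Bytes{});  // miss: go remote
        cache_[key] = value;
        return value;
    }

private:
    RemoteKVStore& remote_;
    std::mutex local_lock_;                       // guards the local tier
    std::unordered_map<std::string, Bytes> cache_;
};
```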

To reduce cold-start latency, snapshots, namely Proto-Faaslets, are taken from initialized Faaslets (developers define how the Faaslets are initialized) and stored on disk for fast revival. Since the memory layout of a Faaslet is very simple, i.e., a single linear array divided into data, stack, and heap, it can easily be tracked and dumped to disk as a local file, and later mmap’ed back (the paper also mentions that snapshots can be stored as global persistent objects). Proto-Faaslets are also used to reset the memory state of a function after it has completed, to prepare for the next invocation. This is particularly helpful for avoiding information leaks when two requests belong to different users.
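A sketch of this snapshot/restore trick (file path and function names are illustrative): dump the initialized linear memory once, then restore it into new Faaslets with a private, copy-on-write file mapping, so pages are only copied when a Faaslet actually writes to them.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Write the initialized linear memory out as the "Proto-Faaslet" image.
void snapshot(const uint8_t* linear_mem, size_t len, const char* path) {
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    ssize_t written = write(fd, linear_mem, len);
    (void)written;
    close(fd);
}

// Restore into a fresh Faaslet: MAP_PRIVATE gives copy-on-write semantics,
// so many Faaslets can be spawned from one snapshot without copying it all.
uint8_t* restore(const char* path, size_t len) {
    int fd = open(path, O_RDONLY);
    auto* mem = static_cast<uint8_t*>(
        mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0));
    close(fd);  // the mapping stays valid after the descriptor is closed
    return mem;
}
```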

FaaSM tracks, as persistent global state in the object store, the set of cached objects (reported by each Faaslet process) and the Proto-Faaslets stored on each worker node, and dispatches requests to worker nodes based on this information. Scheduling decisions are made by distributed schedulers: a central scheduler first dispatches requests in a round-robin manner to local schedulers running on each worker. A local scheduler dispatches a request to a local function instance if the required objects and function code are already stored locally; otherwise, it queries the global information and forwards the request to a peer worker that can fulfill it.
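A hedged sketch of that dispatch decision, with all names assumed for illustration: run the request locally if the function’s code and state are already warm on this worker, otherwise forward it to a peer that the global metadata says can serve it.

```cpp
#include <set>
#include <string>

struct Request { std::string function; };

struct LocalScheduler {
    std::string self;                       // this worker's identifier
    std::set<std::string> warm_functions;   // locally cached code/state

    // Returns the worker that should execute the request.
    std::string dispatch(const Request& req,
                         const std::set<std::string>& peers_with_function) {
        if (warm_functions.count(req.function)) return self;  // warm locally
        if (!peers_with_function.empty())
            return *peers_with_function.begin();  // forward to a warm peer
        warm_functions.insert(req.function);      // otherwise cold-start here
        return self;
    }
};
```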