SOCK: Rapid Task Provisioning with Serverless-Optimized Containers



Paper Title: SOCK: Rapid Task Provisioning with Serverless-Optimized Containers
Link: https://dl.acm.org/doi/10.5555/3277355.3277362
Venue: USENIX ATC 2018
Keyword: Microservice; Serverless; OS; Process Template; SOCK




Highlights:

  1. Compared with general-purpose containers, lean containers for serverless functions can be specialized to the needs of serverless workloads by using lightweight file-system arrangements, a pool of reusable cgroups, and by omitting namespaces that are slow to create and destroy. This optimizes the cold-start performance of containers, but not of language runtimes.

  2. To further reduce the cold-start cost of language runtimes, a cache of Zygote processes can be maintained: a new instance of the containerized process is created by forking an existing cache entry that is already initialized and has (at least part of) the required libraries imported. This is faster than initializing the container and importing the libraries from scratch for every new instance.

  3. Python library usage is heavily skewed towards a small number of popular libraries. These libraries can be cached locally and bind-mounted into the directory of the serverless instance, so that libraries are installed from the local cache rather than downloaded over the Web.

This paper presents SOCK, a customized container framework that reduces the cold-start latency of serverless functions. It is motivated by the observation that both language-runtime initialization, such as library imports, and container initialization incur long startup latency, jointly referred to as the “cold start latency”. The paper addresses this latency with a series of techniques: a customized “lean” container that avoids slow system calls and relaxes isolation requirements that serverless functions do not need, a local cache of commonly used Python libraries, and fork()-based cloning of already-initialized Zygote containers. The paper also presents a series of performance evaluations of Linux system calls related to container virtualization, as well as various engineering details with different performance trade-offs.

To reduce container startup time, the paper argues that general-purpose container implementations, such as Docker, provide complete and maximally strong isolation. Serverless functions, however, do not always require all the features provided by these containers, since they are typically small, perform simple tasks, and do not use many OS features (e.g., static port assignment). The paper makes three important observations on container-related system calls, which we discuss in detail as follows.

First, file-system virtualization can be achieved with bind mounting and chroot() alone. The former is a special form of the mount() system call that redirects accesses to the destination path to the source path. It behaves like a regular mount() in that it still inserts an entry into the mount table. chroot() simply moves the process’s root directory “/” to the path given in its argument. In contrast, union file systems such as AUFS, which share a static image across container instances while storing each instance’s private changes in “layers”, incur significant overhead due to their complexity and copy-on-write behavior.
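The following is a minimal sketch (not the paper’s code) of this mechanism: a bind mount issued through the raw mount(2) system call via ctypes, followed by chroot(). The paths /srv/base-image and /srv/lean-root are hypothetical, and the snippet must run as root on Linux.

```python
# Minimal sketch: file-system virtualization with a bind mount plus chroot().
import ctypes, ctypes.util, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
MS_BIND = 0x1000  # mount(2) flag: create a bind mount instead of a new filesystem

def bind_mount(src: str, dst: str) -> None:
    """Redirect accesses under dst to src (inserts one entry into the mount table)."""
    os.makedirs(dst, exist_ok=True)
    if libc.mount(src.encode(), dst.encode(), None, MS_BIND, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Expose a shared base image under a private root, then move this process's "/" there.
bind_mount("/srv/base-image", "/srv/lean-root")
os.chroot("/srv/lean-root")
os.chdir("/")  # conventional after chroot, so relative lookups start at the new root
```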

Second, containerized processes are typically given their own namespaces, which are private (per-container) views of various system resources, including PIDs, file system paths, network port numbers, and so on. By separating the namespaces of containerized processes, each process observes only its own resources and cannot see those of other processes; in other words, namespaces isolate containerized processes so that each behaves as if it were the only process running in the system. New namespaces can be created with the unshare() system call, which takes flags indicating the resource types. The paper observes that both namespace creation and cleanup incur non-negligible overhead, and that the cost is especially high for certain resource types, such as the IPC and network namespaces, mainly because of global locks and RCU-based cleanup.
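As a rough illustration (not the paper’s benchmark), the sketch below calls unshare(2) via ctypes for several namespace types and times each call; the CLONE_* constants come from <sched.h>, and the script needs root privileges on Linux.

```python
# Sketch: time namespace creation with unshare(2) for different resource types.
import ctypes, ctypes.util, os, time

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

CLONE_NEWNS  = 0x00020000   # mount namespace
CLONE_NEWUTS = 0x04000000   # hostname / domain name
CLONE_NEWIPC = 0x08000000   # System V IPC, POSIX message queues
CLONE_NEWNET = 0x40000000   # network devices, ports, routing tables

def timed_unshare(flags: int) -> float:
    """Create the requested namespaces for the calling process; return latency in seconds."""
    start = time.perf_counter()
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return time.perf_counter() - start

for name, flag in [("uts", CLONE_NEWUTS), ("ipc", CLONE_NEWIPC),
                   ("net", CLONE_NEWNET), ("mnt", CLONE_NEWNS)]:
    pid = os.fork()
    if pid == 0:                       # measure in a child so the namespaces die with it
        print(f"{name}: {timed_unshare(flag) * 1e6:.0f} us")
        os._exit(0)
    os.waitpid(pid, 0)                 # namespace cleanup cost is paid here, at child exit
```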

Lastly, containerized processes also need to be assigned to control groups, or cgroups, which manage CPU, memory, and I/O resources. The standard procedure is to create a new cgroup for every containerized process, add the process to the cgroup, and, on program exit, remove the process from the cgroup and destroy the cgroup itself. The paper observes, however, that the first and last steps are unnecessary, as cgroups can easily be cached in a pool; reusing cached cgroups is observed to be about twice as fast as creating and destroying them for each process.
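The sketch below shows the per-process lifecycle that the paper argues against repeating, expressed against the cgroup v2 filesystem interface (SOCK itself predates widespread cgroup v2 use, so this only illustrates the create/attach/teardown steps, not the paper’s code); the cgroup name is hypothetical and root privileges are assumed.

```python
# Sketch: the naive per-invocation cgroup lifecycle (create, attach, detach, destroy).
import os

CGROUP_ROOT = "/sys/fs/cgroup"   # assumes a unified cgroup v2 hierarchy

def create_cgroup(name: str) -> str:
    path = os.path.join(CGROUP_ROOT, name)
    os.mkdir(path)                           # creating a cgroup is just a mkdir
    return path

def attach(path: str, pid: int) -> None:
    # Writing a PID to cgroup.procs moves that process into the cgroup.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))

def release(path: str, pid: int) -> None:
    # Move the process back to the root cgroup; the group is now empty.
    with open(os.path.join(CGROUP_ROOT, "cgroup.procs"), "w") as f:
        f.write(str(pid))

def destroy(path: str) -> None:
    os.rmdir(path)                           # only legal once the cgroup is empty

cg = create_cgroup("lambda-123")             # hypothetical per-invocation cgroup
attach(cg, os.getpid())
release(cg, os.getpid())
destroy(cg)
```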

In addition to studying Linux system calls, the paper also observes that the majority of Python code crawled from GitHub uses only a small subset of all public libraries. Moreover, most of these libraries make only local changes when installed and can therefore co-exist under the same directory. This observation has two important implications. First, Python library installation can be accelerated by maintaining a local cache of commonly used libraries; because library usage is so skewed, the cache need not be very large to serve most installation requests without contacting an external server and downloading libraries over the Web. Second, it is also possible to pre-import Python libraries into the interpreter to reduce the import cost during a cold start.
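The following is an illustrative sketch of these two implications rather than SOCK’s implementation: packages are installed from a local wheel mirror instead of PyPI, and a short list of popular modules is pre-imported into a long-lived interpreter. The mirror path and the module list are hypothetical.

```python
# Sketch: install from a local package cache, and pre-import popular modules.
import importlib, subprocess, sys

LOCAL_WHEEL_MIRROR = "/srv/pip-mirror"           # pre-populated with popular wheels
POPULAR_MODULES = ["json", "re", "collections"]  # stand-ins for the skewed "head" of usage

def install_locally(package: str) -> None:
    # --no-index forbids contacting PyPI; --find-links serves wheels from local disk.
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "--no-index", "--find-links", LOCAL_WHEEL_MIRROR, package],
        check=True,
    )

def preimport() -> None:
    # Imports executed before any request arrives are taken off the critical path.
    for mod in POPULAR_MODULES:
        importlib.import_module(mod)

preimport()
```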

The paper then presents the overall design of the lean container, which incorporates several techniques to reduce cold-start latency and increase container-creation throughput. First, when the container is initialized, its file-system tree is prepared by stitching four directories into a single root using fast bind mounts: the base system image, the Python library packages, the function source code, and a scratch directory for temporary files. The first three directories are mapped read-only to avoid mutating shared files (recall that bind mounting does not perform CoW; modifications are applied directly to the underlying files and are visible to all processes). The root of the process is then switched to this directory using the cheap chroot() system call. Compared with the conventional container approach of a separate mount namespace plus AUFS, the lean container avoids the expensive copying of the mount table during namespace creation, as well as AUFS and its CoW overhead.
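Below is a hedged sketch of this root assembly (directory names are illustrative, not SOCK’s actual layout): the four directories are stitched together with bind mounts, the shared ones are remounted read-only, and the process then chroots into the result. The base image is assumed to already contain empty mount-point directories, and root privileges are assumed.

```python
# Sketch: assemble a lean-container root from four bind-mounted directories.
import os, subprocess

def bind(src: str, dst: str, read_only: bool = False) -> None:
    subprocess.run(["mount", "--bind", src, dst], check=True)
    if read_only:
        # A plain bind mount ignores "ro"; a remount is needed to apply it.
        subprocess.run(["mount", "-o", "remount,bind,ro", dst], check=True)

root = "/run/lean/instance-0"                 # hypothetical per-instance root
os.makedirs(root, exist_ok=True)

# Assumes the base image already contains empty packages/, handler/, and tmp/ mount points.
bind("/srv/base-image",  root,               read_only=True)   # shared OS base image
bind("/srv/py-packages", f"{root}/packages", read_only=True)   # shared Python libraries
bind("/srv/fn/handler",  f"{root}/handler",  read_only=True)   # function source code
bind("/srv/scratch/0",   f"{root}/tmp")                        # the only writable directory

os.chroot(root)   # cheap root switch; no new mount namespace, no AUFS layers
os.chdir("/")
```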

Second, the container’s management process, which is spawned when the container is created, communicates with the API gateway (OpenLambda) over a Unix domain socket. A domain socket behaves like a regular socket, providing a channel over which data and file descriptors can be transferred (when a file descriptor is passed, access to it is effectively granted from the sender to the receiver). A key advantage of domain sockets is that they are not identified by port numbers; instead, they are addressed by file system paths, which are already unique at the system level, so no virtualization of port numbers (which, when used, are often statically bound) is needed. As a result, the lean container does not need a new network namespace, which has been shown to have poor performance and scalability.
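As a minimal sketch of this path-addressed communication (assumed, not SOCK’s actual protocol), the snippet below creates a Unix domain socket at a hypothetical path inside the scratch directory, which both the host gateway and the containerized management process can reach through the bind mount.

```python
# Sketch: path-addressed IPC over a Unix domain socket (no ports, no network namespace).
import os, socket

SOCK_PATH = "/srv/scratch/0/ol.sock"   # hypothetical path visible to host and container

def serve_once() -> None:
    """Management-process side: accept one request and reply."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(SOCK_PATH)            # addressed by a file path, not a port number
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            print("request:", conn.recv(4096).decode())
            conn.sendall(b"ok")

def send_request(payload: bytes) -> bytes:
    """Gateway side: connect to the same path and exchange one message."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
        cli.connect(SOCK_PATH)
        cli.sendall(payload)
        return cli.recv(4096)
```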

Lastly, as indicated earlier, control groups are not created and destroyed for each containerized process. Instead, the OpenLambda process maintains a pool of initialized cgroups and passes a cgroup over the domain socket to the container’s management process. When the containerized process is about to exit, the cgroup is handed back to the OpenLambda gateway, which returns it to the pool for the next allocation.
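The sketch below shows a cgroup pool following this description (an assumed design, not SOCK’s code): cgroups are created once up front, checked out per invocation, and checked back in instead of being destroyed. The cgroup v2 paths and the queue-based pool are illustrative.

```python
# Sketch: a pool of reusable cgroups checked out per invocation and returned on exit.
import os, queue

CGROUP_ROOT = "/sys/fs/cgroup"   # assumes a unified cgroup v2 hierarchy

class CgroupPool:
    def __init__(self, size: int, prefix: str = "sock-pool"):
        self.free: queue.Queue[str] = queue.Queue()
        for i in range(size):
            path = os.path.join(CGROUP_ROOT, f"{prefix}-{i}")
            os.makedirs(path, exist_ok=True)   # creation cost paid once, not per invocation
            self.free.put(path)

    def checkout(self, pid: int) -> str:
        path = self.free.get()                 # blocks if the pool is exhausted
        with open(os.path.join(path, "cgroup.procs"), "w") as f:
            f.write(str(pid))                  # place the handler process in the cgroup
        return path

    def checkin(self, path: str, pid: int) -> None:
        with open(os.path.join(CGROUP_ROOT, "cgroup.procs"), "w") as f:
            f.write(str(pid))                  # evict the process but keep the cgroup alive
        self.free.put(path)
```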

Independently of the lean-container optimizations above, which mainly reduce containerization cost through more flexible use of system calls and reduced isolation, the paper proposes an additional approach for reducing the runtime cold-start cost: Zygote containers and caching. A Zygote is a containerized process that has already performed library imports and can either execute a function immediately or first import further required libraries. Zygote processes are forked to create new instances of the containerized process, so that each new instance is already initialized; this reduces the cost of initializing a container and a language runtime to roughly the cost of a fork() system call (although in the actual implementation, two fork() calls are required due to “one peculiarity of setns()”). The system always maintains a base Zygote with an initialized container and no imported libraries. The base Zygote is the most general one and can be forked to execute any function after importing that function’s dependencies. In addition to the base Zygote, a lineage of Zygotes with different sets of imported libraries is also maintained. A new cached Zygote is created when a request arrives and no existing Zygote in the cache has the required libraries; in that case, the closest Zygote currently in the cache (i.e., the one with the most required libraries but no library that is not required, so as to avoid importing potentially malicious packages) is forked, and the missing libraries are then imported.
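To make the fork-based provisioning concrete, here is a toy Zygote sketch assumed from the paper’s description: a helper process pre-imports some libraries and then forks once per request, so each child starts with a warm interpreter and those imports already done. The handler path and dependency lists are hypothetical, and SOCK’s real implementation additionally enters namespaces/cgroups and performs a second fork around setns(), which is omitted here.

```python
# Sketch: a pre-imported Zygote that forks a warm child process per request.
import importlib, os

PREIMPORTED = ["json", "re"]            # libraries baked into this particular Zygote

def init_zygote() -> None:
    for mod in PREIMPORTED:
        importlib.import_module(mod)    # import cost paid once, shared by all children

def handle_request(extra_deps, handler_path):
    pid = os.fork()                     # child inherits the already-initialized interpreter
    if pid == 0:
        for mod in extra_deps:          # import only what this function still needs
            importlib.import_module(mod)
        with open(handler_path) as f:
            exec(f.read(), {"__name__": "__main__"})   # run the function's source
        os._exit(0)
    return pid                          # the parent Zygote stays cached for reuse

init_zygote()
# Example usage (hypothetical handler path):
# handle_request(["collections"], "/srv/fn/handler/main.py")
```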