Lelantus: Fine-Granularity Copy-On-Write Operations for Secure Non-Volatile Memories

- Hide Paper Summary

Paper Title: Lelantus: Fine-Granularity Copy-On-Write Operations for Secure Non-Volatile Memories
Link: https://ieeexplore.ieee.org/document/9138980
Year: ISCA 2020
Keyword: NVM; CoW; Virtual Memory

Back

Highlight:

Allowing memory controller to punch cache line sized holes on a COW page to allow read direction to the source page. This enables fine-grained COW, reducing write traffic.
The metadata per-page is just a source page pointer plus per-cache line bit indicating whether the line is a hole On secure NVM systems this can be colocated with encryption counters with very little overhead.
This design creates dependencies between pages even after a seemingly full page copy (on OS perspective). The OS should therefore track such dependencies in order to correctly transfer data ownership between sharer pages when the source page is freed.

Questions

The design is not even remotely related to NVM and security. Essentially it just proposes adding extra per-page metadata bits to track copy status for every COW’ed page.
There are lots of writing inconsistencies and typos in the second half of the paper. For example, on section 4 page 7, left bottom, it is stated “three CoW commands in the memory controller: page copy, page phyc and page init.”, while on the same page, right column, table 2, the three commands become page_copy, page_phyc and page_free. Page_init is in fact never discussed, which I believe should be a typo of page_copy. On page 7, section 4B, it should be “stale data” not “stall data”. On page 8, section 4C, it should be “traverse”, not “transverse”. On page 9, second line of the left column, it should be “rely on”, not “relay on”.
I do not get why the reference counter is used when a page is reclaimed? I thought the ref counter is decremented
when COW happens? That means as long as there is no full sharing of the page as in conventional COW, the value of the ref counter should always be one. We should actually use the sharer list for the source page to determine whether there are outstanding references from other physical pages.

This paper proposes Lelantus, a hardware virtual memory optimization for Copy-On-Write (COW) operations. The paper has observed that COW is a commonly used technique by the OS to delay the actual copy of physical pages when forking a process. The page will remain in read-only mode until the first write by either of the two processes, at which time the page is duplicated, and the write is performed on the new copy. The virtual address mapping for the writing process is also updated to point to the newly copied page. On application level, the paper also identifies several important use cases of COW feature of the OS, including normal fork() system calls, memory allocation where newly allocated pages are mapped to a zero page and lazily populated on the first write, virtual machine deduplication, and application address space checkpointing.

The paper recognizes two performance related drawbacks of performing full COW on the first write. First, the first store operation into such a page will incur much larger performance overhead than the rest of writes, causing performance fluctuation, which makes it harder to predict the performance of write operations. This is particularly bad in time-sensitive applications with QoS guarantees or real-time systems where the predicatability of timing matters even more than the latency of operations. Second, if the page is only written sparsely, especially on systems with huge pages, the write amplification would be high, since most of the lines copied during COW will not be written later, causing excessive but unnecessary write bandwidth, and, even worse, shortened device lifetime if on NVM.

The paper proposes adding per-page and per-cache line metadata to help allieviating the overhead of COW. Fine-grained metadata is added to track two pieces of information for a destination COW page: (1) Whether each cache line in the page has been copied, i.e., whether they still refer to the source page’s read-only data; (2) The address of the source page. Note that under Lelantus’s model, the source and destination pages are no longer of equal status. Instead, after the logical duplication, one page is delegated as the source page, which is responsible for providing read-only data to readers of both pages. The other page is the destination page, which has “holes” punched on it. Read operations to these holes will be forwarded to the associated source page, while write operations will invoke cache line granularity COW.

We next describe the metadata format as follows. Lelantus requires one bit per cache line to track the copy status of the line, which takes 64 bits (8 bytes) per 4KB page. In addition, it also needs one page level pointer to track the source page, the size of which depends on address space size and actual memory size. In total, these two extra metadata fields take no more than 0.4% of total storage, which is negligible. The paper specifically points out that similar metadata overhead already exists for secure NVM systems, where per-cache line and per-page metadata are maintained. For example, under counter-mode encryption, each page has a 64-bit major counter and each cache line has a 7-bit minor counter. For a 4KB page, this configuration allows all metadata for the page to be packed in a 64 byte block, which can be efficiently accessed with only one bus transaction. We do not cover counter-mode encryption here, since the operation of counter-mode encryption is unrelated to Lelantus. The paper suggests that Lelantus can borrow one bit from the 7-bit per-cache line counter to form the 64-bit source page address, resulting in zero metadata overhead, at the cost of higher overflow rate of minor counters (which may, as a consequence, incur more frequent re-encryption of page data). In addition, minor counter value zero is reserved in Lelantus to indicate that the cache line is not currently present in the page, i.e., there is a “hole”, the access of which should fall through to the source page. The counter-mode encryption controller is also modified such that on initialization and when an overflow occurs, all minor counters should be set to one, instead of zero.

Lelantus is built into the memory controller as an accelerator of COW-related page copy. Three commands are provided to the user, which can be invoked by writing the command code and arguments to memory-mapped controller registers. The first command is page_copy, which performs logical COW by providing both the source and destination page numbers as physical addresses. The memory controller, on receiving this request, will initialize the metadata fields of the destination page by setting all minor counters to zero, and setting the source page address to the one in the parameter. The second command is page_phyc, which performs physical copy of lines that are not yet in the destination page from the source page. This command is invoked when the source page is to be freed, in which case all “holes” in the destination page must be filled by physical data transfer from the source to destination. After this command, the destination page is entirely by its own, the content of which no longer depends on another page. The command is executed by first scanning the per-line metadata bits, copying those that are not yet present, setting the counter value to one, and then clearing the source address pointer. The last command is page_free, which is invoked when a destination page is freed, which simply clears all metadata fields.

On a read access, if the request misses the cache, the memory controller will perform address translation by checking the minor counter. Note that on a secure NVM system, the counter read overhead is already present, which does not necessarily increase access bandwidth. If the minor counter of the page is zero, no decryption will happen. Instead, the memory controller reads the source page pointer by concatenating bits from all minor counters, and redirect the access to the source page. Note that since the metadata can be stored in an aligned 64-byte block, which is exactly the size of a bus transaction, this operation is very efficient since it at most reads memory once. To further reduce metadata overhead, a metadata cache can also be added to the memory controller to absort most of the matadata requests. For read operations, the data is just returned to the cache hierarchy. For writes, the memory controller needs to first issue bus read to the source page, return the data, and in the meantime write the data to the destination page as well as updating metadata to close the “hole”. If all lines are closed, the page is fully copied, which will no longer be affected by source page reclamation, as we will see later.

The destination page can no longer depend on the source page, when the latter is freed by the OS. This happens, for example, when a COW page is created by a fork()’ed child process, and the parent process just terminates. In this case, the OS reclaims the source page as part of the cleanup procedure. The paper proposes that when this happens, the OS should check the sharer page list of the reclaimed page, which should be maintained in the kernel data structure as a linked list. The sharer page list is updated whenever a new COW destination page is created by a child process using the COW page’s physical address. When a physical page is reclaimed, the OS simply checks this sharer list. If the list is not empty, then some COW pages are still referring to this page. In this case, the OS should physically perform copy from the page being reclaimed to one of the COW pages, and update the metadata for the rest (if any) such that they now use the former page as the source.

This design may also incur recursive address translation, i.e., when a source page is located, the corresponding cache line on the source page is still missing, suggesting that a second translation should occur. This happens when a forked process itself forks another process, and the latter writes a COW page of its direct parent, creating a second COW page. This phenomenon can recursively repeat itself, resulting in a chain of dependent pages. The memory controller handles this by recursively following the link upstream to the source page that contains the cache line to be read. This requires checking the minor counter’s COW bit for every cache line to be accessed even after the address translation.