Crash Consistency in Encrypted Non-Volatile Main Memory Systems
Link: https://ieeexplore.ieee.org/document/8327018
Venue: HPCA 2018
Keyword: NVM; Encryption; Counter Atomicity
This paper introduces counter-atomicity, a consistency requirement for encrypted NVM environments. It builds on the premise that data stored in NVM devices should be encrypted: because NVM retains its contents across power cycles, data could otherwise be read out even after a system shutdown, rendering memory protection meaningless. The paper leverages counter-mode encryption, in which each cache line is tagged with a counter generated at run time. On every modification of the line, a new counter value is produced by incrementing a hardware counter. A one-time pad is then generated as a function of the counter. Pad generation does not require the cache line data itself, which is one reason counter-mode encryption can be implemented efficiently. After the one-time pad is generated, it is XOR'ed with the contents of the line, which concludes the encryption process. Both the counter and the encrypted cache line are stored in the NVM, in two different address spaces. On an NVM read, both the counter and the data are fetched from their storage locations. The one-time pad is regenerated using the same function as in the encryption phase and then XOR'ed with the cache line data read from the NVM device. Since two XOR operations with the same mask cancel each other out, the original data can be recovered this way. Note that without any optimization, this process increases the latency of read operations, because pad generation and the XOR operation both add to the critical path. To reduce this latency, the paper suggests adding a counter cache, which is accessed on every counter read and update operation. Since the counter cache has low latency, the NVM controller can partially or fully overlap pad generation with the data read from NVM, reducing the extra read latency to a minimum.
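The encrypt/decrypt mechanics above can be sketched in a few lines. This is a toy model, not the paper's hardware design: SHA-256 stands in for the AES-based pad generation a real memory encryption engine would use, and the key/address/counter layout is an illustrative assumption.

```python
# Toy counter-mode encryption for one 64-byte cache line.
# SHA-256 is a stand-in for the hardware AES pad generator (assumption).
import hashlib

LINE_SIZE = 64  # bytes per cache line

def one_time_pad(key: bytes, line_addr: int, counter: int) -> bytes:
    """The pad depends only on key, address, and counter -- not on the
    data -- so it can be computed in parallel with the NVM data read."""
    seed = key + line_addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    pad = b""
    while len(pad) < LINE_SIZE:
        pad += hashlib.sha256(seed + len(pad).to_bytes(2, "little")).digest()
    return pad[:LINE_SIZE]

def encrypt_line(key: bytes, addr: int, counter: int, line: bytes) -> bytes:
    pad = one_time_pad(key, addr, counter)
    return bytes(b ^ p for b, p in zip(line, pad))

decrypt_line = encrypt_line  # XOR with the same pad cancels out

key = b"\x01" * 16
data = b"hello, nvm line!".ljust(LINE_SIZE, b"\x00")
ct = encrypt_line(key, 0x1000, 7, data)
assert decrypt_line(key, 0x1000, 7, ct) == data  # matching counter: recovered
assert decrypt_line(key, 0x1000, 8, ct) != data  # stale counter: garbage
```

The last assertion previews the consistency problem discussed next: decrypting with any counter other than the one used for encryption yields garbage.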
One of the challenges of counter-mode encryption identified by the paper is the consistency between the cache line data written back to the NVM and its counter value. On current hardware platforms, only the atomicity of a single write is guaranteed by the hardware. By contrast, with counter-mode encryption, two NVM writes are generated per store: one for the regular data write, and another for updating the counter map on the NVM. Without proper handling, this can lead to corruption that renders NVM data unreadable. The paper uses the example of inserting a new node at the head of a linked list: if the system crashes between the head-pointer cache line update and the counter update, the correct address of the new node cannot be recovered, because decrypting the new ciphertext with the stale on-NVM counter corrupts the pointer value.
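The crash window can be made concrete with a minimal simulation. This is an assumed model, not the paper's implementation: a one-byte "pad" plays the role of encryption, and `crash_between` models power failing between the two NVM writes.

```python
# Crash simulation: NVM holds ciphertext and a separate counter table.
# A store needs TWO NVM writes; a crash between them leaves the pair
# inconsistent and the line undecryptable.
def pad(counter: int) -> int:
    return (counter * 0x9E) & 0xFF  # toy 1-byte pad (assumption)

nvm_data: dict[int, int] = {}
nvm_counters: dict[int, int] = {}

def persist_store(addr: int, value: int, crash_between: bool = False) -> None:
    c = nvm_counters.get(addr, 0) + 1
    nvm_data[addr] = value ^ pad(c)   # write 1: ciphertext under new counter
    if crash_between:
        return                        # power fails before write 2
    nvm_counters[addr] = c            # write 2: counter table update

def recover(addr: int) -> int:
    return nvm_data[addr] ^ pad(nvm_counters.get(addr, 0))

persist_store(0x40, 0xAB)
assert recover(0x40) == 0xAB                     # both writes landed: OK
persist_store(0x40, 0xCD, crash_between=True)
assert recover(0x40) != 0xCD                     # stale counter: corrupted
```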
The paper proposes two solutions. One straightforward solution is to co-locate the counter and the data so that they form an atomic pair transferred together on the memory bus. To accommodate this change, the memory bus must be widened so that 72 bits (64 + 8) are sent per transfer. This design has two severe problems. First, widening the memory bus is impractical on today's architectures, since it implies adding more pins to the memory chip, which is expensive and sometimes unrealistic. Second, co-location does not solve the read latency problem: to overlap the cache line read with pad generation, the counter must be available early from a separate location, so keeping the counter only alongside the data defeats that optimization while also wasting storage.
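The 72-bit figure follows from the burst layout. The arithmetic below assumes a standard 64-byte cache line moved in eight 64-bit beats; widening each beat to 72 bits lets 8 counter bits ride along per beat.

```python
# Worked numbers for the co-location scheme (assumed burst layout).
LINE_BITS = 64 * 8            # 64-byte cache line
BEAT_BITS = 64                # standard data-bus width per transfer
beats = LINE_BITS // BEAT_BITS
extra_bits_per_beat = 72 - 64 # widened bus: 8 counter bits per beat
counter_bits = beats * extra_bits_per_beat
print(beats, counter_bits)    # 8 beats carry a 64-bit counter alongside
```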
The second proposal is to add a separate queue for counters alongside the write queue that already exists on modern processors. Both queues are backed by residual power on the NVM chip, which allows them to flush their volatile contents to the NVM in case of a power failure. On eviction of a dirty cache line, the newly generated counter and the dirty line are inserted into their respective queues. A "ready" bit is set in both entries once both are present (i.e. the paper suggests that when a counter is inserted, the counter queue controller checks whether the line at the same address is already in the cache line queue; if so, the "ready" bits of both entries are set). On power failure, only entries with the "ready" bit set are flushed, and the rest are discarded. Note that the hardware does not need to insert the counter and the dirty cache line at the same moment. In fact, to achieve better performance, the design explicitly allows some cache line and counter writes to be non-atomic, as we will discuss below.
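The paired-queue protocol above can be sketched as follows. The class and field names are illustrative, not from the paper; the point is the pairing check on insert and the ready-bit filter on the power-failure drain.

```python
# Sketch of the paired write queues with "ready" bits (assumed structure).
class AtomicWriteQueues:
    def __init__(self):
        self.data_q = {}   # addr -> (cache line value, ready bit)
        self.ctr_q = {}    # addr -> (counter value, ready bit)

    def insert_data(self, addr, value):
        ready = addr in self.ctr_q          # counter already queued?
        self.data_q[addr] = (value, ready)
        if ready:                           # mark the pair ready
            self.ctr_q[addr] = (self.ctr_q[addr][0], True)

    def insert_counter(self, addr, ctr):
        ready = addr in self.data_q         # data already queued?
        self.ctr_q[addr] = (ctr, ready)
        if ready:
            self.data_q[addr] = (self.data_q[addr][0], True)

    def flush_on_power_failure(self):
        """Drain only ready pairs; unpaired entries are discarded."""
        flushed = [(a, v, self.ctr_q[a][0])
                   for a, (v, r) in self.data_q.items() if r]
        self.data_q.clear()
        self.ctr_q.clear()
        return flushed

q = AtomicWriteQueues()
q.insert_data(0x100, b"line-A")   # counter not queued yet: not ready
q.insert_counter(0x100, 5)        # pair complete: both marked ready
q.insert_data(0x200, b"line-B")   # counter never arrives
assert q.flush_on_power_failure() == [(0x100, b"line-A", 5)]
```

Discarding the unpaired entry for `0x200` is safe precisely because, as the next paragraph observes, such stores were not required to be counter-atomic in the first place.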
This paper also makes an observation that is critical for achieving high performance with a counter-atomic architecture: not all stores need to be counter-atomic to ensure recoverability. In many use cases, only the store that makes private data public, i.e. the one that affects the recoverability of the data, must be counter-atomic. In the linked list example above, the stores that initialize the new node need not be counter-atomic before the store that updates the head pointer (from that point on, counter-atomicity must be enforced, of course). This allows the hardware to reorder and coalesce stores for better write performance in the window from node creation to the head-pointer update. Another prominent example is logging. In all logging schemes, two copies of the data under modification are maintained, and only one copy is updated during the transaction (for undo logging, we update the data in place; for redo/shadow logging, we update the redo log entry and leave the data untouched). The key observation is that if the system crashes in the middle of a transaction, the copy under update will be discarded, so it does no harm if that part of memory is inconsistent. This presents a great opportunity in logging systems: no counter-atomicity needs to be maintained for cache lines updated during the transaction until the transaction commits, at which point the updated data becomes the master copy.
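The selective-ordering idea on the linked-list insert can be sketched as below. The `counter_atomic` flag is a hypothetical software-visible annotation, not the paper's actual interface; it marks the single publication store whose data/counter pair must persist atomically.

```python
# Sketch: only the publication store is marked counter-atomic (the
# `counter_atomic` flag is an illustrative stand-in for the hardware
# mechanism).
ordered_log = []

def store(addr, val, counter_atomic=False):
    ordered_log.append((addr, val, counter_atomic))

# Stores to the new, still-private node: free to reorder and coalesce,
# since a crash here just loses an unreachable node.
store("node.key", 42)
store("node.next", "old_head")
# Publication point: this store makes the node reachable, so it must be
# counter-atomic.
store("list.head", "node", counter_atomic=True)

# Exactly one store in the window carries the counter-atomicity
# requirement.
assert sum(atomic for _, _, atomic in ordered_log) == 1
```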
As an ISA extension, the paper suggests adding a "flush counter cache" instruction as a counterpart of clflush, which flushes the data cache. At the end of a counter-atomicity window, the programmer should issue both clflush and the counter cache flush instruction in order to correctly persist both states to the NVM.
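A toy model of this end-of-window sequence is below. The name `ctrflush` is this sketch's placeholder for the proposed counter-cache flush instruction (the paper's mnemonic may differ), and the dictionaries stand in for the volatile caches and the NVM.

```python
# Toy model of the two flushes at the end of a counter-atomicity window.
# `ctrflush` is a hypothetical name for the proposed counter-cache flush.
data_cache, counter_cache = {}, {}
nvm_data, nvm_counters = {}, {}

def clflush(addr):
    nvm_data[addr] = data_cache.pop(addr)     # persist the data line

def ctrflush(addr):
    nvm_counters[addr] = counter_cache.pop(addr)  # persist its counter

# During the window: writes land only in the volatile caches.
data_cache[0x40], counter_cache[0x40] = "ciphertext", 9

# End of window: the programmer issues both flushes.
clflush(0x40)
ctrflush(0x40)
assert nvm_data[0x40] == "ciphertext" and nvm_counters[0x40] == 9
```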