Janus: optimizing memory and storage support for non-volatile memory systems
Link: https://dl.acm.org/citation.cfm?doid=3307650.3322206
Conference: ISCA 2019
Keyword: NVM; Encryption; Backend Memory Operation (BMO)
This paper presents Janus, a framework for efficient handling of Backend Memory Operations (BMOs) on Non-Volatile Memory (NVM) controllers. In the introduction, the paper identifies a major bottleneck of NVM operations in addition to loads: stores and their corresponding BMOs. Two factors contribute to this conclusion. First, in order to achieve crash persistence and recoverability, most NVM applications use some form of logging or shadowing. These techniques enforce write ordering between the log and the actual data. For example, in an undo logging scheme, the undo image is generated before the data is written into the cache line, and is persisted to the NVM first, to avoid the unrecoverable case in which a dirty cache line is evicted to the NVM before its corresponding undo image. Write ordering adds latency to store operations and stalls the processor pipeline, putting stores on the critical path. Second, most NVM devices also provide capabilities beyond persistence, such as encryption, data compression, and deduplication. These features require the NVM device to perform computations such as hashing, table lookups, and key generation on the backend (i.e., BMOs) when a cache line is evicted or flushed from the processor, which further adds to the latency of store operations and aggravates the store bottleneck.
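To make the write ordering concrete, the following sketch shows an undo-logging store on x86, where the log entry is written back and fenced before the in-place update; the helper names and record layout are illustrative, not taken from the paper.

```cpp
#include <cstdint>
#include <immintrin.h>  // _mm_clwb, _mm_sfence (x86 intrinsics)

// Hypothetical undo-log record; not the paper's format.
struct UndoRecord {
    uint64_t addr;
    uint64_t old_value;
};

void undo_log_store(uint64_t* nvm_addr, uint64_t new_value, UndoRecord* log_slot) {
    // 1. Capture the undo image (old value) into the log.
    log_slot->addr = reinterpret_cast<uint64_t>(nvm_addr);
    log_slot->old_value = *nvm_addr;

    // 2. Persist the log entry BEFORE touching the data:
    //    write back the log cache line and order it with a fence.
    _mm_clwb(log_slot);
    _mm_sfence();

    // 3. Only now is it safe to update the data in place; even if the
    //    dirty data line is evicted early, the undo image is already durable.
    *nvm_addr = new_value;
    _mm_clwb(nvm_addr);
    _mm_sfence();
}
```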
Janus employs two techniques to reduce the latency of BMOs: parallelization and pre-computation. With parallelization, operations from different tasks can be performed at the same time, as long as they are independent of each other. The paper identifies four types of dependence: intra-operation dependence, inter-operation dependence, address dependence, and data dependence. The first two dictate which operations within and across tasks can be overlapped without affecting the result. In the paper’s example, intra-operation dependence exists between consecutive steps of a single task, while inter-operation dependence is identified by examining the tasks’ workflows. In this example, the table lookup in deduplication is a prerequisite of the XOR step in block encryption, because if a block with the same content already exists on the NVM, no extra write is performed and the encryption is no longer needed. Address and data dependences are input dependences, resolved once the address or the data of the operation becomes available. For example, during data encryption the address is needed at the first step (generating the cipher pad), but the data is only used to produce the final output at the last XOR step. In practice, the processor pipeline computes the address of a store operation before the cache line is acquired via cache coherence. The cache may send the address to the NVM controller as soon as it is available in the address generation unit, and send the data only when the coherence transaction finishes, essentially overlapping the coherence transaction with the first two steps of data encryption. This technique, called pre-computation, helps reduce the latency of store operations. As long as its input dependences are satisfied, a BMO can be executed in the background by the NVM controller, overlapping with in-cache data operations.
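The overlap that pre-computation exploits is easiest to see with counter-mode encryption, where only the final XOR needs the data. Below is a minimal sketch under that assumption; the pad generator is a toy stand-in for AES, and none of the names come from the paper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Block = std::array<uint8_t, 64>;   // one 64-byte cache line

// Stand-in for the AES-based pad generation of counter-mode encryption.
// A real design would run AES over (address, counter) with a secret key;
// this trivial mixer only marks which inputs the step needs.
Block generate_pad(uint64_t line_addr, uint64_t counter, uint64_t key) {
    Block pad{};
    uint64_t x = line_addr ^ (counter * 0x9E3779B97F4A7C15ULL) ^ key;
    for (std::size_t i = 0; i < pad.size(); ++i)
        pad[i] = static_cast<uint8_t>((x >> (8 * (i % 8))) ^ i);
    return pad;
}

// Phase 1 (address-dependent): can start as soon as the store address is
// known, overlapping with the coherence transaction that fetches the line.
Block precompute_otp(uint64_t line_addr, uint64_t counter, uint64_t key) {
    return generate_pad(line_addr, counter, key);
}

// Phase 2 (data-dependent): runs only after the final data arrives; it is a
// single XOR, so little latency remains once the data is ready.
Block encrypt_line(const Block& data, const Block& otp) {
    Block cipher;
    for (std::size_t i = 0; i < data.size(); ++i)
        cipher[i] = data[i] ^ otp[i];
    return cipher;
}
```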
Janus consists of two parts: a hardware component that enforces dependences and stores intermediate results, and a software interface that exposes these capabilities to the programmer. Janus assumes a transactional model, in which all NVM stores are performed as part of a transaction and committed to the NVM device at the end. The paper does not specify how inter- and intra-operation dependences are enforced (we can assume the hardware has hardwired knowledge of these operations) and concentrates on storing intermediate results. In Janus, every store operation is assigned a globally unique identifier consisting of three subfields: the request ID, the thread ID, and the transaction ID. If the store instruction is tagged as “requesting pre-computation”, this identifier, together with the physical address of the store, is sent to and entered into a buffer, called the Intermediate Result Buffer (IRB), in which the intermediate results of BMOs are kept. Although not explicitly stated, the IRB should be co-located with the NVM controller that performs the BMOs. Once a pre-computation finishes, its result is also entered into the IRB, so that the remaining steps of the BMO can resume when the data becomes available. The data can be sent to the IRB as soon as it is acquired by the coherence protocol, instead of having to wait for a cache line eviction or flush. The actual write, however, is only performed when the processor indicates so, to avoid the result being invalidated by later instructions.
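The paper describes the IRB microarchitecturally; as a rough mental model, one entry might carry state along the following lines (field names and widths are my own, not the paper's).

```cpp
#include <cstdint>

// Rough sketch of one Intermediate Result Buffer (IRB) entry.
// Fields and widths are illustrative only.
struct IRBEntry {
    // Globally unique identifier of the pre-computed store.
    uint32_t request_id;
    uint16_t thread_id;
    uint32_t transaction_id;

    uint64_t phys_addr;          // physical address of the store
    uint8_t  intermediate[64];   // partial BMO result (e.g., the encryption pad)

    bool     result_ready;       // pre-computation finished
    bool     data_ready;         // data already forwarded by the cache
    bool     valid;              // cleared when a dependence is violated
};
```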
In addition to the intermediate states of BMOs, the IRB also tracks the data dependences of each BMO and aborts an operation if its dependences change (i.e., the result is invalidated). Dependences can change in two ways. First, if the processor sends the cache line to the IRB before the line is actually evicted or flushed, a later store instruction may further modify the line and change its content. In this case, the pre-computed intermediate result is no longer valid and must be recalculated from the new line. In the second case, no direct data dependency exists between the result and the data; instead, the result may depend on another entry in the IRB or on metadata used by the BMO algorithms. If that other entry is invalidated, or if the metadata is modified by other BMOs, the affected IRB entry must also be invalidated.
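The two invalidation cases can be summarized by the checks below; this is an illustrative software model, not the paper's hardware logic, and the metadata-dependency field is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Minimal illustrative model of IRB invalidation.
struct Entry {
    uint64_t phys_addr;      // address the pre-computed result belongs to
    uint64_t metadata_dep;   // id of BMO metadata the result depends on (0 = none)
    bool     valid;
};

// Case 1: a later store overwrites the line -> the direct data dependence changed.
void on_store(std::vector<Entry>& irb, uint64_t line_addr) {
    for (auto& e : irb)
        if (e.valid && e.phys_addr == line_addr)
            e.valid = false;   // result must be recomputed from the new data
}

// Case 2: another BMO modified shared metadata (e.g., a deduplication table
// entry) -> an indirect dependence changed, even though the data itself did not.
void on_metadata_update(std::vector<Entry>& irb, uint64_t metadata_id) {
    for (auto& e : irb)
        if (e.valid && e.metadata_dep == metadata_id)
            e.valid = false;
}
```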
The software component of Janus provides a class interface. The class’s data members track the unique identifier of the persistent store, as well as its address and data. Programmers are given the choice of immediate or delayed pre-computation. With immediate pre-computation, the request is sent to the NVM controller and the IRB as soon as the programmer indicates so; any later alteration of the address or data causes an invalidation, as described in the previous paragraph. Alternatively, the programmer may issue a delayed request, which is not immediately sent to the NVM controller but is instead buffered in a request queue. Later requests to the same address are coalesced in the queue, which avoids the extra overhead of invalidation. Only when the programmer “fires” the request is it sent to the NVM controller for processing. On a structural hazard in either the request queue or the IRB, the incoming request is silently discarded; this does not affect correctness, only performance.
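To illustrate the immediate vs. delayed choice, here is a hypothetical usage sketch; the class name, method names, and stubbed bodies are invented and only mirror the behavior described above, not the paper's actual interface.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Janus's software interface; the hardware
// interaction is elided in the stubbed method bodies.
class PrecomputedStore {
public:
    PrecomputedStore(uint32_t txn_id, void* addr) : txn_id_(txn_id), addr_(addr) {}

    // Immediate mode: ship the address/data to the NVM controller (IRB) right
    // away; any later change to them invalidates the pre-computed result.
    void precompute_now(const void* data, std::size_t size) { /* send to controller */ }

    // Delayed mode: keep the request in a software queue so that repeated
    // updates to the same address are coalesced instead of invalidated.
    void precompute_later(const void* data, std::size_t size) { /* enqueue */ }
    void fire()   { /* flush queued request(s) to the NVM controller */ }

    // The actual NVM write is performed only when the processor says so,
    // i.e., at transaction commit.
    void commit() { /* persist */ }

private:
    uint32_t txn_id_;
    void*    addr_;
};

int main() {
    uint64_t nvm_field = 0;                       // stands in for a persistent location
    PrecomputedStore st(/*txn_id=*/1, &nvm_field);

    uint64_t value = 42;
    st.precompute_later(&value, sizeof(value));   // buffered, not yet sent
    value = 43;
    st.precompute_later(&value, sizeof(value));   // coalesced with the previous request
    st.fire();                                    // now the controller can pre-compute
    st.commit();                                  // write performed at commit
    return 0;
}
```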