Quartz: A Lightweight Performance Emulator for Persistent Memory Software
Link: https://dl.acm.org/citation.cfm?id=2814806
Venue: Middleware 2015
Keyword: NVM; Simulator
This paper presents Quartz, a non-volatile memory (NVM) performance emulator built on top of existing commodity hardware features. At the time of writing, commercial NVM hardware was not yet available on the market, so NVM researchers had to rely on simulators to evaluate their work. Although it is generally believed that NVM has performance characteristics similar to DRAM, except for lower bandwidth and higher latency, this paper identifies several problems with existing NVM simulators that can make them impractical to use. The first problem is simulation speed. Emulating the internals of NVM hardware (whose details are also unknown) can introduce significant slowdown, which prevents researchers from running large-scale experiments. The second problem is accuracy. If a simulator is to be fast, it likely models only a subset of the desired characteristics, such as latency only or bandwidth only, which may alter application behavior and lead to wrong conclusions.
Compared with previously proposed simulators, Quartz simulates both latency and bandwidth at near-native speed. On one hand, Quartz achieves high simulation throughput by running the application directly on hardware without a virtualization layer, intervening in the execution only at certain points such as thread synchronization. On the other hand, the lack of fine-grained program monitoring does not hurt accuracy, since Quartz relies on existing hardware features, such as the thermal control registers of the DRAM controller, to cap memory bandwidth at a lower value and thereby emulate a slower NVM device.
Quartz assumes that NVM reads and writes are symmetric, i.e., they have the same bandwidth and latency. This assumption was acceptable in the context of this paper, since at the time of writing no evaluation on actual NVM hardware had been published. Quartz also assumes that NVM scales similarly to DRAM, which is also not true on real NVM hardware. As future work, the performance model of Quartz could be extended so that the extra latency is injected based not only on the number of off-chip accesses, but also on the ratio of reads to writes (collected offline) and the number of threads concurrently accessing the device.
We next describe Quartz’s simulation techniques for bandwidth and latency, respectively. To simulate the lower bandwidth of NVM, Quartz leverages the thermal control capability of modern DRAM controllers. Thermal control on these controllers is implemented by capping the total DRAM bandwidth, and the cap is programmable through a few memory-mapped registers. Writing a value into the thermal control register throttles memory bandwidth according to the value written; the evaluation section confirms that the throttled bandwidth grows linearly with that value. The maximum achievable bandwidth can be measured by spawning several threads and letting each of them use non-temporal moves (which bypass the cache) to saturate memory bandwidth, as sketched below. During initialization, Quartz calibrates the relation between the thermal control value and the maximum achievable bandwidth using this technique, and then sets the thermal control register to the value that matches the intended NVM bandwidth.
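The following is a minimal sketch of the bandwidth-saturation measurement described above: several threads issue non-temporal (streaming) stores so that writes bypass the cache and hit the memory controller directly. The thread count, buffer size, and iteration count are illustrative choices, not values from the paper, and the sketch only measures peak bandwidth; programming the thermal control register itself requires privileged, platform-specific access and is not shown.

```c
/* Compile with: gcc -O2 -pthread -msse2 bw.c */
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS   4                        /* illustrative */
#define BUF_BYTES  (256UL * 1024 * 1024)    /* 256 MiB per thread */
#define ITERATIONS 8

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *stream_writer(void *arg) {
    long long *buf = arg;
    size_t n = BUF_BYTES / sizeof(long long);
    for (int it = 0; it < ITERATIONS; it++)
        for (size_t i = 0; i < n; i++)
            _mm_stream_si64(&buf[i], (long long)i);  /* non-temporal store */
    _mm_sfence();                                    /* drain write-combining buffers */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    void *bufs[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        if (posix_memalign(&bufs[t], 64, BUF_BYTES)) return 1;

    double start = now_sec();
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, stream_writer, bufs[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    double elapsed = now_sec() - start;

    double gib = (double)NTHREADS * ITERATIONS * BUF_BYTES / (1 << 30);
    printf("aggregate write bandwidth: %.2f GiB/s\n", gib / elapsed);
    return 0;
}
```

Running this once per candidate thermal control value yields the throttle-value-to-bandwidth curve that the calibration step needs.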
To simulate latency, the easiest approach is to intercept every load and store in the program and add a fixed latency (e.g., the difference between DRAM latency and the assumed NVM latency) to each one. This approach, however, has two problems. First, instrumenting every load and store introduces non-negligible slowdown, which hurts performance and changes program behavior. Second, for most applications the majority of memory operations hit the cache, so most traffic never reaches the NVM at all; adding a fixed latency to every memory access is not only unfair to accesses that hit the cache, but also overestimates the latency of the system. To solve these problems, Quartz leverages hardware performance counters to collect statistics about off-chip memory accesses, and uses epoch-based latency simulation, which divides the execution into epochs and injects the accumulated latency of each epoch only at its end. To do this, Quartz must compute the number of processor cycles actually spent waiting for DRAM to respond, and then scale this value to match the simulated NVM latency by multiplying it by the latency ratio between NVM and DRAM. The hardware performance counters used by Quartz are the number of cycles the processor stalled waiting for the L2 cache, the number of memory operations that hit the L3, and the number that miss the L3. Note that on Xeon platforms, the number of memory operations that reach DRAM (or the number of cycles spent waiting for DRAM) cannot be obtained directly, so it must be derived from several counters. A sketch of the resulting delay calculation follows.
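This sketch shows the per-epoch delay calculation described above under stated assumptions: the stall-cycle count is taken as a plain parameter (in Quartz it is derived from the stall / L3-hit / L3-miss counters), and a timestamp-counter busy wait stands in for the emulator's actual injection mechanism.

```c
#include <stdint.h>
#include <stdio.h>

/* Read the x86 timestamp counter to implement a cycle-granularity busy wait. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Extra cycles to inject at the end of an epoch.
 * stall_cycles : cycles stalled waiting for off-chip memory (from counters).
 * nvm_lat_ns   : target (emulated) NVM access latency.
 * dram_lat_ns  : measured DRAM access latency on the host. */
static uint64_t epoch_delay_cycles(uint64_t stall_cycles,
                                   double nvm_lat_ns, double dram_lat_ns) {
    /* Time already spent waiting for DRAM is scaled by the latency ratio;
     * only the difference still has to be injected. */
    return (uint64_t)((double)stall_cycles * (nvm_lat_ns / dram_lat_ns - 1.0));
}

/* Busy-wait for the computed number of cycles. */
static void inject_delay(uint64_t cycles) {
    uint64_t end = rdtsc() + cycles;
    while (rdtsc() < end)
        ;   /* spin until the injected delay has elapsed */
}

int main(void) {
    /* Example: 1,000,000 DRAM stall cycles, emulating 300 ns NVM on 100 ns
     * DRAM, injects roughly 2,000,000 extra cycles at the epoch boundary. */
    uint64_t extra = epoch_delay_cycles(1000000, 300.0, 100.0);
    printf("injecting %llu extra cycles\n", (unsigned long long)extra);
    inject_delay(extra);
    return 0;
}
```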
Thread synchronization also plays an important role in latency simulation. Without any synchronization, threads can overlap their NVM access latencies perfectly, since they never wait for each other. When synchronization exists, however, a thread waiting for another cannot overlap its NVM access latency with that of the other thread, since its memory operations are serialized after those of the thread it waited for. Consider two threads contending on the same lock. Thread 1 is currently in the critical section and then releases the lock. Thread 2 acquires the lock after thread 1 releases it, completes the critical section, and releases the lock. In the meantime, thread 1 ends its current epoch and begins latency injection. Without epoch synchronization, thread 1 overlaps its simulated NVM access latency with thread 2’s execution of the critical section, whereas in reality, thread 2 could not acquire the lock until thread 1 had finished all NVM accesses made during the critical section. To synchronize latency simulation properly, Quartz instruments thread synchronization primitives and forces threads to end the current epoch and inject the pending latency before entering and exiting a critical section, as sketched below.
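A minimal sketch of this interposition idea follows. The wrapper names and the end_epoch_and_inject() helper are illustrative, not Quartz's actual interface; in a full emulator the helper would read the per-thread counters and perform the delay injection sketched earlier.

```c
#include <pthread.h>

/* Hypothetical helper: would read per-thread counters and inject the
 * pending delay as in the earlier sketch; stubbed here so the file compiles. */
static void end_epoch_and_inject(void) { /* flush pending simulated delay */ }

/* Wrapped lock/unlock: the epoch is closed and its delay injected before
 * the lock is acquired or released, so simulated NVM latency incurred
 * inside a critical section cannot overlap with another thread's turn. */
int quartz_mutex_lock(pthread_mutex_t *m) {
    end_epoch_and_inject();          /* settle our own pending delay first */
    return pthread_mutex_lock(m);
}

int quartz_mutex_unlock(pthread_mutex_t *m) {
    end_epoch_and_inject();          /* charge critical-section delay before
                                        letting the next thread in */
    return pthread_mutex_unlock(m);
}
```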
To simulate systems with both NVM and DRAM installed, Quartz must allow applications to access two kinds of memory with different bandwidths and different hardware counters. This is difficult with a single socket, but easy to achieve on a multi-socket machine, since each socket has its own memory controller and hardware counters. Quartz maps DRAM to the local node and NVM to the remote node (which must otherwise stay idle during the simulation), and configures the local and remote memory controllers separately to set different bandwidth caps. Similarly, by reading the remote socket’s performance counters, Quartz can count memory accesses to the emulated NVM only (local DRAM needs no latency injection, since it runs at native speed).
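A small sketch of this DRAM/NVM split using libnuma is shown below. The node numbers and the alloc_dram/alloc_nvm wrapper names are illustrative assumptions; only the libnuma calls themselves are standard.

```c
/* Compile with: gcc -O2 numa_split.c -lnuma */
#include <numa.h>
#include <stdio.h>

#define LOCAL_NODE  0   /* plays the role of DRAM (native speed)            */
#define REMOTE_NODE 1   /* plays the role of NVM (throttled, counted)       */

void *alloc_dram(size_t bytes) { return numa_alloc_onnode(bytes, LOCAL_NODE); }
void *alloc_nvm(size_t bytes)  { return numa_alloc_onnode(bytes, REMOTE_NODE); }

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this machine\n");
        return 1;
    }
    void *dram = alloc_dram(1 << 20);   /* 1 MiB backed by the local node  */
    void *nvm  = alloc_nvm(1 << 20);    /* 1 MiB backed by the remote node */
    /* ... the application sees two regions with different performance ... */
    numa_free(dram, 1 << 20);
    numa_free(nvm, 1 << 20);
    return 0;
}
```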
Since no commercial NVM was available, Quartz is validated against a NUMA system in which all memory references go to the remote node. Remote NUMA accesses have both higher latency and lower bandwidth, so by running Quartz on a different machine where all memory references are mapped to the local node, and configuring the emulated latency and bandwidth to match remote NUMA accesses, the authors can compare Quartz against the target machine using the same latency and bandwidth benchmarks. Bandwidth is tested using non-temporal loads and stores, which bypass the cache. Latency is tested using a simple pointer-chasing program in which every pointer access is guaranteed to miss the cache. Since fetching the next pointer depends on the value just loaded from memory, there is no memory-level parallelism, which makes the measurement more accurate. Results show that Quartz achieves low simulation error, below 9 percent for all workloads.
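For concreteness, here is a minimal pointer-chasing latency microbenchmark of the kind described above: each load depends on the value returned by the previous one, so there is no memory-level parallelism and the average time per hop approximates the memory access latency. The buffer size and hop count are illustrative and should simply be large enough to defeat the last-level cache.

```c
/* Compile with: gcc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64UL * 1024 * 1024)   /* 64M entries (~512 MiB), far larger than LLC */
#define HOPS  (16UL * 1024 * 1024)

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    if (!next) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so that every hop
     * jumps to an unpredictable cache line. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              /* j < i keeps one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t h = 0; h < HOPS; h++)
        p = next[p];                                /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average latency: %.1f ns per access (checksum %zu)\n", ns / HOPS, p);
    free(next);
    return 0;
}
```

Running this benchmark on the emulated machine and on the remote-only NUMA target, and comparing the reported per-access latencies, is the style of validation the paper describes.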