An Empirical Guide to the Behavior and Use of Scalable Persistent Memory



Paper Title: An Empirical Guide to the Behavior and Use of Scalable Persistent Memory
Link: https://arxiv.org/abs/1908.03583v1
Year: arXiv, Aug 9 2019
Keyword: NVM




This preprint presents several performance characteristics of the newly introduced commercial byte-addressable NVM product, 3D XPoint (Intel's Optane DC Persistent Memory). The observations in this paper are based on actual physical persistent memory hardware, which is most likely implemented with Phase-Change Memory technology. Byte-addressable NVM requires special support from the integrated memory controller (iMC) on the processor chip to ensure the correctness of stores once they reach the controller's write pending queue. The memory controller leverages the Asynchronous DRAM Refresh (ADR) capability to guarantee that even on a power failure, dirty data in the write pending queue can still be flushed back to the NVM; dirty data within the NVM's internal buffers and queues is likewise persisted. Although the memory controller communicates with the NVM device in 64 byte cache-line-sized blocks, the internal page size of the NVM device is 256 bytes due to density requirements and physical space limitations. As we will see later, this page size introduces a write amplification problem in some cases, since a write covering only part of an NVM page is internally performed as a read-modify-write sequence. To reduce the frequency of writes to the non-volatile media, the NVM also provides an internal buffer (called the XPBuffer in the paper) to combine writes. On a power failure, this buffer is made persistent using residual power on the device. The paper also suggests that there are 64 such buffer entries, 256 bytes each, and that both reads and writes allocate an entry.
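The mismatch between the 64 byte transfer size and the 256 byte internal page gives a simple back-of-the-envelope bound on write amplification. The sketch below is mine, not from the paper; it only turns the two sizes described above into best- and worst-case amplification factors:

```c
#include <stdio.h>

/* Back-of-the-envelope write amplification: a 64 B cache-line write
 * that misses the XPBuffer forces a 256 B read-modify-write of the
 * internal page it falls in (sizes as described by the paper). */
int main(void) {
    const double cache_line = 64.0;   /* bytes sent by the iMC per store */
    const double xpline     = 256.0;  /* internal 3D XPoint page size    */

    /* Worst case: random 64 B writes, no combining in the XPBuffer. */
    double worst = xpline / cache_line;          /* = 4x */

    /* Best case: 4 adjacent 64 B writes combine into one 256 B page. */
    double best  = xpline / (4 * cache_line);    /* = 1x */

    printf("random 64 B writes : %.0fx media write amplification\n", worst);
    printf("sequential writes  : %.0fx (fully combined)\n", best);
    return 0;
}
```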

The paper presents the two commonly used methods of persisting data to the NVM device. The first method is to issue a regular cached store, followed by a cache line flush (or write-back) and a memory fence. The fence completes once the flushed line reaches the write pending queue of the memory controller, after which the store is guaranteed to persist. The second method is to use non-temporal stores, which bypass the cache hierarchy. Non-temporal stores are typically write-combined in the processor's write-combining buffers and sent to the memory controller directly, without write-allocating a cache line. Since non-temporal stores are not ordered against other stores, or even against each other, a memory fence must also be issued after them to ensure that later non-temporal stores do not overwrite the current one before it persists. The paper suggests that persisting large amounts of data with non-temporal stores is slightly faster than the flush + fence sequence. This is because non-temporal stores do not allocate a cache entry for the write, and therefore, if write combining fills an entire cache line, the line can be written without first being read. This prevents the NVM device from being saturated prematurely by extra read traffic, which is often useless (e.g., when writing log data that is rarely read again except during crash recovery).
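The two persist paths might look like the following minimal C sketch. The intrinsics (_mm_clwb, _mm_stream_si64, _mm_sfence) are real x86 intrinsics, but the helper names persist_flush and persist_nt are mine, and the code assumes dst points into a persistent-memory mapping (e.g., a DAX-mapped file):

```c
/* Compile with -mclwb on x86-64. */
#include <immintrin.h>
#include <stdint.h>

/* Path 1: regular cached store, then explicit write-back + fence.
 * The store is persistent once it reaches the iMC's write pending
 * queue, which sits inside the ADR domain. */
static void persist_flush(uint64_t *dst, uint64_t val) {
    *dst = val;
    _mm_clwb(dst);   /* write the line back without forcing eviction */
    _mm_sfence();    /* order the write-back before later stores     */
}

/* Path 2: non-temporal store that bypasses the cache hierarchy.
 * A fence is still required: non-temporal stores are weakly
 * ordered, even against each other. */
static void persist_nt(uint64_t *dst, uint64_t val) {
    _mm_stream_si64((long long *)dst, (long long)val);
    _mm_sfence();
}
```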

The paper makes the following observations regarding access latency. First, read latency is significantly higher than write latency on NVM; DRAM, in contrast, has nearly symmetric read and write latencies. Part of the explanation is the underlying memory technology: reading the phase-change media is simply much slower than reading DRAM cells. Even more surprisingly, the paper observes that NVM reads are around 3x slower than NVM writes, while NVM writes and DRAM writes have almost the same latency. This is because writing to the NVM only requires sending data to the memory controller, which buffers the request in the write pending queue, while a read must wait for the device itself to return data. The second observation is that sequential reads are much faster than random reads. This is a natural consequence of the internal buffer, which allocates entries for reads as well: sequential reads are likely to hit the buffer and hence enjoy lower average latency, while random reads are likely to miss it and pay the cost of accessing the non-volatile media. The third observation concerns latency variance. In general, 3D XPoint exhibits extremely low write latency variance, but there are outlier writes with roughly 100x higher latency than normal. The paper suspects these are caused either by wear leveling or by thermal control.
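Latency measurements of this kind are usually done with a pointer chase, where every load depends on the previous one so the memory's true latency cannot be hidden by parallelism or prefetching. Below is a minimal, self-contained sketch of that idea (my own, not the paper's harness); it allocates ordinary heap memory, so it measures DRAM unless buf is instead mapped from a persistent-memory (DAX) file:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M slots * 8 B = 128 MB, well above typical LLCs */

/* Small xorshift PRNG so we do not depend on RAND_MAX. */
static uint64_t rng = 0x9e3779b97f4a7c15ull;
static uint64_t xorshift(void) {
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return rng;
}

int main(void) {
    uint64_t *buf = malloc((uint64_t)N * sizeof *buf);
    /* Sattolo's algorithm: build a single random cycle so the chase
     * visits every slot and hardware prefetchers cannot predict it. */
    for (uint64_t i = 0; i < N; i++) buf[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t j = xorshift() % i;
        uint64_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t idx = 0;
    for (uint64_t i = 0; i < N; i++) idx = buf[idx];  /* dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (checksum %llu)\n", ns / N,
           (unsigned long long)idx);
    free(buf);
    return 0;
}
```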

The paper also presents experimental results on bandwidth. The first observation is that 3D XPoint NVM has lower read and write bandwidth than DRAM, and the gap between read and write bandwidth on NVM is larger than on DRAM. The second observation is that NVM bandwidth does not scale as the number of threads concurrently accessing the device increases. The paper suggests in a later section that this is caused by contention at two points: the memory controller and the device's internal buffer. Since reads allocate buffer entries in the NVM, as the number of independent reading threads increases, locality drops and more recent reads evict earlier ones, which then require fetching data from the non-volatile media, reducing overall read bandwidth. In fact, once the number of reading threads exceeds about ten, read bandwidth starts to decrease. Writes have a similar contention issue, but on the memory controller side: as the number of independent writes arriving at the controller increases, the chance that writes can be combined decreases. Furthermore, since NVM drains the write pending queue at a much lower rate than DRAM, especially when locality across multiple threads is low, queuing effects inflate write latency as the number of threads increases.
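The shape of this scaling experiment can be reproduced with a sketch like the one below: each thread streams stores over a private region, and the program reports aggregate bandwidth per thread count. This is a simplification of the paper's methodology and mine alone; the buffer here is plain DRAM, so to exercise an Optane DIMM the region would have to come from a DAX mapping instead of malloc. Compile with -pthread:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION (64u << 20)  /* 64 MB per thread */

static void *writer(void *arg) {
    memset(arg, 0xab, REGION);   /* sequential stores over the region */
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 64) nthreads = 64;
    pthread_t tid[64];
    char *mem = malloc((size_t)nthreads * REGION);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, writer, mem + (size_t)i * REGION);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads: %.2f GB/s aggregate\n", nthreads,
           (double)nthreads * REGION / s / 1e9);
    free(mem);
    return 0;
}
```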

The paper also conducts experiments on multi-threaded read and write bandwidth using the optimal thread counts from the previous experiment, while varying the access size. The first observation is that bandwidth keeps growing until the access size reaches 256 bytes, after which it plateaus. This can be explained by the internal buffer: below 256 bytes, which is exactly the internal page size, each access (assuming a random workload) still pays the fixed cost of opening a page and loading it into the buffer, regardless of how many bytes are actually accessed. In other words, accessing X bytes carries a fixed cost independent of X, so bandwidth improves as X grows and the cost is amortized. The second observation is that when devices are interleaved (at 4 KB granularity in the paper), write performance degrades as the access size approaches the interleaving size, producing a "dip" in the bandwidth graph. As the access size approaches the stride, the probability that a write is split across two DIMMs and their memory controllers increases; when the access size equals the stride, every write spans two devices except those aligned exactly on a stride boundary. This effect increases contention on the memory controller side, which has limited queue capacity, since one write request now generates two or more controller requests, reducing overall throughput.
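The boundary-crossing argument can be made concrete with a little arithmetic. Assuming a 4 KB interleaving granule and accesses starting at a uniformly random cache-line-aligned offset (my simplifying assumption, not a measurement from the paper), the fraction of accesses that span two DIMMs grows with access size as follows:

```c
#include <stdio.h>

/* Probability that a cache-line-aligned access of `size` bytes crosses
 * a 4 KB interleaving boundary and therefore touches two DIMMs. */
int main(void) {
    const int granule = 4096, line = 64;
    for (int size = 64; size <= 4096; size *= 2) {
        int starts   = granule / line;              /* equally likely starts    */
        int noncross = (granule - size) / line + 1; /* starts inside one granule */
        printf("%5d B access: %5.1f%% cross the boundary\n",
               size, 100.0 * (starts - noncross) / starts);
    }
    return 0;  /* at size == 4096, only the aligned start (1/64) stays local */
}
```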

The paper also offers several pieces of advice on NVM programming based on the above observations. The first is that programmers should avoid small, random writes to the NVM: the internal buffer cannot handle them well, and they cause write amplification. The paper proposes the Effective Write Ratio (EWR) as a metric for the degree of write locality. EWR is the number of bytes of writes issued by the memory controller divided by the number of bytes physically written to the NVM's non-volatile media (collected from the device's performance counters). EWR is an inverse indicator of write amplification: the smaller the EWR, the larger the amplification. For sequential writes, the paper reports an EWR near 1.0 (it can even exceed 1.0 when writes are combined in the buffer). For small writes with good locality, the EWR is also moderately high, illustrating the effectiveness of the internal buffer. However, when the distance between two highly related writes (e.g., the first writes the first 128 bytes of a 256 byte aligned region and the second writes the remaining 128 bytes after some number of unrelated writes in between) exceeds 64, EWR becomes significantly worse, which subtly suggests that there are about 64 internal buffer entries in the device.
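Computing EWR from the two counts is a one-liner; the sketch below uses made-up counter values, since the real numbers come from the DIMM's hardware performance counters:

```c
#include <stdio.h>

/* Effective Write Ratio: bytes of writes issued by the memory
 * controller divided by bytes physically written to the 3D XPoint
 * media.  The values below are hypothetical, for illustration only. */
int main(void) {
    double imc_bytes   = 1.0e9;   /* hypothetical: writes issued by the iMC */
    double media_bytes = 2.5e9;   /* hypothetical: writes hitting the media */

    double ewr = imc_bytes / media_bytes;   /* EWR < 1.0 means amplification */
    printf("EWR = %.2f (write amplification = %.2fx)\n", ewr, 1.0 / ewr);
    return 0;
}
```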

The second piece of advice is to use non-temporal stores for larger writes. As discussed earlier, non-temporal stores save read traffic by not allocating a cache line before writing it, provided the write is large enough to "blind write" an entire line on the underlying hardware (otherwise the read is still needed). On a device with a 256 byte internal page, any write of 512 bytes or more is guaranteed to fully cover at least one 256 byte page, which can then be blind-written without a prior read. For smaller writes, the paper suggests flushing or writing back cache lines as soon as the writes finish, since this achieves better locality than flushing everything at the end and letting the cache evict lines in an unpredictable order. This is especially true when the working set exceeds the cache size, in which case part of the working set will inevitably be evicted by the cache controller, lowering the EWR.
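A flush-as-you-go loop for small cached writes might look like the sketch below. The intrinsics are real, but write_and_flush is a hypothetical helper of mine, and the code assumes dst is cache-line aligned and points into persistent memory:

```c
/* Compile with -mclwb on x86-64. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write each cache line back immediately so dirty lines reach the
 * DIMM close together in time, where they are more likely to combine
 * in the XPBuffer than lines evicted later at the cache's whim. */
static void write_and_flush(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t off = 0; off < n; off += 64) {
        size_t chunk = (n - off < 64) ? n - off : 64;
        memcpy(dst + off, src + off, chunk);  /* regular cached stores   */
        _mm_clwb(dst + off);                  /* write this line back now */
    }
    _mm_sfence();   /* order all write-backs before later persists */
}
```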

The third piece of advice is to limit the degree of concurrency on both the device and the memory controller. Because NVM devices are sensitive to workload locality, the effectiveness of the internal buffer drops as the number of threads increases: independent threads tend to access addresses with little mutual locality, causing thrashing in the buffer. In addition, when the access size is close to the stride of interleaved NVM devices, a single request is likely to be split across two or more controllers, doubling the traffic each request imposes on the controllers and reducing their effective bandwidth.